Recommendation Systems for Amazon Books¶

In [1]:
# !pip install -q scikit-surprise

Overview:¶

In this project, you will build recommender systems using different approaches.

We will be using the Amazon Review data: https://nijianmo.github.io/amazon/index.html

Note this is a fairly large dataset. We will use the Book reviews from the "Small" subsets for experimentation, specifically the ratings-only dataset (51,311,621 ratings) listed in the last section of that page. However, you are required to read through the entire document before proceeding.

This is a fairly large dataset; even though we reduce the data size at various stages, you should write your program with computational cost (both spatial and temporal) in mind. 5% of the grade goes to efficiency in programming.

  • In general, delete a Python object as soon as you no longer need it (automatic garbage collection will reclaim it eventually, but probably not in time):
del some_data_frame
  • In general, a vectorized function runs much more efficiently than a for loop:
    # slow:
    total = 0
    for data in dataframe["grades"]:
        if data > 5: total += data
    print(total)
    # much faster:
    print(
        dataframe["grades"][dataframe["grades"] > 5].sum()
    )
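Combining both tips, a minimal sketch (the dataframe and its "grades" column here are hypothetical placeholders):

    import gc
    import pandas as pd

    dataframe = pd.DataFrame({"grades": range(11)})             # hypothetical example data
    print(dataframe["grades"][dataframe["grades"] > 5].sum())   # vectorized, no Python loop

    del dataframe   # drop the reference as soon as it is no longer needed
    gc.collect()    # optionally force collection right away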

Task one: Data ingest (10pts)¶

Please download the dataset first. If you expect to work on this dataset over a period of several days (which is expected), you can mount your Google Drive by clicking the folder icon and then the Google Drive icon (the third one at the top after clicking the folder icon) in the left menu of Colab; your drive will be mounted as "/content/drive/MyDrive". Now you can use the Linux copy command (reference a cheatsheet or search for it) to copy the file to your drive, and in later Colab sessions just copy it back. This dataset contains book reviews from Amazon, with the columns below:

  • User ID
  • Book ID
  • Rating (1 to 5)
  • Date of the rating (Unix timestamp; google how to convert it to a datetime object. For validation, 1123804800 should convert to 2005-08-12 00:00:00; see the quick check after this list)
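A quick check of the stated conversion (a sketch using pandas, which the notebook imports later; pass unit='s' because the value is in seconds):

    import pandas as pd
    print(pd.to_datetime(1123804800, unit='s'))  # 2005-08-12 00:00:00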

Because of the large data size, you want to create a small dataset using the first 10,000 lines (reference the spark/docker tutorial and project guide for how to do this), and play with this file until your code runs OK; then switch to the full-size dataset. Let's assume the small dataset is books10m.csv
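If you are not going the spark/docker route, a plain-Python sketch that grabs the first 10,000 lines works as well (the file names here are placeholders for your own paths):

    from itertools import islice

    with open('Books.csv', 'r') as src, open('books_head10k.csv', 'w') as dst:
        dst.writelines(islice(src, 10_000))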

In [2]:
import pandas as pd
import gc
from datetime import datetime
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
import os
import io
import random
import sqlite3
import warnings
from scipy.sparse import csr_matrix
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import cross_validate
from surprise.model_selection import KFold

os.chdir(r"D:\MyDrive2\pythonprojects\class2\DataMining\homework\final_project")
In [3]:
df_book = pd.read_csv(r"E:\PythonProjects\datamining\final_project\data\books.csv")
In [4]:
df_book = df_book.rename(columns={
    df_book.columns[0]: "user_id",
    df_book.columns[1]: "book_id",
    df_book.columns[2]: "rating",
    df_book.columns[3]: "date"
})
In [5]:
df_book['date'] = pd.to_datetime(df_book['date'], unit='s')
In [6]:
df_book.head()
Out[6]:
user_id book_id rating date
0 0001713353 A1REUF3A1YCPHM 5.0 2005-03-30
1 0001713353 A1YRBRK2XM5D5 5.0 2004-04-04
2 0001713353 A1V8ZR5P78P4ZU 5.0 2004-02-21
3 0001713353 A2ZB06582NXCIV 5.0 2016-10-03
4 0001713353 ACPQVNRD3Z09X 5.0 2016-07-29
In [7]:
df_book.iloc[0, :]
Out[7]:
user_id             0001713353
book_id         A1REUF3A1YCPHM
rating                     5.0
date       2005-03-30 00:00:00
Name: 0, dtype: object
In [8]:
df_book.dtypes, df_book.shape
Out[8]:
(user_id            object
 book_id            object
 rating            float64
 date       datetime64[ns]
 dtype: object,
 (51311620, 4))

Tried several ways to load the whole json file into memory.¶

In [9]:
# import gzip
# import shutil
#
# # Path to your .gz file
# input_path = r"E:\PythonProjects\datamining\final_project\data\Books_5.json.gz"
# # Path for the output file (remove .gz extension)
# output_path = r"E:\PythonProjects\datamining\final_project\data\Books_5.json"
#
# with gzip.open(input_path, 'rb') as f_in:
#     with open(output_path, 'wb') as f_out:
#         shutil.copyfileobj(f_in, f_out)
In [10]:
# df_rating = pd.read_json(r"E:\PythonProjects\datamining\final_project\data\Books_5.json.gz", compression='gzip',lines=True)
# df_rating = pd.read_json(r"E:\PythonProjects\datamining\final_project\data\Books_5.json", lines=True)
In [11]:
# import json
# import pprint
#
# with open(r"E:\PythonProjects\datamining\final_project\data\Books_5.json", 'r', encoding="utf-8",errors="ignore") as f:
#     lines = f.readlines()
#     print(f"Total lines: {len(lines)}")
#     for line in lines:
#         print(line)
#         jline = json.loads(line)
#         pprint.pprint(jline)
#         break
In [12]:
# chunks = []
# chunksize = 100000  # Adjust based on your available memory
# for chunk in pd.read_json(r"E:\PythonProjects\datamining\final_project\data\Books_5.json",
#                          lines=True,
#                          chunksize=chunksize):
#     chunks.append(chunk)
#
# df_rating = pd.concat(chunks, ignore_index=True)
In [13]:
# import numpy as np
# from scipy.sparse import csr_matrix
# import pandas as pd
#
# # Read the JSON data in chunks to handle large file size
# chunks = []
# chunksize = 100000
# for chunk in pd.read_json(r"E:\PythonProjects\datamining\final_project\data\Books_5.json",
#                          lines=True,
#                          chunksize=chunksize):
#     chunks.append(chunk)
#
# # Combine all chunks into one dataframe
# df_rating = pd.concat(chunks, ignore_index=True)
#
# # Create user and item mappings to convert IDs to indices
# unique_users = df_rating['reviewerID'].unique()
# unique_books = df_rating['asin'].unique()
# user_to_idx = {user: idx for idx, user in enumerate(unique_users)}
# book_to_idx = {book: idx for idx, book in enumerate(unique_books)}
#
# # Convert user and item IDs to indices
# user_indices = [user_to_idx[user] for user in df_rating['reviewerID']]
# book_indices = [book_to_idx[book] for book in df_rating['asin']]
# ratings = df_rating['overall'].values
#
# # Create the sparse CSR matrix
# rating_matrix = csr_matrix((ratings, (user_indices, book_indices)),
#                           shape=(len(unique_users), len(unique_books)))

Task Two: Reservoir Sampling (this involves some algorithms we have not covered; if you feel confused, you can temporarily skip this step until the relevant lecture) (15pts)¶

Write a function Reservoir_Sampling:

def Reservoir_Sampling(infile_name, outfile_name, k):
    # your code

It takes the input file name infile_name, which in this case is the "Books.csv" you downloaded, and samples k lines from it. We use the input file to mimic streaming data by reading it line by line: you want to sample uniformly from the stream, which is effectively infinitely long, and you want to make sure that at any moment, for the n data points received so far, you keep k points that are uniformly sampled from those n points. Note that n keeps growing over time without end.

We will be implementing an algorithm called Reservoir Sampling to do this, which works as follows:

  • if n<=k, we keep all incoming data points from the stream in a list KEEP;
  • when the $n^{th}$ data point arrives and n>k, roll a dice that gives the new data point a probability of k/n of being appended to KEEP;
    • if the $n^{th}$ data point is to be added to KEEP and doing so would make len(KEEP)==k+1, roll another dice to randomly pick a point from KEEP (before appending) to throw away; each data point in KEEP has equal probability of being discarded.

The idea is to read the input file line by line and use the Reservoir Sampling algorithm to keep k lines. When all lines from the input file are processed, write KEEP to the output file outfile_name. You will be building recommender systems using this randomly sampled file.

This project asks for k=10,000,000 (10 million; you can use a smaller number before submission):

The following code should create the sampled file 'books10m.csv' with 10 million lines.

Reservoir_Sampling('Books.csv', 'books10m.csv', 10000000)

If you don't know how to read a text file line by line, consult a Python cheatsheet or search for it.
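For reference, the standard pattern for streaming a text file line by line without loading it all into memory (handle_line is a hypothetical placeholder for your per-line processing):

    with open('Books.csv', 'r') as f:
        for line in f:         # reads one line at a time, O(1) memory
            handle_line(line)  # hypothetical per-line processing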

In [14]:
# Reset IPython history
# %reset -f


def Reservoir_Sampling(infile_name, outfile_name, k):
    # Increase default buffer size for better performance with large files
    buffer_size = 8 * 1024 * 1024  # 8MB buffer

    with io.open(infile_name, 'rb', buffering=buffer_size) as infile, \
            io.open(outfile_name, 'wb', buffering=buffer_size) as outfile:

        KEEP = []  # Reservoir storage
        n = 0  # Counter for total lines processed

        # Process input file line by line
        for line in infile:
            n += 1

            if n <= k:
                # Fill reservoir until k elements
                KEEP.append(line)
            else:
                # Keep the n-th line with probability k/n: draw a uniform index
                # in [0, n); if it falls inside the reservoir, overwrite that
                # (uniformly chosen) entry with the new line
                j = random.randint(0, n - 1)
                if j < k:
                    KEEP[j] = line

        # Write all kept lines to output file
        outfile.writelines(KEEP)
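A quick optional sanity check of the sampler on a tiny synthetic stream (a sketch; it writes temporary files and just verifies that every line is kept with roughly equal frequency):

    from collections import Counter
    import tempfile

    tmp_in = tempfile.NamedTemporaryFile('w', delete=False, suffix='.csv')
    tmp_in.write(''.join(f"{i}\n" for i in range(100)))
    tmp_in.close()

    counts = Counter()
    tmp_out = tmp_in.name + '.out'
    for _ in range(2000):
        Reservoir_Sampling(tmp_in.name, tmp_out, 5)
        with open(tmp_out) as f:
            counts.update(line.strip() for line in f)

    # each of the 100 lines should be kept about 2000 * 5 / 100 = 100 times
    print(min(counts.values()), max(counts.values()))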

Run the code below after your function is done¶

You should be able to generate a new file 'books10m.csv' after running the code below; it samples from "Books.csv" and has 10 million lines.

In [15]:
# DO NOT MODIFY THIS CELL
Reservoir_Sampling(r'E:\PythonProjects\datamining\final_project\data\Books.csv',
                   r'E:\PythonProjects\datamining\final_project\data\books10m.csv', 10000000)
In [16]:
# Save an even smaller file books10k.csv for quick tests
Reservoir_Sampling(r'E:\PythonProjects\datamining\final_project\data\Books.csv',
                   r'E:\PythonProjects\datamining\final_project\data\books10k.csv', 10000)
In [17]:
df_book = pd.read_csv(r"E:\PythonProjects\datamining\final_project\data\books10m.csv", usecols=[0, 1, 2, 3])
df_book = df_book.rename(columns={
    df_book.columns[0]: "user_id",
    df_book.columns[1]: "book_id",
    df_book.columns[2]: "rating",
    df_book.columns[3]: "date"
})
df_book['date'] = pd.to_datetime(df_book['date'], unit='s')
df_book.shape
Out[17]:
(9999999, 4)

Task Three: Data Exploration (15pts)¶

Please explore your data with appropriate plots and tables/printouts: total counts, value/count distributions of user-item interactions, time dependency of the statistics, etc. (anything that gives good insight into the data and is helpful for the predictions).

In [18]:
df_book
Out[18]:
user_id book_id rating date
0 0804138656 A5SPCIGNLRUH9 5.0 2015-05-23
1 1477830138 A39CRGVP6S5IGR 5.0 2015-09-15
2 0932194540 A2C13HP7R5RIJ9 5.0 2010-12-21
3 038531972X A34VFF6N5QMPUC 5.0 2018-01-11
4 0553807404 A79SSJPXFXUH3 5.0 2014-07-02
... ... ... ... ...
9999994 193590437X A2ZIT1GZZR980D 5.0 2012-10-29
9999995 0800721225 AO6NHS4MBHSZ9 2.0 2013-01-20
9999996 1456337688 A2UTBYYF2SJFXN 5.0 2011-03-17
9999997 1975926773 ADQFI9RB4T1VL 5.0 2018-03-25
9999998 1593639589 A3MW3GO7LPQ39J 5.0 2014-07-22

9999999 rows × 4 columns

In [19]:
# Your code
total_rating_number = len(df_book)
total_user_number = len(df_book.user_id.unique())
total_book_number = len(df_book.book_id.unique())
global_average_rating = df_book.rating.mean()
user_average_rating = df_book.groupby('user_id')['rating'].mean()
book_average_rating = df_book.groupby('book_id')['rating'].mean()
matrix_density = total_rating_number / (total_user_number * total_book_number)
In [20]:
print(f"Total Rating Number: {total_rating_number}")
print(f"Total User Number: {total_user_number}")
print(f"Total Book Number: {total_book_number}")
print(f"Global Average Rating: {global_average_rating:.2f}")
print(f"Matrix Density: {matrix_density:.4%}")
Total Rating Number: 9999999
Total User Number: 1588262
Total Book Number: 5345280
Global Average Rating: 4.39
Matrix Density: 0.0001%
In [21]:
df_book.rating.value_counts()
Out[21]:
rating
5.0    6620794
4.0    1863196
3.0     747807
1.0     406772
2.0     361429
0.0          1
Name: count, dtype: int64
In [22]:
df_book.date.describe()
Out[22]:
count                          9999999
mean     2014-04-29 17:11:00.764587264
min                1996-05-31 00:00:00
25%                2013-05-19 00:00:00
50%                2015-02-12 00:00:00
75%                2016-08-25 00:00:00
max                2018-10-01 00:00:00
Name: date, dtype: object
In [23]:
df_book['year'] = df_book['date'].dt.year
years = sorted(df_book['year'].unique())
In [24]:
plt.figure(figsize=(18, 5))
for idx, year in enumerate(years):
    year_data = df_book[df_book['year'] == year]['rating']

    # Overlay a KDE of the rating distribution for each year
    sns.kdeplot(data=year_data,
                fill=False,
                alpha=0.5,
                linewidth=1.5,
                bw_adjust=1,
                label=f'Year {year}')
plt.legend()
Out[24]:
<matplotlib.legend.Legend at 0x1d73204b0b0>
[Figure: overlaid KDE curves of the rating distribution, one per year]

Task Four: Trimming your data (15pts)¶

Since the dataset is pretty sparse, please trim down the size of the data by:

  • removing data before '2010-01-01 00:00:00', to focus on more recent data

  • removing books that received less than or equal to 2 user ratings

  • removing users that reviewed less than or equal to m books, where m is the 80% quantile of the per-user rating counts (i.e., the variable to take the quantile over is the total count of ratings by each user) -- we keep only the top 20% of users who are active in rating books.

Combine the above steps into a function trim(book_data_frame) which takes an input dataframe as its argument and returns the trimmed dataframe.

Call trim() to trim your dataset.

Print out the statistics before and after trimming: total users, total books, and the dataframe shape.
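One way to get m directly as a quantile (a sketch; equivalent in spirit to keeping the top 20% most active users, as the sort-based filter below does):

    user_cnt = df_book.groupby('user_id')['rating'].count()
    m = user_cnt.quantile(0.8)                    # 80% quantile of per-user rating counts
    active_users = user_cnt[user_cnt > m].index   # users with more than m ratings
    df_active = df_book[df_book['user_id'].isin(active_users)]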

In [25]:
# filter by time
df_book = df_book[df_book['date'] > '2010-01-01 00:00:00']
df_book.shape
Out[25]:
(8959639, 5)
In [26]:
#filter by rate count
df_book_cnt = df_book.groupby('book_id')['rating'].count()
df_book_cnt = df_book_cnt[df_book_cnt > 2]
df_book_cnt.shape
Out[26]:
(644976,)
In [27]:
#filter by user rate count
df_user_cnt = df_book.groupby('user_id')['rating'].count()
df_user_cnt = df_user_cnt.sort_values(ascending=False)[:int(len(df_user_cnt) * 0.2)]
df_user_cnt.shape
Out[27]:
(280940,)
In [28]:
def trim(book_data_frame: pd.DataFrame) -> pd.DataFrame:
    # filter by time
    # df_book.iloc[0,:]['date'].timestamp()
    # dt = datetime.strptime('2010-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')
    # timestamp = int(dt.timestamp())
    # print(timestamp)
    # df_date = book_data_frame[book_data_frame['date'].apply(lambda x: x.timestamp()) > timestamp]
    df_date = book_data_frame[book_data_frame['date'] > '2010-01-01 00:00:00']

    #filter by rate count
    df_book_cnt = df_date.groupby('book_id')['rating'].count()
    df_book_cnt = df_book_cnt[df_book_cnt > 2]

    #filter by user rate count
    df_user_cnt = df_date.groupby('user_id')['rating'].count()
    df_user_cnt = df_user_cnt.sort_values(ascending=False)[:int(len(df_user_cnt) * 0.2)]

    # apply the user and book filters to the date-filtered frame
    result = df_date[df_date['user_id'].isin(df_user_cnt.index)]
    result = result[result['book_id'].isin(df_book_cnt.index)]
    return result
In [29]:
df_book = trim(df_book)
In [30]:
total_rating_number = len(df_book)
total_user_number = len(df_book.user_id.unique())
total_book_number = len(df_book.book_id.unique())
global_average_rating = df_book.rating.mean()
user_average_rating = df_book.groupby('user_id')['rating'].mean()
book_average_rating = df_book.groupby('book_id')['rating'].mean()
matrix_density = total_rating_number / (total_user_number * total_book_number)
In [31]:
print(f"Total Rating Number: {total_rating_number}")
print(f"Total User Number: {total_user_number}")
print(f"Total Book Number: {total_book_number}")
print(f"Global Average Rating: {global_average_rating:.2f}")
print(f"Matrix Density: {matrix_density:.4%}")
Total Rating Number: 3288353
Total User Number: 268132
Total Book Number: 629094
Global Average Rating: 4.39
Matrix Density: 0.0019%
In [32]:
# Save the checkpoint
# Comment to avoid overwriting
# pickle.dump(df_book, open(r"E:\PythonProjects\datamining\final_project\data\book_trimed.pkl", "wb"))

Task Five: fit the model using the data. (20pts)¶

You should prepare your data for the surprise library. Check the documentation here for how to read data from a pandas data frame or from a text file; you need to make sure your data frame or text file follows the required format shown in the example code.

Then you should run the code below; it takes less than 30 minutes to finish for the 'books10m.csv' dataset after trimming. Basically it fits the data with a latent factor decomposition (SVD) that factorizes the user-item matrix into latent factor matrices P and Q. The dimension (number of latent features) of P and Q is specified by n_factors, which defaults to 100. See ?SVD for more options and details. The code chunk runs SVD 3 times (3-fold cross-validation), each time training on the remaining folds and testing on the held-out fold, and the RMS error is printed out for each of the 3 runs.

Read documentation for parameters in SVD(): https://surprise.readthedocs.io/en/stable/matrix_factorization.html?highlight=lr_bu#surprise.prediction_algorithms.matrix_factorization.SVD.bu

  • You should run the code with the default call to SVD(), as well as with n_factors=50 and 150, combined with lr_all=.01 and 0.1 and reg_all=.01 and 0.1 (make sure you understand what these parameters mean, and relate them to the equations in the lecture slides; see the note after this list).
  • Compare both the RMSE and running time for each case.
  • Missing any of the boldfaced requirements causes point deduction.
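As a reminder of what these knobs control (this mirrors the biased-SVD formulation in the surprise documentation): the prediction is $\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u$, and each SGD step applies updates such as $b_u \leftarrow b_u + \gamma\,(e_{ui} - \lambda b_u)$ and $p_u \leftarrow p_u + \gamma\,(e_{ui}\, q_i - \lambda p_u)$, where $e_{ui} = r_{ui} - \hat{r}_{ui}$. Here lr_all sets the step size $\gamma$, reg_all sets the regularization strength $\lambda$, and n_factors is the length of $p_u$ and $q_i$.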
In [33]:
SVD().lr_bu
Out[33]:
0.005
In [34]:
# Your code to prepare data


# Load the dataset
with open(r"E:\PythonProjects\datamining\final_project\data\book_trimed.pkl", 'rb') as file:
    data = pickle.load(file)
data.shape
Out[34]:
(3290537, 5)
In [35]:
data.rating.value_counts()
Out[35]:
rating
5.0    2065271
4.0     739742
3.0     288198
2.0     112775
1.0      84550
0.0          1
Name: count, dtype: int64
In [36]:
data = data[data['rating'] > 0]
In [37]:
reader = Reader(rating_scale=(1, 5))  # Adjust rating scale as needed
data_set = Dataset.load_from_df(data[['user_id', 'book_id', 'rating']], reader)

Run the code chunk below as-is after the data is loaded.¶

In [38]:
# define a cross-validation iterator
kf = KFold(n_splits=3)
In [39]:
algo = SVD()
for trainset, testset in kf.split(data_set):
    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)
RMSE: 0.8878
RMSE: 0.8886
RMSE: 0.8900

Output example (yours may look different):

RMSE: 0.8948
RMSE: 0.8936
RMSE: 0.8968

Copy the code chunk above, use different parameters for the SVD call, then rerun.¶

In [40]:
# your code `n_factors=50, and 150)`, and **combined** with `lr_all=.01 and 0.1`;  `reg_all=.01, 0.1`
import time

lr_all = [0.01, 0.1]
reg_all = [0.01, 0.1]
n_factors = [50, 150]


def format_time(seconds):
    minutes = seconds // 60
    seconds = seconds % 60
    return f"{int(minutes)}m {seconds:.2f}s"


df_rmse = pd.DataFrame(columns=['lr_all', 'reg_all', 'n_factor', 'time', 'rmse'])

for lr in lr_all:
    for reg in reg_all:
        for n_factor in n_factors:
            algo = SVD(n_factors=n_factor, lr_all=lr, reg_all=reg)

            rmses = []
            start_time = time.time()
            for trainset, testset in kf.split(data_set):
                # train and test algorithm.
                algo.fit(trainset)
                predictions = algo.test(testset)

                # Compute and print Root Mean Squared Error
                print(f"lr_all:{lr}, reg_all:{reg}, n_factor:{n_factor}")
                rmse = accuracy.rmse(predictions, verbose=True)
                print(f"RMSE: {rmse:.4f}")
                rmses.append(rmse)

            end_time = time.time()
            fold_time = end_time - start_time
            mean_rmse = np.mean(rmses)
            row = pd.DataFrame(
                {'lr_all': [lr], 'reg_all': [reg], 'n_factor': [n_factor], 'time': [fold_time], 'rmse': [mean_rmse]})
            df_rmse = pd.concat([df_rmse, row], ignore_index=True)
lr_all:0.01, reg_all:0.01, n_factor:50
RMSE: 0.8838
RMSE: 0.8838
lr_all:0.01, reg_all:0.01, n_factor:50
RMSE: 0.8840
RMSE: 0.8840
lr_all:0.01, reg_all:0.01, n_factor:50
RMSE: 0.8858
RMSE: 0.8858
C:\Users\tropi\AppData\Local\Temp\ipykernel_31144\2159502691.py:40: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  df_rmse = pd.concat([df_rmse, row], ignore_index=True)
lr_all:0.01, reg_all:0.01, n_factor:150
RMSE: 0.8925
RMSE: 0.8925
lr_all:0.01, reg_all:0.01, n_factor:150
RMSE: 0.8924
RMSE: 0.8924
lr_all:0.01, reg_all:0.01, n_factor:150
RMSE: 0.8931
RMSE: 0.8931
lr_all:0.01, reg_all:0.1, n_factor:50
RMSE: 0.8755
RMSE: 0.8755
lr_all:0.01, reg_all:0.1, n_factor:50
RMSE: 0.8767
RMSE: 0.8767
lr_all:0.01, reg_all:0.1, n_factor:50
RMSE: 0.8773
RMSE: 0.8773
lr_all:0.01, reg_all:0.1, n_factor:150
RMSE: 0.8803
RMSE: 0.8803
lr_all:0.01, reg_all:0.1, n_factor:150
RMSE: 0.8810
RMSE: 0.8810
lr_all:0.01, reg_all:0.1, n_factor:150
RMSE: 0.8807
RMSE: 0.8807
lr_all:0.1, reg_all:0.01, n_factor:50
RMSE: 0.8982
RMSE: 0.8982
lr_all:0.1, reg_all:0.01, n_factor:50
RMSE: 0.8988
RMSE: 0.8988
lr_all:0.1, reg_all:0.01, n_factor:50
RMSE: 0.8996
RMSE: 0.8996
lr_all:0.1, reg_all:0.01, n_factor:150
RMSE: 0.8861
RMSE: 0.8861
lr_all:0.1, reg_all:0.01, n_factor:150
RMSE: 0.8859
RMSE: 0.8859
lr_all:0.1, reg_all:0.01, n_factor:150
RMSE: 0.8854
RMSE: 0.8854
lr_all:0.1, reg_all:0.1, n_factor:50
RMSE: 0.8922
RMSE: 0.8922
lr_all:0.1, reg_all:0.1, n_factor:50
RMSE: 0.8922
RMSE: 0.8922
lr_all:0.1, reg_all:0.1, n_factor:50
RMSE: 0.8927
RMSE: 0.8927
lr_all:0.1, reg_all:0.1, n_factor:150
RMSE: 0.8827
RMSE: 0.8827
lr_all:0.1, reg_all:0.1, n_factor:150
RMSE: 0.8824
RMSE: 0.8824
lr_all:0.1, reg_all:0.1, n_factor:150
RMSE: 0.8825
RMSE: 0.8825
In [41]:
# I stored the result in a sqlite db

df_rmse = pd.read_sql("SELECT * FROM rmse", sqlite3.connect("usr_rcs.sqlite"))
In [42]:
df_rmse
Out[42]:
lr_all reg_all n_factor time rmse
0 0.01 0.01 50 111.410039 0.884048
1 0.01 0.01 150 135.617260 0.892240
2 0.01 0.10 50 109.017855 0.876004
3 0.01 0.10 150 138.573165 0.880057
4 0.10 0.01 50 109.442813 0.899201
5 0.10 0.01 150 138.581531 0.885470
6 0.10 0.10 50 110.087006 0.891641
7 0.10 0.10 150 137.073136 0.881697

Visualization in the Chrome browser¶

In [43]:
import plotly.express as px
import plotly.io as pio

pio.renderers.default = "browser"
In [44]:
# Create interactive 3D scatter plot
fig = px.scatter_3d(
    df_rmse,
    x='reg_all',
    y='n_factor',
    z='rmse',
    color='lr_all',
    title='rmse vs reg_all vs n_factor',
    labels={
        'reg_all': 'reg_all',
        'n_factor': 'n_factor',
        'rmse': 'rmse',
        'lr_all': 'Learning Rate'
    },
    color_discrete_sequence=px.colors.qualitative.Set1,  # Use a discrete color palette
    hover_data={
        'rmse': ':.4f',  # Format RMSE to 4 decimal places
        'reg_all': True,
        'n_factor': True
    },
    # Custom hover template
    hover_name=None,
    opacity=0.6

)

# Update layout for better visualization
fig.update_layout(
    scene=dict(
        xaxis_title='reg_all',
        yaxis_title='n_factor',
        zaxis_title='rmse'
    ),
    width=900,
    height=700,
    margin=dict(l=0, r=0, b=0, t=30)
)

fig.update_traces(
    hovertemplate="<br>".join([
        "RMSE: %{z:.4f}",
        "Regularization: %{x}",
        "n_factor: %{y}",
    ])
)

# Show the plot
fig.show()

Write down your observation and explain what you see.¶

My observation: the matrix is sparse and possibly noisy, so a lower n_factor and a higher reg_all are beneficial for lower RMSE.¶

Task Six: use the model for recommendation (15pts)¶

Use any one of the users from the trainset above in the last code chunk (n_factors=150) and predict their book ratings. Print the top 10 books that they did not rate but would rate highest, as the recommended books.

You can reference document here: https://surprise.readthedocs.io/en/stable/getting_started.html#train-on-a-whole-trainset-and-the-predict-method

Read the example code about how to do this. Please note that for the prediction syntax it does not matter whether

algo = KNNBasic()

or

algo = SVD()

The syntax for prediction is the same.
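A minimal sketch of the prediction call (the ids are placeholders; algo is the fitted model from the previous task):

    # predict() takes raw (not inner) ids and returns a Prediction namedtuple
    pred = algo.predict(uid='SOME_USER_ID', iid='SOME_BOOK_ID')
    print(pred.est)      # estimated rating on the 1-5 scale
    print(pred.details)  # e.g. whether the prediction was impossible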

Recommend the top 10 books for a given uid¶

In [45]:
def get_top_10_book(uid: str = "0373244045"):
    book_list = data.book_id.unique()
    # The books this user has rated (a set makes the membership test O(1))
    user_book = set(data[data['user_id'] == uid]['book_id'].unique())
    # The books not rated by this user
    user_not_book = [i for i in book_list if i not in user_book]
    # Predict a rating for every book not rated by this user (.est is the estimate)
    user_not_book_rate = [algo.predict(uid, i).est for i in user_not_book]
    # Sort the book indices by predicted rating in descending order, take the top 10
    top_n_book = np.argsort(user_not_book_rate)[::-1][:10]
    # Map the sorted top-10 indices back to book ids
    selected_books = [user_not_book[i] for i in top_n_book]
    return selected_books
In [46]:
get_top_10_book('1500336513')
Out[46]:
['A287QJTE1HFZZ5',
 'A15Y0XJPEGJYTK',
 'A6R7UMU1J68XQ',
 'A2MOAH15GI79HF',
 'ATZESZ8SZOWDF',
 'APBOM5Z9PHPUH',
 'AIR269ZFK02P8',
 'A3HEPG2AS0YG6Z',
 'A5UT8WKIJILKA',
 'A1KF55C7XRPSC9']
In [47]:
data
Out[47]:
user_id book_id rating date year
5 0001713353 AVP0HXC9FG790 5.0 2016-06-20 2016
11 0001713353 A2RE7WG349NV5D 5.0 2015-07-09 2015
15 1451693494 A3518R1F46NWOM 5.0 2014-08-19 2014
21 1502830825 A1FW6L02VSCUYH 5.0 2015-01-04 2015
23 0001713353 A3M314LI9OYME 5.0 2013-10-08 2013
... ... ... ... ... ...
9999984 0553393235 A2L0GFTTI618RU 5.0 2013-10-05 2013
9999987 042526257X AZTFBJTENKT1T 5.0 2015-11-24 2015
9999994 1478227133 A2USTMJDOQPOA2 3.0 2014-09-22 2014
9999995 1490914943 AGE28EEF9PE6U 5.0 2014-06-08 2014
9999998 1500165972 A1TNVUMSPY8BM1 4.0 2014-06-10 2014

3290536 rows × 5 columns

Task Seven: build the SVD from scratch. (20pts, including 10pts as bonus)¶

Reference the article here and the notebook here for doing SVD/matrix factorization from scratch without using surprise. You can copy the author's code, but you must use our reservoir-sampled data from Books.csv, and you can further reduce the data size to 10% of the rows to speed up computation.

Play with the author's code first; once you understand it, plug in our dataset.

You can use the train/test split function from surprise to do train/test split:

from surprise.model_selection import train_test_split
trainset, testset = train_test_split(data, test_size=.25)

# train your copied model with our data

Important Note:

  • This is a fairly large dataset. In order to avoid crashing due to memory blowing up, some students in the past resampled the total number of rating rows down to 3489820. This is fine if you do not care about bonus points. For full bonus points, you are allowed to randomly resample at least 30% of the rows after Task Four and then trim again to discard books that receive equal to or fewer than 2 user ratings (after the 30% resampling, some books will lose some user ratings and become rare books again). Trick for memory problems: compute how many GB are used by P and Q when K is, say, 200; those are the two largest memory chunks in the project if you program it properly (see the worked memory estimate after this list). Also compute what would happen if you directly created the utility matrix (so you know why you should avoid doing so at any stage of the project). In Colab the available memory is ~10 GB. Setting the numpy dtype to "float32" (each float takes 4 bytes) will halve the memory usage compared with the default "double"/"float64".

  • In addition, you must use a sparse matrix representation for the utility matrix to avoid crashing (refer to the previous lab and the documentation for csr_matrix in scipy.sparse).

  • You have to figure out how to convert the trainset and testset objects into the data format expected by the MF class. You can use Python's dir() function to explore the methods/attributes of trainset and testset, but the best way is to read the surprise documentation.

  • Then call the get_rating method of the MF class to predict ratings for the testset, and compute the RMSE using the ratings from the testset as ground truth.
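A worked memory estimate for the trick above (a sketch; the user/item counts are the ones this notebook obtains later for the full sampled file, and the commented conversion snippet relies on attributes documented for surprise's Trainset):

    num_users, num_items, K = 1_588_262, 5_345_281, 200
    gb = lambda n_bytes: n_bytes / 1024 ** 3

    # P and Q are the two largest dense arrays if you program it properly
    print(f"P (float32): {gb(num_users * K * 4):.2f} GB")   # ~1.2 GB
    print(f"Q (float32): {gb(num_items * K * 4):.2f} GB")   # ~4.0 GB (float64 would double both)

    # a dense utility matrix must never be materialized
    print(f"dense R (float32): {gb(num_users * num_items * 4):.0f} GB")  # ~32,000 GB

    # Converting surprise's split objects: trainset.all_ratings() yields
    # (inner_uid, inner_iid, rating) tuples, and testset is already a plain
    # list of (raw_uid, raw_iid, rating) tuples.
    # rows, cols, vals = zip(*trainset.all_ratings())
    # R_train = csr_matrix((vals, (rows, cols)), shape=(trainset.n_users, trainset.n_items))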

In [48]:
class MF():

    def __init__(self, R, K, alpha, beta, iterations):
        """
        Perform matrix factorization to predict empty
        entries in a matrix.

        Arguments
        - R (ndarray)   : user-item rating matrix
        - K (int)       : number of latent dimensions
        - alpha (float) : learning rate
        - beta (float)  : regularization parameter
        """

        self.R = R
        self.num_users, self.num_items = R.shape
        self.K = K
        self.alpha = alpha
        self.beta = beta
        self.iterations = iterations

    def train(self):
        # Initialize user and item latent feature matrices
        self.P = np.random.normal(scale=1. / self.K, size=(self.num_users, self.K))
        self.Q = np.random.normal(scale=1. / self.K, size=(self.num_items, self.K))

        # Initialize the biases
        self.b_u = np.zeros(self.num_users)
        self.b_i = np.zeros(self.num_items)
        self.b = np.mean(self.R[np.where(self.R != 0)])

        # Create a list of training samples
        self.samples = [
            (i, j, self.R[i, j])
            for i in range(self.num_users)
            for j in range(self.num_items)
            if self.R[i, j] > 0
        ]

        # Perform stochastic gradient descent for number of iterations
        training_process = []
        for i in range(self.iterations):
            np.random.shuffle(self.samples)
            self.sgd()
            mse = self.mse()
            training_process.append((i, mse))
            if (i + 1) % 10 == 0:
                print("Iteration: %d ; error = %.4f" % (i + 1, mse))

        return training_process

    def mse(self):
        """
        Compute the square root of the summed squared error over all observed ratings
        """
        xs, ys = self.R.nonzero()
        predicted = self.full_matrix()
        error = 0
        for x, y in zip(xs, ys):
            error += pow(self.R[x, y] - predicted[x, y], 2)
        return np.sqrt(error)

    def sgd(self):
        """
        Perform stochastic gradient descent
        """
        for i, j, r in self.samples:
            # Compute the prediction and error
            prediction = self.get_rating(i, j)
            e = (r - prediction)

            # Update biases
            self.b_u[i] += self.alpha * (e - self.beta * self.b_u[i])
            self.b_i[j] += self.alpha * (e - self.beta * self.b_i[j])

            # Update user and item latent feature matrices
            self.P[i, :] += self.alpha * (e * self.Q[j, :] - self.beta * self.P[i, :])
            self.Q[j, :] += self.alpha * (e * self.P[i, :] - self.beta * self.Q[j, :])

    def get_rating(self, i, j):
        """
        Get the predicted rating of user i and item j
        """
        prediction = self.b + self.b_u[i] + self.b_i[j] + self.P[i, :].dot(self.Q[j, :].T)
        return prediction

    def full_matrix(self):
        """
        Compute the full matrix using the resultant biases, P and Q
        """
        return self.b + self.b_u[:, np.newaxis] + self.b_i[np.newaxis, :] + self.P.dot(self.Q.T)
In [49]:
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
])

mf = MF(R, K=2, alpha=0.1, beta=0.01, iterations=20)
In [50]:
mf.train()
mf.get_rating(2, 3)
mf.mse()
mf.full_matrix()
Iteration: 10 ; error = 0.3487
Iteration: 20 ; error = 0.0627
Out[50]:
array([[4.97741296, 3.01432575, 2.15464326, 1.00159637],
       [4.00816112, 2.56530194, 1.91469885, 1.01899245],
       [1.01628937, 0.98161098, 5.84116035, 4.98098285],
       [0.99735811, 1.03013045, 4.63620617, 3.9806118 ],
       [1.47363508, 1.02646674, 4.97501264, 4.01042782]])

Load data¶

In [51]:
def count_lines(file_path):
    with open(file_path, 'r') as f:
        count = sum(1 for _ in f)
    return count


def print_first_lines(file_path):
    with open(file_path, 'r') as f:
        print(f.readline())


# Use the function
file_path = r"E:\PythonProjects\datamining\final_project\data\books10m.csv"
print_first_lines(file_path)
total_lines = count_lines(file_path)
print(f"Total number of lines: {total_lines}")
0001713353,A1C6M8LCIX4M6M,5.0,1123804800

Total number of lines: 10000000

Commencement of official operations¶

Load the data into a CSR matrix, building the ratings matrix and the user-to-index and book-to-index mappings¶

In [52]:
import numpy as np
from scipy.sparse import csr_matrix
import csv


def create_csr_from_csv(file_path, chunk_size=1000000):
    # Dictionaries to map user/book IDs to indices
    user_to_idx = {}
    book_to_idx = {}
    current_user_idx = 0
    current_book_idx = 0

    # Lists to store data for CSR matrix
    rows = []
    cols = []
    data = []

    with open(file_path, 'r') as f:
        reader = csv.reader(f)
        # next(reader)  # Skip header if exists

        for i, line in enumerate(reader):
            user_id, book_id, rating = line[:3]  # first 3 columns; the 4th is the date, not needed here

            rating = float(rating)

            if rating <= 0:
                continue

            # Map raw IDs to integer indices
            if user_id not in user_to_idx:
                user_to_idx[user_id] = current_user_idx
                current_user_idx += 1
            if book_id not in book_to_idx:
                book_to_idx[book_id] = current_book_idx
                current_book_idx += 1

            # Append COO-style entries for the CSR constructor
            rows.append(user_to_idx[user_id])
            cols.append(book_to_idx[book_id])
            data.append(rating)

            # Report progress every chunk_size rows
            if len(rows) % chunk_size == 0:
                print(f"Processed {i + 1} rows")

    # Create the final CSR matrix
    rating_matrix = csr_matrix(
        (data, (rows, cols)),
        shape=(len(user_to_idx), len(book_to_idx))
    )

    print(f"Final matrix shape: {rating_matrix.shape}")
    print(f"Number of non-zero entries: {rating_matrix.nnz}")
    print(f"Sparsity: {rating_matrix.nnz / (rating_matrix.shape[0] * rating_matrix.shape[1]):.4%}")

    return rating_matrix, user_to_idx, book_to_idx


# Use the function
file_path = r"E:\PythonProjects\datamining\final_project\data\books10m.csv"
rating_matrix, user_to_idx, book_to_idx = create_csr_from_csv(file_path)
Processed 1000000 rows
Processed 2000000 rows
Processed 3000000 rows
Processed 4000000 rows
Processed 5000000 rows
Processed 6000000 rows
Processed 7000000 rows
Processed 8000001 rows
Processed 9000001 rows
Final matrix shape: (1588262, 5345281)
Number of non-zero entries: 9988363
Sparsity: 0.0001%

Dump the rating matrix and the user-to-index and book-to-index mappings for future reference¶

In [53]:
# Saving the rating_matrix, user_to_idx, book_to_idx to file
rating_matrix_params = {
    'rating_matrix': rating_matrix,
    'user_to_idx': user_to_idx,
    'book_to_idx': book_to_idx
}
with open('rating_matrix_params.pkl', 'wb') as f:
    pickle.dump(rating_matrix_params, f)
In [54]:
# Loading the rating_matrix, user_to_idx, book_to_idx from file
with open('rating_matrix_params.pkl', 'rb') as f:
    rating_matrix_params = pickle.load(f)

# Access the variables
rating_matrix = rating_matrix_params['rating_matrix']
user_to_idx = rating_matrix_params['user_to_idx']
book_to_idx = rating_matrix_params['book_to_idx']
In [55]:
rating_matrix.data.mean()
Out[55]:
4.3980990678852985

Convert the CSR rating matrix to a list of tuples, so that it can easily be moved to the GPU¶

In [56]:
from tqdm import tqdm


def gen_samples(R):
    # Get nonzero indices
    i_indices, j_indices = R.nonzero()
    total = len(i_indices)

    # Create samples list with tqdm progress bar
    samples = [
        (i, j, R[i, j])
        for i, j in tqdm(zip(i_indices, j_indices),
                         total=total,
                         desc="Generating samples")
    ]
    return samples


samples = gen_samples(rating_matrix)
Generating samples: 100%|██████████| 9988363/9988363 [03:03<00:00, 54354.72it/s]
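If the per-element R[i, j] lookups above ever become the bottleneck, a faster variant is sketched below; it assumes the CSR matrix is in canonical form (sorted indices, no explicitly stored zeros), so that the (row, col) pairs from nonzero() line up with R.data:

    def gen_samples_fast(R):
        # For a canonical CSR matrix, nonzero() walks the stored entries in the
        # same order as R.data, so no per-element indexing is needed.
        rows, cols = R.nonzero()
        return list(zip(rows.tolist(), cols.tolist(), R.data.tolist()))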

Import torch and confirm that the compute hardware is available¶

In [57]:
import torch

print(f"CUDA: {torch.cuda.is_available()}")


# pip3 install torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
CUDA: True

Drop the sub-tank and get ready to fight¶

In [58]:
if ('mfcsr' in locals() or 'mfcsr' in globals()) and mfcsr is not None:
    del mfcsr
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_max_memory_cached()

gc.collect()
# Check GPU memory usage
print(torch.cuda.memory_allocated())
print(torch.cuda.memory_reserved())
D:\PythonProjects\class2_venv2\.venv\Lib\site-packages\torch\cuda\memory.py:489: FutureWarning:

torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.

D:\PythonProjects\class2_venv2\.venv\Lib\site-packages\torch\cuda\memory.py:515: FutureWarning:

torch.cuda.reset_max_memory_cached now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.

0
0
In [59]:
# First delete any existing tensors/models
if ('mfcsr' in locals() or 'mfcsr' in globals()) and mfcsr is not None:
    del mfcsr

# Clear all PyTorch GPU cached memory
if torch.cuda.is_available():
    # Delete all references to tensors and models
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                if obj.is_cuda:
                    del obj
            elif hasattr(obj, 'cuda'):
                if hasattr(obj, 'is_cuda'):
                    if obj.is_cuda:
                        del obj
        except Exception as e:
            pass

    # Empty cache
    torch.cuda.empty_cache()
    # Reset peak stats
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_max_memory_cached()

# Force garbage collection multiple times
for _ in range(2):
    gc.collect()

# Check memory status
print(f"Allocated: {torch.cuda.memory_allocated() / 1024 ** 2:.2f} MB")
print(f"Reserved: {torch.cuda.memory_reserved() / 1024 ** 2:.2f} MB")
D:\PythonProjects\class2_venv2\.venv\Lib\site-packages\torch\__init__.py:1117: FutureWarning:

`torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead

C:\Users\tropi\AppData\Local\Temp\ipykernel_31144\2758481352.py:13: FutureWarning:

`torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead

D:\PythonProjects\class2_venv2\.venv\Lib\site-packages\torch\cuda\memory.py:489: FutureWarning:

torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.

D:\PythonProjects\class2_venv2\.venv\Lib\site-packages\torch\cuda\memory.py:515: FutureWarning:

torch.cuda.reset_max_memory_cached now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.

Allocated: 0.00 MB
Reserved: 0.00 MB

Prototyping the battleship¶

In [60]:
# from tqdm import tqdm
# import numpy as np
# import torch


class MF_csr_gpu():
    """
    Matrix Factorization class optimized for CSR matrix format with numerical stability improvements.

    Uses float64 precision and includes safeguards against overflow/underflow conditions.

    Parameters
    ----------
    R : scipy.sparse.csr_matrix
        User-item rating matrix in CSR format
    K : int
        Number of latent dimensions
    alpha : float
        Learning rate (between 0 and 1)
    beta : float
        Regularization parameter
    iterations : int
        Number of iterations for training
    samples : list
        Pre-generated training samples as (user, item, rating) tuples

    Attributes
    ----------
    P : ndarray
        User latent feature matrix
    Q : ndarray
        Item latent feature matrix
    b_u : ndarray
        User biases
    b_i : ndarray
        Item biases
    b : float
        Global bias
    """

    def __init__(self, R, K, alpha, beta, iterations, samples):
        self.R = R
        self.num_users, self.num_items = R.shape
        self.K = K
        self.alpha = np.float64(alpha)
        self.beta = np.float64(beta)
        self.iterations = iterations
        self.samples = samples

        # Numerical bounds for stability
        self.MIN_RATING = 1.0
        self.MAX_RATING = 5.0
        self.EPSILON = np.finfo(np.float64).eps

        # Check if CUDA is available
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        print(self.device)
        torch.cuda.empty_cache()

    def train(self):
        # Initialize matrices with float64 dtype
        rng = np.random.default_rng()
        init_scale = np.sqrt(1.0 / self.K)  # More stable initialization

        self.P = rng.normal(0, init_scale, size=(self.num_users, self.K)).astype(np.float64)
        self.Q = rng.normal(0, init_scale, size=(self.num_items, self.K)).astype(np.float64)

        # Initialize biases with float64
        self.b_u = np.zeros(self.num_users, dtype=np.float64)
        self.b_i = np.zeros(self.num_items, dtype=np.float64)
        self.b = np.float64(self.R.data.mean())

        print(f"Number of training samples: {len(self.samples)}")

        # Move data to GPU
        self.b_u = torch.tensor(self.b_u, dtype=torch.float64, device="cuda")
        self.b_i = torch.tensor(self.b_i, dtype=torch.float64, device="cuda")
        self.P = torch.tensor(self.P, dtype=torch.float64, device="cuda")
        self.Q = torch.tensor(self.Q, dtype=torch.float64, device="cuda")
        self.b = torch.tensor(self.b, dtype=torch.float64, device="cuda")
        training_process = []
        for i in range(self.iterations):
            # print(f"shuffling")
            rng.shuffle(self.samples)
            # print(f"Iteration: {i+1}")
            # print("sgd")
            self.sgd()
            # print("mse")
            mse = self.mse_gpu()
            training_process.append((i, mse))
            print("Iteration: %d ; error = %.4f" % (i + 1, mse))

        return training_process

    def mse_gpu(self):
        # xs, ys = self.R.nonzero()

        squared_error = 0.0
        n_ratings = len(self.samples)

        batch_size = 2000000
        for i in range(0, n_ratings, batch_size):
            # print(f"mse,i:{i}")
            samples = torch.tensor(self.samples[i:i + batch_size], dtype=torch.float64,
                                   device="cuda")
            batch_xs = samples[:, 0]
            batch_ys = samples[:, 1]
            all_actual_ratings = samples[:, 2]

            P_batch = self.P[batch_xs.long()]
            Q_batch = self.Q[batch_ys.long()]

            # Compute dot products for the batch
            dot_products = torch.sum(P_batch * Q_batch, dim=1)
            predictions = self.b + self.b_u[batch_xs.long()] + self.b_i[batch_ys.long()] + dot_products
            # predictions_gpu = torch.clamp(predictions, self.MIN_RATING, self.MAX_RATING)

            # Calculate squared errors for this batch
            batch_errors = all_actual_ratings - predictions
            batch_squared_error = torch.sum(batch_errors ** 2).item()

            # Add to total squared error
            squared_error += batch_squared_error

        result = np.sqrt(squared_error / (n_ratings + self.EPSILON))

        return result

    def sgd(self):
        """
        Perform stochastic gradient descent with numerical stability improvements.
        """
        # Process the samples in batches, moving each batch to the GPU
        batch_size = 1000000
        for i in range(0, len(self.samples), batch_size):
            # print(f"sgd,i:{i}")
            samples = torch.tensor(self.samples[i:i + batch_size], dtype=torch.float64,
                                   device="cuda")

            # Unpack i, j, r from samples
            i = samples[:, 0].long()  # Convert to long for indexing
            j = samples[:, 1].long()  # Convert to long for indexing
            r = samples[:, 2]  # Ratings

            # Predictions for all samples
            predictions = (
                    torch.sum(self.P[i] * self.Q[j], dim=1) + self.b_u[i] + self.b_i[j] + self.b
            )

            # Compute errors
            errors = r - predictions

            # Gradient updates for biases
            bu_grad = errors - self.beta * self.b_u[i]
            bi_grad = errors - self.beta * self.b_i[j]

            # Update user and item biases
            self.b_u[i] += self.alpha * torch.clamp(bu_grad, -1e3, 1e3)
            self.b_i[j] += self.alpha * torch.clamp(bi_grad, -1e3, 1e3)

            # Gradient updates for latent factors
            P_grad = errors.unsqueeze(1) * self.Q[j] - self.beta * self.P[i]
            Q_grad = errors.unsqueeze(1) * self.P[i] - self.beta * self.Q[j]

            # Update latent factors
            self.P.index_add_(0, i, self.alpha * torch.clamp(P_grad, -1e3, 1e3))
            self.Q.index_add_(0, j, self.alpha * torch.clamp(Q_grad, -1e3, 1e3))

Focusing on the target¶

In [61]:
rating_matrix
Out[61]:
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 9988363 stored elements and shape (1588262, 5345281)>

Instantiation and wrangling with the target¶

In [62]:
mfcsr = MF_csr_gpu(rating_matrix, K=50, alpha=0.001, beta=0.2, iterations=20, samples=samples)
training_process = mfcsr.train()
cuda
Number of training samples: 9988363
Iteration: 1 ; error = 1.0936
Iteration: 2 ; error = 1.0882
Iteration: 3 ; error = 1.0828
Iteration: 4 ; error = 1.0766
Iteration: 5 ; error = 1.0690
Iteration: 6 ; error = 1.0587
Iteration: 7 ; error = 1.0466
Iteration: 8 ; error = 1.0368
Iteration: 9 ; error = 1.0311
Iteration: 10 ; error = 1.0269
Iteration: 11 ; error = 1.0229
Iteration: 12 ; error = 1.0190
Iteration: 13 ; error = 1.0152
Iteration: 14 ; error = 1.0114
Iteration: 15 ; error = 1.0077
Iteration: 16 ; error = 1.0041
Iteration: 17 ; error = 1.0005
Iteration: 18 ; error = 0.9969
Iteration: 19 ; error = 0.9934
Iteration: 20 ; error = 0.9900

Scavenge for loot¶

In [63]:
# Create CPU copies while keeping original CUDA tensors unchanged
b_u_cpu = mfcsr.b_u.cpu().clone()
b_i_cpu = mfcsr.b_i.cpu().clone()
P_cpu = mfcsr.P.cpu().clone()
Q_cpu = mfcsr.Q.cpu().clone()
b_cpu = mfcsr.b.cpu().clone()
In [64]:
# Create a dictionary with all the parameters
# Commented out to avoid overwriting

# model_params = {
#     'b_u': b_u_cpu,
#     'b_i': b_i_cpu,
#     'P': P_cpu,
#     'Q': Q_cpu,
#     'b': b_cpu,
#     'training_process': training_process
# }
#
# # Save the parameters to a file
# with open('model_params.pkl', 'wb') as f:
#     pickle.dump(model_params, f)

Reproducing the Intellectual achievements¶

In [65]:
with open('model_params.pkl', 'rb') as f:
    loaded_params = pickle.load(f)

# Access the variables
b_u_cpu = loaded_params['b_u']
b_i_cpu = loaded_params['b_i']
P_cpu = loaded_params['P']
Q_cpu = loaded_params['Q']
b_cpu = loaded_params['b']
training_process = loaded_params['training_process']

The prophet appears¶

In [66]:
def predict_cpu(user_id: str, book_id: str):
    """
    Predict rating for given user and item with numerical stability checks.

    Parameters
    ----------
    user_id : str
        User ID
    book_id : str
        book ID

    Returns
    -------
    float
        Predicted rating clipped to valid range
    """

    MIN_RATING = 1.0
    MAX_RATING = 5.0

    uid = user_to_idx.get(user_id, None)
    iid = book_to_idx.get(book_id, None)

    if uid is None or iid is None:
        return -1

    dot_product = np.dot(P_cpu[uid, :], Q_cpu[iid, :])
    prediction = b_cpu + b_u_cpu[uid] + b_i_cpu[iid] + dot_product

    # Clip prediction to valid rating range
    return np.clip(prediction, MIN_RATING, MAX_RATING)

Chant a spell¶

In [67]:
predict_cpu('1939811104', 'AWH493Q118CB7')
Out[67]:
tensor(4.4173, dtype=torch.float64)