In this project, a dataset of about 82,000 rows and 20 columns containing reviews of various board games was provided. A supervised machine learning model was then trained and evaluated to predict the average rating given to a particular board game based on several features of the dataset.
To choose the machine learning algorithms best suited to predicting the average rating of the board games, a number of regression models were considered and the data was investigated to select among them.
Since the problem is to predict a numerical value, the following options could be chosen:
Linear Regression can be chosen if the variables show a linear correlation with the label. If the relationship is not linear, a linear regression model would not be accurate.
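The point about linearity can be illustrated with a minimal sketch on synthetic data. The data, the variable names and the choice of R² as the score are assumptions made for this illustration and are not taken from the board game dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(500, 1))

# Hypothetical targets: one depends linearly on x, the other quadratically
y_linear = 2.0 * x[:, 0] + 1.0 + rng.normal(0, 0.1, size=500)
y_quadratic = x[:, 0] ** 2 + rng.normal(0, 0.1, size=500)

# R^2 of a straight-line fit on each target
r2_linear = LinearRegression().fit(x, y_linear).score(x, y_linear)
r2_quadratic = LinearRegression().fit(x, y_quadratic).score(x, y_quadratic)

print("R^2 on linear target:   ", r2_linear)
print("R^2 on quadratic target:", r2_quadratic)
```

The straight-line fit explains almost all of the variance of the linear target and very little of the quadratic one, even though both targets are fully determined by x up to noise.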
A Decision Tree is an intuitive model whereby one traverses down the branches of the tree and selects the next branch based on a decision at each node. While building the tree, the goal is to split on the attributes that create the purest child nodes possible, which keeps to a minimum the number of splits needed to classify all instances in the dataset. Purity is measured by the concept of information gain, which relates to how much would need to be known about a previously unseen instance for it to be properly classified. (For regression trees, splits are typically chosen to reduce the variance of the target within each child node instead.)
Random Forests are simply an ensemble of decision trees. The input vector is run through multiple decision trees. For regression, the output values of all the trees are averaged; for classification, a voting scheme is used to determine the final class.
Tree-based models are great at learning complex, highly non-linear relationships, and a single decision tree is very easy to interpret and understand. However, they can be prone to major overfitting, and using larger random forest ensembles to achieve higher performance comes with the drawbacks of being slower and requiring more memory.
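The averaging step for regression can be checked directly in scikit-learn. The following is a small sketch on synthetic data (the data and the hyperparameters are assumptions chosen only for illustration); it compares the forest's prediction with the mean of the predictions of its individual trees, which are exposed through the `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=300)

forest = RandomForestRegressor(n_estimators=20, min_samples_leaf=5, random_state=0)
forest.fit(X, y)

# Predict with every individual tree, then average across trees
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction should match the manual average
forest_prediction = forest.predict(X)
matches = np.allclose(manual_average, forest_prediction)
print(matches)
```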
A Neural Network consists of an interconnected group of nodes called neurons. The input feature variables from the data are passed to these neurons as a multi-variable linear combination, where the values multiplied by each feature variable are known as weights. A non-linearity is then applied to this linear combination, which gives the neural network the ability to model complex non-linear relationships. A neural network can have multiple layers, where the output of one layer is passed to the next one in the same way. At the output of a regression network, there is generally no non-linearity applied. Neural Networks are trained using Stochastic Gradient Descent (SGD) and the backpropagation algorithm. They are very effective for data with complex non-linear relationships and require few assumptions about the structure of the data. However, these models can be difficult to interpret and computationally demanding.
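The forward pass described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: one hidden layer, ReLU as the non-linearity, and randomly initialised weights; the names `relu` and `forward` are hypothetical helpers and no training is performed.

```python
import numpy as np

def relu(z):
    """Non-linearity applied to the linear combination of the inputs."""
    return np.maximum(z, 0.0)

def forward(x, W1, b1, W2, b2):
    """One hidden layer followed by a linear output layer (regression)."""
    hidden = relu(x @ W1 + b1)   # weighted sum of features, then non-linearity
    return hidden @ W2 + b2      # no non-linearity at the output

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))      # 4 samples, 3 features
W1 = rng.normal(size=(3, 5))     # weights of the hidden layer (5 neurons)
b1 = np.zeros(5)
W2 = rng.normal(size=(5, 1))     # weights of the output layer
b2 = np.zeros(1)

output = forward(x, W1, b1, W2, b2)
print(output.shape)
```

Training would adjust `W1`, `b1`, `W2` and `b2` by gradient descent on a loss such as the mean squared error; that step is omitted here.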
import sys
import pandas
import sklearn
import matplotlib
import seaborn
print(sys.version)
print(pandas.__version__)
print(matplotlib.__version__)
print(seaborn.__version__)
print(sklearn.__version__)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
#Loading the data
game = pandas.read_csv("games.csv")
# Observing the shape, columns and some rows of the dataframe
print(game.shape)
print(game.columns)
game.head()
plt.hist(game['average_rating'])
As can be observed from the above histogram, a large number of the games in the dataframe have an average rating of zero. These rows need to be closely observed to determine the reason for the zero rating.
# Printing 10 of the rows with an average rating of 0 and 10 of the rows with an average rating greater than 0
print(game[game["average_rating"] == 0].iloc[0:10])
print(game[game["average_rating"] > 0].iloc[0:10])
From observing the rows with an average rating of 0, it can be reasonably concluded that every game with a 0 rating also has 0 users who rated it. In other words, games that have not been published, played or rated show an average rating of 0. These rows can therefore be removed from the dataframe. Further, if any missing values are present in the dataframe, those rows must also be removed.
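Rather than relying on a visual inspection of ten rows, the conclusion can be checked for every zero-rated row. The sketch below uses a small hypothetical dataframe `toy` with made-up values; only the column names 'average_rating' and 'users_rated' are taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the games dataframe
toy = pd.DataFrame({
    "average_rating": [0.0, 7.2, 0.0, 6.1],
    "users_rated": [0, 150, 0, 42],
})

# Select the zero-rated rows and check that none of them has any ratings
zero_rated = toy[toy["average_rating"] == 0]
all_unrated = bool((zero_rated["users_rated"] == 0).all())
print(all_unrated)
```

Applying the same two lines to `game` would confirm or refute the claim for the full dataset.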
# Finding the number of missing values in each column
game.isnull().sum()
It can be observed that there are 41 missing values in total; the rows containing them can be removed from the dataframe.
# Removing the rows with missing values
game = game.dropna(axis = 0)
# Removing rows with 0 user reviews
game = game[game["users_rated"] > 0]
#Plotting the histogram again
plt.hist(game["average_rating"])
To find out whether there are any strong correlations in the dataset, a correlation matrix is plotted as follows:
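# Correlation Matrix (numeric columns only; recent pandas versions raise an error on text columns such as 'name')
corrmat = game.select_dtypes(include="number").corr()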
fig = plt.figure(figsize =(12,9))
sns.heatmap(corrmat, vmax =0.8, square = True)
plt.show()
From the correlation matrix, the correlation between the values of different columns can be established. A lighter square (towards white) indicates that the two columns are highly correlated, while a darker square indicates little or no correlation. Collinear columns show the maximum correlation value, for example 'playingtime', 'minplaytime' and 'maxplaytime', as well as 'average_rating' and 'bayes_average_rating'. Some columns, such as 'type', 'name' and 'yearpublished', can be removed right away as they provide negligible information about the target 'average_rating'. Lastly, 'id' and 'bayes_average_rating' must be removed: 'id' is only an identifier, so its correlation with 'average_rating' is not a meaningful signal, and 'bayes_average_rating' is collinear with the target. Keeping either would adversely affect the machine learning model.
To ascertain the type of relationship the label 'average_rating' has with the chosen variables, a scatter plot of each variable column against the label column can be plotted.
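Strongly correlated column pairs can also be listed numerically instead of being read off the heatmap. The sketch below runs on a small synthetic dataframe; the values are made up, the column names are borrowed from the dataset only for readability, and the 0.9 threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Hypothetical data: two nearly identical columns and one unrelated column
toy = pd.DataFrame({
    "playingtime": base,
    "maxplaytime": base + rng.normal(0, 0.01, size=200),
    "minage": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so that each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pairs whose absolute correlation exceeds the assumed threshold
strong_pairs = pairs[pairs > 0.9]
print(strong_pairs)
```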
game
First, the column names of the dataframe are collected into a list.
Second, the columns mentioned above are removed from that list, leaving only the columns to be used as variables.
Third, the values of the column to be predicted (in this case 'average_rating') are stored as a separate array, and the values of the columns chosen as variables are stored separately. scikit-learn also accepts dataframes, but plain NumPy arrays are used here.
Finally, the arrays are split randomly into training and test sets.
columns = game.columns.tolist()
# Filtering the columns to be removed
columns = [c for c in columns if c not in ['id', 'name', 'type', 'average_rating', 'bayes_average_rating', 'yearpublished']]
target = 'average_rating'
# Separating the variables and target and storing as arrays
X_Var = game[columns].values
Y_tar = game[target].values
print(X_Var)
print(Y_tar)
# Plotting a scatter plot of each variable against the target
for c in columns:
    plt.scatter(game[c], game[target], alpha=0.4)
    plt.xlabel(c)
    plt.ylabel(target)
    plt.show()
Judging from the previous plots, the following observations and changes could be made in the future to improve the model:
# Splitting the above obtained arrays into testing and training arrays
X_train, X_test, Y_train, Y_test = train_test_split(X_Var, Y_tar, test_size = 0.2, random_state = 53)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
Thus, the training and test datasets were created using the train_test_split function, and the models are trained and evaluated using these datasets.
Choosing to compare multivariate linear regression and decision forest/random forest regression (and neural network regression models), each model was trained using the training set and cross-validated on the training set before being evaluated using the test set.
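One possible way to carry out the cross-validation step is scikit-learn's `cross_val_score`. The sketch below is an assumption about how this could be done, not the procedure used in this notebook; it runs on synthetic data (`X_demo` and `y_demo` are hypothetical), and the choice of 5 folds is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical training data with a known linear relationship plus noise
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(0, 0.1, size=200)

# scikit-learn reports the negated MSE so that higher scores are always better
scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)
cv_mse = -scores.mean()
print("Cross-validated MSE:", cv_mse)
```

With the real data, `X_train` and `Y_train` would take the place of `X_demo` and `y_demo`, so that the test set stays untouched until the final evaluation.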
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
# Instantiating the two models
LR = LinearRegression()
RFR = RandomForestRegressor(n_estimators=100, min_samples_leaf=10, random_state=1)
# Fitting the linear regression model and evaluating it on the test set
LR.fit(X_train, Y_train)
LR_prediction = LR.predict(X_test)
mse_LR = mean_squared_error(Y_test, LR_prediction)
# Fitting the random forest model and evaluating it on the test set
RFR.fit(X_train, Y_train)
RFR_prediction = RFR.predict(X_test)
mse_RFR = mean_squared_error(Y_test, RFR_prediction)
print('Mean Square Error for Linear Regression Model is {}'.format(mse_LR))
print('Mean Square Error for Random Forest Regression Model is {}'.format(mse_RFR))