Tuesday, September 29, 2015

CLASSIFICATION OF IMDB MOVIES



Introduction
This is the third blog post on the IMDB movie dataset. The goal of this post is to design a neural net and train it to classify a movie as good, okay, or bad. The labels good, okay, and bad are derived from the ratings the movies received. Once the neural net is designed and trained, it will be evaluated on a test dataset.
The following three features are considered as inputs to the neural network: movie runtime, the number of genre tags applied to the movie, and the year that the movie was produced. 
Note: The choice of features is very important to the success of a neural net. The features should carry enough information to support the classification of interest. Why were these three chosen? The year captures the temporal aspect of the problem (moviegoer preferences may change across generations).

The post is organized as follows: first, a brief description of neural nets is given, followed by the actual design, the testing, and the conclusion.


What is a neural net?
A neural net consists of a set of input features, one or more hidden layers, and finally an output. The goal is to predict the output given a set of inputs. To achieve that, the neural net first needs to be trained on the available dataset. The training data drives the selection of the coefficients, or connection weights, of the neural net, which are what produce the prediction. Below is a schematic of a neural network.


In the above diagram the neural net has three inputs, two outputs, and one hidden layer. Each circle is a neuron. The number of neurons in the input layer, the hidden layer, and the output layer are 3, 4, and 2 respectively (bias units are excluded). Each neuron has a set of inputs and one output: a linear combination of the inputs is formed using the weighting coefficients, that combination is passed through a sigmoid function, and the output of the sigmoid is fed as one of the inputs to the next layer. Designing a neural net is therefore equivalent to finding the optimal coefficients by which the inputs of the various nodes are multiplied. The optimal coefficients are determined as follows: a cost function involving all the coefficients of the neural net is defined, and the cost function is then minimized with respect to those coefficients. The design is complete once the coefficients at which the cost function attains its minimum have been found. Some of the design challenges are deciding on the number of hidden layers, the number of neurons in each layer, and the value of the regularization parameter needed to avoid both over- and under-fitting the training data.
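To make the forward pass concrete, here is a minimal Python sketch for a network of the shape in the diagram (3 inputs, 4 hidden neurons, 2 outputs). The weight matrices W1 and W2 are random placeholders, not the coefficients actually trained for this post.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Forward pass: 3 inputs -> 4 hidden neurons -> 2 outputs.

    x  : input vector of length 3
    W1 : (4, 4) hidden-layer weights (first column multiplies the bias unit)
    W2 : (2, 5) output-layer weights (first column multiplies the bias unit)
    """
    a1 = np.insert(x, 0, 1.0)            # add the bias unit to the input
    z2 = W1 @ a1                         # linear combination for each hidden neuron
    a2 = np.insert(sigmoid(z2), 0, 1.0)  # sigmoid, then add the bias unit
    z3 = W2 @ a2                         # linear combination for each output neuron
    return sigmoid(z3)                   # outputs lie in (0, 1)

# Example with random (untrained) weights
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(2, 5))
print(forward(np.array([0.5, -1.2, 0.3]), W1, W2))
```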


Back to our problem
For our problem we choose three inputs, one hidden layer with two neurons, and three outputs. There are close to 2800 movies. About 60% of them, i.e., roughly 1600, are used as the training dataset. Another 20%, i.e., roughly 500 movies, are used as the cross-validation dataset, meaning the value of the regularization parameter lambda is chosen on this dataset: the value of lambda that gives the smallest cross-validation error is the one kept. Finally, the test error of the neural net is computed on the test dataset.
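A sketch of that split, assuming the roughly 2800 movies are stored in NumPy arrays X (features) and y (labels); the function and variable names are illustrative, not the code that actually produced the results below.

```python
import numpy as np

def split_dataset(X, y, seed=0):
    """Shuffle and split into ~60% train, ~20% cross-validation, ~20% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(0.6 * len(X))   # ~1600 movies for training
    n_cv = int(0.2 * len(X))      # ~500 movies for choosing lambda
    train, cv, test = np.split(idx, [n_train, n_train + n_cv])
    return (X[train], y[train]), (X[cv], y[cv]), (X[test], y[test])
```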
Results:
We vary the regularization parameter lambda in the cost function to obtain the following plot.


The above plot shows that the accuracy goes down (i.e., the error goes up) as lambda increases. This means that, to get the best result here, one should use the non-regularized cost function. The plot was produced with the cross-validation dataset, so the percentage error along the y axis is the cross-validation error. The test error turns out to be even smaller.
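The lambda sweep behind a plot like the one above can be sketched as follows; train_fn and error_fn are stand-ins for the actual training and cross-validation-error routines used for this post.

```python
import numpy as np

def pick_lambda(train_fn, error_fn, train_set, cv_set,
                lambdas=(0.0, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0)):
    """Train once per candidate lambda and keep the value with the
    smallest cross-validation error."""
    errors = []
    for lam in lambdas:
        model = train_fn(train_set, lam)        # minimize the regularized cost
        errors.append(error_fn(model, cv_set))  # error on the cross-validation movies
    best = int(np.argmin(errors))
    return lambdas[best], errors
```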
The problem from the perspective of logistic regression:
We will slightly change the problem statement. Instead of three possible outputs, we will assume there are only two possibilities as far as the prediction algorithm is concerned: good or bad. If a movie's rating is more than 6 we will call it a good movie; otherwise, the movie will be considered bad. We will use logistic regression to solve this problem.
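In code, that relabeling is a one-liner; the ratings array below is purely hypothetical.

```python
import numpy as np

ratings = np.array([7.2, 5.8, 6.4, 4.9])  # hypothetical IMDB ratings
labels = (ratings > 6).astype(int)        # 1 = good, 0 = bad
print(labels)                             # [1 0 1 0]
```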
Logistic regression:
First, the theory in brief: in logistic regression, for a given set of inputs there are only two possible outputs. Logistic regression has its own cost function, which involves the logarithm of a sigmoid function. In what follows, the logistic regression model is trained and the appropriate regularization parameter is determined. The process of choosing the boundary is admittedly largely visual. There are other methods, such as computing the cross-validation error and choosing the value of lambda for which that error is smallest; ironically, the problem with that approach is precisely its automation, because when it comes to judging whether a boundary is good, there is no better check than visual inspection.

The dataset is first normalized as follows: the mean and the standard deviation of the movie runtimes are computed, the mean is subtracted from every runtime, and the result is divided by the standard deviation. The normalized runtimes are then used instead of the original ones. The number of genre tags is normalized in the same way. Below, the normalized dataset, rather than the original one, is plotted along with the boundary line.
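Before the plot, here is a minimal sketch of that normalization together with a 6th-degree polynomial feature mapping and a regularized logistic regression fit. scikit-learn is used purely for illustration and is not necessarily what produced the figure, and the feature values are made up; in scikit-learn, C plays the role of 1/lambda.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

def normalize(col):
    """Z-score a feature column: subtract the mean, divide by the std."""
    return (col - col.mean()) / col.std()

# Hypothetical raw features: runtime (minutes) and number of genre tags
runtime = np.array([92.0, 120.0, 150.0, 85.0])
n_genres = np.array([2, 4, 3, 1])
y = np.array([0, 1, 1, 0])                 # 1 = good, 0 = bad

X = np.column_stack([normalize(runtime), normalize(n_genres)])
X_poly = PolynomialFeatures(degree=6).fit_transform(X)   # 6th-degree feature map

clf = LogisticRegression(C=1.0, max_iter=5000).fit(X_poly, y)
print(clf.predict(X_poly))
```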


The above diagram shows that the boundary encloses the space in which most of the red circles are concentrated. The prediction rate is about 80%.
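That prediction rate is simply the fraction of correctly classified movies; a sketch, assuming a fitted model and held-out features and labels along the lines of the previous snippet:

```python
import numpy as np

def accuracy(model, X, y):
    """Fraction of movies whose predicted label matches the true label."""
    return np.mean(model.predict(X) == y)

# e.g. accuracy(clf, X_poly_test, y_test) -> roughly 0.80 for the dataset in this post
```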


Conclusion:
In conclusion, we applied a non-linear logistic regression model to the IMDB movie dataset. We used two features of each movie, namely the runtime and the number of genre tags, to distinguish the good movies from the not-so-good ones. The goodness of a movie was determined by the rating it received. The boundary was obtained by constructing a feature function with a 6th-degree polynomial. The regularization parameter was chosen to be 1, since that gave the best boundary. Ideally, a good boundary is one that encloses most of the data points of one type, thereby separating them from the other type. The above diagram shows that this is, to some extent, the case. A good boundary depends as much on the dataset as on the algorithm, and the dataset for our problem makes a more accurate boundary very difficult to come by.
