Tuesday, September 29, 2015

CLASSIFICATION OF IMDB MOVIES

Introduction
This is the third blog post about the IMDB movie dataset. The goal for this post is to design a neural net and train it so that it learns to classify a movie as good, okay, or bad. The labels good, okay, and bad are based on the ratings that the movies receive. Once the neural net is designed, it will be tested against a test dataset.
The following three features are used as inputs to the neural network: the movie runtime, the number of genre tags applied to the movie, and the year the movie was produced.
Note: The choice of features is very important to the success of a neural net; they should carry enough information to achieve the classification of interest. Why were these three chosen? The runtime was shown in the first post of this series to correlate with the rating, and the year captures the temporal aspect of the problem (movie-goer preferences may change across generations).

The organization of the post is as follows: first, a brief description of neural nets is given, followed by the actual design, the testing, and the conclusion.


What is a neural net?
A neural net consists of a set of input features, one or more hidden layers, and finally an output layer. The goal is to predict the output given a set of inputs. To achieve that, the neural net first needs to be trained on the available dataset. The training data drives the selection of the coefficients, or connection weights, of the neural net, which are instrumental in predicting the output. Following is a schematic of a neural network.


In the above diagram there are three inputs, two outputs, and one hidden layer in the neural net. Each circle is a neuron. The numbers of neurons in the input layer, the hidden layer, and the output layer are 3, 4, and 2 respectively (bias units are excluded). Each neuron has a set of inputs and one output. A linear combination of the inputs is formed using the weighting coefficients; it is then passed through a sigmoid function, and the output of the sigmoid is used as one of the inputs of the next layer. Designing a neural net is therefore equivalent to finding the optimal coefficients by which the inputs of the various nodes are multiplied. The optimal coefficients are determined as follows: a cost function involving all the coefficients of the neural net is defined, and this cost function is then minimized with respect to those coefficients. The design is complete once the coefficients for which the cost function attains its minimum are found. Some of the design challenges are deciding on the number of hidden layers, the number of neurons in each layer, and the value of the regularization parameter, which must be chosen so as to avoid both overfitting and underfitting of the training data.
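As a concrete illustration, here is a minimal sketch in Python/NumPy of the forward pass just described, for the 3-4-2 architecture in the diagram. The weights here are random placeholders, not trained values, and the input is left unnormalized for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder weights for a 3-input, 4-hidden-neuron, 2-output net.
# Each layer's matrix has an extra column for the bias unit.
np.random.seed(0)
W1 = np.random.randn(4, 4)   # hidden layer: 4 neurons x (3 inputs + bias)
W2 = np.random.randn(2, 5)   # output layer: 2 neurons x (4 hidden + bias)

def forward(x):
    """Forward pass: linear combination -> sigmoid, layer by layer."""
    a1 = np.concatenate(([1.0], x))             # input with bias unit
    z2 = W1 @ a1                                # weighted sums into hidden layer
    a2 = np.concatenate(([1.0], sigmoid(z2)))   # hidden activations with bias
    z3 = W2 @ a2                                # weighted sums into output layer
    return sigmoid(z3)                          # output activations

x = np.array([120.0, 3.0, 1995.0])  # runtime, #genre tags, year
print(forward(x))
```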


Back to our problem
For our problem we choose three inputs, one hidden layer with two neurons, and three outputs. There are close to 2800 movies. About 60% of those, i.e., roughly 1600, we will use as the training dataset. Another 20%, i.e., roughly 500 movies, we will use as the cross-validation dataset, meaning we will determine the value of the regularization parameter lambda on this dataset: the value of lambda that gives the smallest cross-validation error will be chosen. Finally, the test error of the neural net is computed on the remaining test dataset.
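A minimal sketch of such a 60/20/20 split in Python (the placeholder arrays here stand in for the real features and labels):

```python
import numpy as np

# Placeholder data: runtime, #genre tags, year -> label 0/1/2 (bad/okay/good).
X = np.random.rand(2800, 3)
y = np.random.randint(0, 3, size=2800)

n = len(X)
idx = np.random.permutation(n)            # shuffle before splitting
n_train, n_cv = int(0.6 * n), int(0.2 * n)

train, cv, test = idx[:n_train], idx[n_train:n_train + n_cv], idx[n_train + n_cv:]
X_train, y_train = X[train], y[train]
X_cv,    y_cv    = X[cv],    y[cv]
X_test,  y_test  = X[test],  y[test]
```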
Results:
We vary the regularization parameter lambda in the cost function to obtain the following plot.
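For reference, a standard regularized cross-entropy cost of the sort being minimized here (the exact form used in this analysis may differ in detail) is:

$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Big[y_k^{(i)}\log h_k^{(i)} + \big(1-y_k^{(i)}\big)\log\big(1-h_k^{(i)}\big)\Big] + \frac{\lambda}{2m}\sum_{j}\Theta_j^2$$

where $m$ is the number of training examples, $K$ the number of output classes, $h$ the network's output, and the sum over $j$ runs over all non-bias weights. The parameter $\lambda$ is what is being varied in the plot.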


It is noted that in the above plot the accuracy goes down (i.e., the error goes up) as lambda increases. This means that in order to get the best result, one needs to use the non-regularized cost function. The above graph was produced with the help of the cross-validation dataset, so the percentage error displayed along the y axis is the cross-validation error. The test error turns out to be even smaller.
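A sketch of the lambda sweep that produces such a curve, using scikit-learn's MLPClassifier as a stand-in for the net (an assumption; the post's own training code is not shown, and its alpha parameter plays the role of lambda):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Placeholder train/CV arrays (in practice, the 60/20 split from above).
X_train, y_train = np.random.rand(1600, 3), np.random.randint(0, 3, 1600)
X_cv, y_cv = np.random.rand(500, 3), np.random.randint(0, 3, 500)

for lam in [0.0, 0.01, 0.1, 1.0, 10.0]:
    net = MLPClassifier(hidden_layer_sizes=(2,), alpha=lam, max_iter=2000)
    net.fit(X_train, y_train)
    cv_error = 100.0 * (1.0 - net.score(X_cv, y_cv))   # % misclassified on CV set
    print("lambda = %g, CV error = %.1f%%" % (lam, cv_error))
```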
The problem from the perspective of logistic regression:
We will slightly change the problem statement. Instead of having three possible outputs, we will assume there are only two possibilities as far as the prediction algorithm is concerned: good or bad. If the rating of a movie is more than 6 we will call it a good movie; otherwise, the movie will be considered bad. We will use logistic regression to find a solution to the problem.
Logistic regression:
First we will describe the theory in brief. In logistic regression, for a given set of inputs there are only two possible outputs. The cost function for logistic regression involves the logarithm of a sigmoid function. In the following, the logistic regression model is trained, and then the correct regularization parameter is determined. The process of determining the decision boundary is admittedly largely visual. There are other methods, such as computing the cross-validation error and choosing the value of the parameter lambda for which that error is minimum; the problem with that approach, however, lies in automating the process. When deciding on the correct boundary, there is no better way than visual inspection.

We first normalize the dataset in the following manner. The mean and the standard deviation of the movie runtimes are computed; the mean is subtracted from every runtime, and the result is divided by the standard deviation. Instead of the original runtimes, these normalized runtimes are used. The number of genre tags is normalized in the same way. In the following, the normalized dataset, rather than the original one, is plotted along with the boundary line.
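A minimal sketch of this pipeline using scikit-learn (an assumption; the post's own analysis was done in Matlab), with placeholder data standing in for the real movies. Note that scikit-learn's LogisticRegression takes C, the inverse of lambda; the 6th-degree polynomial feature mapping is the one mentioned in the conclusion below:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# Placeholder data: columns are runtime and number of genre tags;
# labels are 1 (rating > 6, "good") and 0 ("bad").
X = np.column_stack([np.random.uniform(70, 240, 2800),
                     np.random.randint(1, 8, 2800)])
y = np.random.randint(0, 2, 2800)

X_norm = StandardScaler().fit_transform(X)            # (x - mean) / std
X_poly = PolynomialFeatures(degree=6).fit_transform(X_norm)

clf = LogisticRegression(C=1.0)   # C = 1/lambda; lambda = 1 as chosen in the post
clf.fit(X_poly, y)
print("training accuracy:", clf.score(X_poly, y))
```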


The above diagram shows that the boundary encloses the space in which most of the red circles are concentrated. The prediction rate is about 80%.


Conclusion:
In conclusion, we applied a non-linear logistic regression model to the IMDB movie dataset. We used two features of the movies, namely the movie runtime and the number of genre tags a movie has, in order to separate the good movies from the not so good ones. The goodness of a movie was determined by the rating it received. The boundary was obtained by constructing a feature mapping with a 6th-degree polynomial. The regularization parameter was chosen to be 1, since that gave the best boundary. Ideally a good boundary is one that encloses most of the data points of one type, thereby separating them from the other type. From the above diagram we can see that this is the case to some extent. A good boundary depends as much on the dataset as on the algorithm, and the dataset for our problem makes coming up with a more accurate boundary all but impossible.

Thursday, September 17, 2015

SEARCHING FOR “UNUSUAL” IMDB MOVIES

Introduction:
This blog is a continuation of a previous blog in which the IMDB movies were studied. Here we use the same dataset, and our goal is to find out how the fraction of movies that stand out compared to the rest evolves over the years. The unusualness of a movie is determined from the following two features: the rating that the movie receives and the number of genres it is characterized by. Note that the movie runtime is not considered as a feature; the reason is that there is a direct correlation between the runtime and the rating, as was shown in the previous post, which renders the runtime redundant.
Theory:
The theory for this study is that of anomaly detection, which was brought up in a previous blog post. Here we discuss it for a special case in which the variables are independent, since the features in our problem, the rating and the number of genres, are taken to be mutually independent.
Probability Distribution of Random Variables:
The probability distribution function (pdf) of a random variable yields the information about the probability of the variable assuming values within a certain range. For example, if a random variable $X$ has a pdf $f(x)$, the probability that it will assume values between $a$ and $b$ is given by $\int_a^b f(x)\,dx$. A commonly used pdf is the Normal (Gaussian) distribution. A random variable is said to follow a Normal distribution with mean $\mu$ and standard deviation $\sigma$ if its pdf is given by:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

The following figure shows three normal distributions. Their widths are all different; the width is controlled by the parameter $\sigma$, and the bigger the $\sigma$, the larger the width. Two of the distributions are centered at 0, i.e. have $\mu = 0$, and the third one is centered around 0.5, i.e. $\mu = 0.5$. From the diagram it is seen that the pdf reaches its maximum at the mean $\mu$ and tapers off away from it. Next we consider the joint probability distribution of two independent random variables $X_1$ and $X_2$. Let $X_1$ follow a Normal distribution with mean $\mu_1$ and standard deviation $\sigma_1$, and $X_2$ a Normal distribution with mean $\mu_2$ and standard deviation $\sigma_2$. Since they are assumed to be independent, their joint probability distribution is given by:

$$f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2}\, e^{-\frac{(x_1-\mu_1)^2}{2\sigma_1^2} - \frac{(x_2-\mu_2)^2}{2\sigma_2^2}}$$


In the following, the surface and the contour plots of the above probability distribution function are shown. The contours are circular. The probability is maximum at the point (0, 0), the two coordinates being the means of the two variables in this specific example, and it falls off as one moves farther and farther from the center. The expression for the probability distribution of more than two variables is a straightforward extension of the two-variable result.
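As a quick numerical illustration of the factorized joint pdf above (a sketch using SciPy, not the post's own code):

```python
from scipy.stats import norm

# Two independent standard normals: X1 ~ N(0, 1), X2 ~ N(0, 1).
mu1, sigma1 = 0.0, 1.0
mu2, sigma2 = 0.0, 1.0

def joint_pdf(x1, x2):
    """Joint pdf of independent variables = product of the marginals."""
    return norm.pdf(x1, mu1, sigma1) * norm.pdf(x2, mu2, sigma2)

print(joint_pdf(0.0, 0.0))  # maximum, at the means: 1/(2*pi) ~ 0.159
print(joint_pdf(2.0, 2.0))  # much smaller away from the center
```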
In the next section we will explain how the concept of the multivariate probability distribution can actually be used to detect anomalies in a set of movies.


Detecting Anomalies From the Joint Probability Distribution:
Given the features of the movies, it is possible to construct a joint probability distribution in which the two features mentioned above correspond to the random variables $X_1$ and $X_2$. For our problem we will use the value of the joint probability as a measure of the degree of anomaly, a small value indicating extremely anomalous behavior. The probabilities (up to a normalization factor) are calculated from the joint probability distribution. The procedure is carried out only for those years in which the number of movies produced exceeded 20; it turns out that most of the movies made after 1978 fall in that category.

We estimate the fraction of 'unusual' movies made in a given year in the following manner. The probability values of the movies in a given year follow a frequency distribution. The anomalous movies are those whose probabilities are less than a certain small threshold, say one percent of the mean probability for that year. After identifying the anomalous movies, their number is divided by the total number of movies produced in that year to obtain the percentage of anomalous movies. The net anomaly of a year is defined as the sum of the probabilities of all the movies produced in that year divided by the total number of movies; the division by the total number of movies makes the result independent of how many movies were produced in each year.
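A sketch of this per-year procedure in Python (hypothetical variable names and placeholder data; the post's own analysis was done in Matlab):

```python
import numpy as np
from scipy.stats import norm

def year_anomaly_stats(ratings, n_genres):
    """For one year's movies, return (net anomaly, % of anomalous movies)."""
    # Fit independent normals to the two features for this year.
    p = (norm.pdf(ratings, ratings.mean(), ratings.std()) *
         norm.pdf(n_genres, n_genres.mean(), n_genres.std()))
    net_anomaly = p.sum() / len(p)        # sum of probabilities / #movies
    threshold = 0.01 * p.mean()           # 1% of the mean probability
    pct_anomalous = 100.0 * (p < threshold).mean()
    return net_anomaly, pct_anomalous

# Placeholder data for one year with more than 20 movies:
ratings = np.random.normal(6.0, 1.0, 50)
n_genres = np.random.normal(3.0, 1.0, 50)
print(year_anomaly_stats(ratings, n_genres))
```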
Results:
In the following figure we have plotted the net anomaly against years.
It can be seen from the above diagram that the net anomaly does not vary significantly over the years. From the graph it also seems that the spread in the anomaly decreased as the years progressed: from around 2003 onward the net probability varied only slightly, staying more or less between 0.5 and 0.53, with the exception of the current year, 2015. (The dataset for 2015 is not complete, since the year is still in progress.) The probability values before 2003 varied rather wildly, between 0.44 and 0.6. Hence it may be concluded that there was more consistency in the production of anomalous movies after 2003 than before.

Next we plot the percentage of anomalous movies as a function of the years. As can be seen in the graph below, the percentage decreases as the years progress. The characterization of anomalous movies is somewhat arbitrary; the 'small' probability threshold used to produce the plot below is one percent of the mean probability value. It turns out that the pattern disappears as the probability threshold is increased.
Conclusion:
In this post we studied the anomaly of movies based on their features. The two features chosen were the rating and the number of genres. For each year a non-normalized joint probability distribution of the movies was constructed, and the probabilities were plotted as a function of the years. The net probability, when plotted as a function of the years, stayed more or less constant. But when the ratio of the number of movies whose probability was less than one percent of the mean probability to the total number of movies was plotted as a function of the years, there was a clear decreasing trend. From that it can be concluded that the fraction of really unusual movies decreased as the years progressed. This pattern vanished as the 'unusualness' restriction on the movies was relaxed.

Tuesday, September 1, 2015

Studying IMDB Movies

Introduction
In this post we study the movies that appear on the IMDB website. We have collected information about the movies by web-scraping; the information gathered about a movie is its rating, genre, runtime, and year of production. The intention is to study the relationships between some of these features. Python has been used for the web-scraping and Matlab for the plotting. It is worth mentioning that a movie can have more than one genre as a feature; explicitly, the same movie can carry two different genres such as 'romance' and 'action'. That adds to the complexity of reading the data systematically from a file, and has been taken care of by appropriate Matlab coding. The principal challenge in visualizing the results is the large number of movies distributed over some 70 years. This has been overcome by binning the time axis appropriately and then taking the average of the entries in each bin. The results of the study are not always intuitive; the correlation between the rating and the runtime of the movies, discussed shortly, is one such example, and it is the most important of all the findings of this study.
The relationship between the movie runtime and the rating
In this section we address how the user experience of watching a movie, expressed in terms of the rating that the user gives, is affected by its runtime. On the face of it, one might expect no correlation between the runtime and the rating a movie receives. We have come to a different conclusion, which we will get to in due time.

First, a few words on the reduction of the dimensionality of the dataset. The runtime for most of the movies lies anywhere between 70 and 240 minutes. After the runtime data is collected, the whole range from the minimum to the maximum runtime is divided into a fixed number of bins. For each bin, the average of the runtimes and the average of the ratings falling in that bin are calculated. This is an effective way to reduce the dimensionality of the dataset and yet capture any trend that may be present in it. In the following plot, the bin-averaged ratings are plotted against the corresponding bin-averaged runtimes, and the dataset is fitted with a curve. The increasing trend is obvious, and it turns out to be not linear but quadratic.

This is a rather interesting finding, since it is somewhat counterintuitive: one generally expects a movie to be rated on its content rather than its length, yet our results show a direct correlation between the length of a movie and the rating it receives. It is hard to say why that should be the case. One reason may be that unless a director is confident of his skill, and of the fact that the movie has a high probability of success, he would not go about making a long movie. It can also be explained from the audience's perspective: people may tend to feel more satisfied the longer they get to spend in the movie theatre, thinking the experience a solid one because they were involved with the plot for a longer period of time. Whatever the reason, the graph below is an interesting finding.
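A sketch of the bin-averaging and fit described above, in Python/NumPy (an assumption; the post's plots were made in Matlab), with placeholder data in place of the scraped values:

```python
import numpy as np

# Placeholder data: runtimes in minutes and ratings on a 1-10 scale.
runtimes = np.random.uniform(70, 240, 2800)
ratings = np.random.uniform(1, 10, 2800)

n_bins = 20
edges = np.linspace(runtimes.min(), runtimes.max(), n_bins + 1)
which_bin = np.digitize(runtimes, edges[1:-1])   # bin index for each movie

# Bin-averaged runtime and rating for each non-empty bin.
avg_runtime = np.array([runtimes[which_bin == b].mean()
                        for b in range(n_bins) if (which_bin == b).any()])
avg_rating = np.array([ratings[which_bin == b].mean()
                       for b in range(n_bins) if (which_bin == b).any()])

# Quadratic fit to the bin-averaged points.
coeffs = np.polyfit(avg_runtime, avg_rating, deg=2)
print("quadratic fit coefficients:", coeffs)
```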






The most popular genres as a function of year.
In this section we study the most frequently occurring genre in a given year, and how that changes over the years. As mentioned earlier, a movie can have several genres. The genre information for the various movies has been extracted, and a count has been made of how many times a particular genre, say 'action', appears in a given year. The genre which appears the greatest number of times in a year is noted. I have plotted a time series showing how the most frequently occurring genre changes as the years progress. I have also plotted the number of movies in that genre as a fraction of the total number of movies produced in each year, which allows one to compare the most frequently occurring genres across years. For example, consider two different years. Assume that in the first year action is the most frequently occurring genre, characterizing 25% of the total number of movies, while in the second year romance has a whopping 90% frequency of occurrence and is also the most frequently occurring genre. Then romance wins in the second year as the most frequently occurring genre by a much larger margin than action does in the first year.
In the above diagram we see that drama has been the winning genre for most of the years. It is also noticeable that more drama movies were made before 1970 than after. A couple of other interesting observations are as follows: around the 1940s almost 40% of all the movies made were biographical; around 1960 the frequency of comedy movies increased; it was around the 1980s that action movies stood out compared to the other genres; and around 1938 there was an increase in the number of war movies.
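A sketch of how such per-year genre counts might be computed, assuming the scraped data is available as (year, list-of-genres) pairs (a hypothetical structure; the post's counting was done in Matlab):

```python
from collections import Counter, defaultdict

# Hypothetical scraped records: (year, genres) per movie.
movies = [(1995, ["Drama", "Romance"]),
          (1995, ["Action", "Drama"]),
          (1996, ["Comedy"])]

genre_counts = defaultdict(Counter)   # year -> genre occurrence counts
totals = Counter()                    # year -> number of movies

for year, genres in movies:
    genre_counts[year].update(genres)
    totals[year] += 1

for year in sorted(genre_counts):
    genre, count = genre_counts[year].most_common(1)[0]
    print(year, genre, "fraction: %.2f" % (count / totals[year]))
```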



Studying how frequencies of various genres change with time.
In the following we will study how the number of movies of a given genre, as a fraction of the total number of movies, varies from year to year. We will show plots for various genres like romance, action, horror, etc.

In the above diagram it is noticed that the frequency of action movies increased and then saturated as the years progressed. The number of movies in the drama genre oscillated; their production peaked around 1945 or so, when about 30% of all movies belonged to the genre, and then plunged to below 10% around 1990. Recently the production rate of drama movies seems to have reached a saturation of about 20%.




In the above set of figures we see that the frequency of comedy movies reached a saturation of about 13% after steadily increasing over the years. The rate of production of fantasy movies has been low, though recently it has been on the rise. Adventure movies show a somewhat decreasing trend; the trend for thrillers is just the opposite, being on the rise.
In the above figure it is observed that the crime, mystery, and sci-fi genres have reached steady production rates over the years. The fraction of horror movies has remained somewhat low, but recently it too seems to be on the rise.


In the above figure, the fractions of movies of the genres 'Music' and 'Musical' increased and decreased respectively. The production of romantic movies has decreased somewhat compared to the past, but has reached a steady rate. Animation movies make up a small fraction of the total number of movies, and the recent trend for that genre seems to be an increasing one.

Conclusion
In this post we discovered an interesting relationship between the rating that a movie receives and its runtime: surprising (or maybe not) as it might sound, people tend to give a movie a higher rating if it runs longer. We also inspected which genre occurred most frequently in the movies of a given year, and plotted the most frequently occurring genres as fractions of the total number of movies against the years. This helped us to visualize how the preference for genres changed over time. We also studied specific genres over time. As a possible future direction, one could use the various features of a movie, like its runtime, genre type, number of genres, etc., to spot a movie which stands out compared to the other movies produced in the same year, with the help of anomaly detection techniques. One could also calculate the fraction of such 'anomalous' movies and plot it as a function of time. Another possible future work would be a prediction algorithm for new movies: using the features and the ratings of a large number of movies, one could train an algorithm to predict whether a movie with such and such features would receive a good rating. Work along these lines will be pursued in future blog posts.