"All models are wrong, but some are useful" : SEARCHING FOR “UNUSUAL” IMDB MOVIES


Introduction:

This post is a continuation of a previous post in which the IMDB movies were studied. Here we use the same data set, and our goal is to find out how the fraction of movies that stand out from the rest evolves over the years. The unusualness of a movie is determined from two features: the rating that the movie receives and the number of genres it is characterized by. The movie runtime is not considered because, as shown in the previous post, it is directly correlated with the movie rating, which makes runtime a redundant feature.

Theory:

The theory behind the study is that of anomaly detection, which was brought up in a previous blog post. Here we discuss it for the special case in which the variables are independent, because the two features in our problem, the rating and the number of genres, are treated as mutually independent.

Probability Distribution of Random Variables:

The probability density function (pdf) of a random variable gives the probability that the variable assumes values within a certain range. For example, if a random variable $x$ has a pdf $p(x)$, the probability that it will assume values between $a$ and $b$ is given by $\int_a^b p(x)\,dx$. A commonly used pdf is the Normal (Gaussian) distribution. A random variable is said to follow a Normal distribution with mean $\mu$ and standard deviation $\sigma$ if its pdf is given by:

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
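As a minimal sketch of the two ideas above, the Normal pdf and the probability of falling between two values can be written in Python with the standard library only. The function names (`normal_pdf`, `normal_interval_prob`, `integrate_pdf`) are our own and purely illustrative; they are not taken from the original analysis.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Normal(mu, sigma) random variable at the point x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_interval_prob(a, b, mu=0.0, sigma=1.0):
    """P(a <= X <= b) in closed form, using the error function."""
    def cdf(t):
        return 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))
    return cdf(b) - cdf(a)

def integrate_pdf(a, b, mu=0.0, sigma=1.0, steps=10_000):
    """Same probability, obtained by integrating the pdf with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(normal_pdf(a + (i + 0.5) * h, mu, sigma) for i in range(steps))

if __name__ == "__main__":
    # Probability of landing within one standard deviation of the mean
    print(normal_interval_prob(-1.0, 1.0))
    print(integrate_pdf(-1.0, 1.0))
```

Both calls should agree closely (about 0.683 for one standard deviation around the mean), which illustrates that the probability of a range is the area under the pdf over that range.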

The following figure shows three normal distributions. Their widths are all different. The width is controlled by the parameter $\sigma$: the bigger the $\sigma$, the larger the width. Two of the distributions are centered at 0, i.e. have $\mu = 0$, and the third one is centered around 0.5, i.e. $\mu = 0.5$. From the diagram it is seen that the pdf reaches its maximum at the mean $\mu$ and tapers off away from the mean.

Next we consider the joint probability distribution of two independent random variables $x_1$ and $x_2$. Let $x_1$ follow a Normal distribution with mean $\mu_1$ and standard deviation $\sigma_1$, and $x_2$ follow a Normal distribution with mean $\mu_2$ and standard deviation $\sigma_2$. Since they are assumed to be independent, their joint probability distribution is given by:

$$p(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2} \exp\left(-\frac{(x_1-\mu_1)^2}{2\sigma_1^2} - \frac{(x_2-\mu_2)^2}{2\sigma_2^2}\right)$$
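For independent variables the joint pdf is simply the product of the individual pdfs. A short Python sketch of this, with illustrative function names of our own choosing (`normal_pdf`, `joint_pdf`), might look like this:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma) random variable at the point x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def joint_pdf(x1, x2, mu1, sigma1, mu2, sigma2):
    """Joint density of two independent Normal variables: the product of the marginals."""
    return normal_pdf(x1, mu1, sigma1) * normal_pdf(x2, mu2, sigma2)

if __name__ == "__main__":
    # The density is largest at the means and falls off away from them.
    print(joint_pdf(0.0, 0.0, 0.0, 1.0, 0.0, 1.0))
    print(joint_pdf(1.0, 1.0, 0.0, 1.0, 0.0, 1.0))
```

With both means at 0 and both standard deviations at 1, the value at the point (0,0) is 1/(2π), the peak of the surface.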

The surface and contour plots of the above probability distribution function are shown below. The contours are circular, which happens when the two standard deviations are equal. The probability density is maximum at the point (0,0), whose two coordinates are the means of the two variables in this specific example, and it falls off as one moves further away from the center. The expression for the probability distribution of more than two variables is a straightforward extension of the two-variable result.
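To spell out that extension, assume $n$ mutually independent Normal variables $x_1, \dots, x_n$ with means $\mu_i$ and standard deviations $\sigma_i$ (this notation is ours). The joint pdf is then the product of the individual pdfs:

$$p(x_1, \dots, x_n) = \prod_{i=1}^{n} \frac{1}{\sigma_i\sqrt{2\pi}} \exp\left(-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right)$$

Setting $n = 2$ recovers the two-variable expression.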

In the next section we will explain how the concept of the multivariate probability distribution can actually be used to detect anomalies in a set of movies.

Detecting Anomaly From the Joint Probability Distribution:

Given the features of a movie, it is possible to construct a joint probability distribution in which the two features mentioned above correspond to the random variables $x_1$ and $x_2$. We use the value of the joint probability as a measure of the degree of anomaly: a small value indicates extremely anomalous behavior. The probabilities are calculated from the multivariate joint probability distribution, up to a normalization factor.

The above procedure is carried out only for those years in which the number of movies produced exceeded 20. It turns out that most of the movies made after 1978 belong to such years.

We estimate the fraction of ‘unusual’ movies made in a given year as follows. The probability values of the movies in a given year follow a frequency distribution. The anomalous movies are those whose probability is less than a certain small value, say one percent of the mean probability for that year. The number of anomalous movies in a given year is then divided by the total number of movies produced in that year to obtain the percentage of movies that are anomalous.

The net anomaly of a year is defined as the sum of the probabilities of all the movies produced in that year divided by the total number of movies, i.e. the mean probability for that year. The division by the total number of movies makes the result independent of the number of movies produced in each year.
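The procedure described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used for the plots: we assume the input is a list of hypothetical `(year, rating, n_genres)` tuples, that the mean and standard deviation of each feature are estimated separately for each year, and that "except for a normalization factor" means dropping the prefactor so that a movie sitting exactly at the yearly means gets a value of 1.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev

def unnormalized_prob(values, mus, sigmas):
    """Joint Gaussian density of independent features without the
    normalization prefactor (assumption: equals 1.0 at the means)."""
    return math.exp(-0.5 * sum(((v - m) / s) ** 2
                               for v, m, s in zip(values, mus, sigmas)))

def yearly_anomaly(movies, min_movies=20, threshold_frac=0.01):
    """movies: iterable of (year, rating, n_genres) tuples (hypothetical layout).
    Returns {year: (net_anomaly, anomalous_fraction)}."""
    by_year = defaultdict(list)
    for year, rating, n_genres in movies:
        by_year[year].append((rating, n_genres))

    results = {}
    for year, feats in by_year.items():
        if len(feats) <= min_movies:      # keep only years with more than 20 movies
            continue
        columns = list(zip(*feats))       # one tuple of values per feature
        mus = [mean(c) for c in columns]
        sigmas = [pstdev(c) for c in columns]
        if any(s == 0 for s in sigmas):   # a constant feature cannot be standardized
            continue
        probs = [unnormalized_prob(f, mus, sigmas) for f in feats]
        net_anomaly = mean(probs)         # sum of probabilities / number of movies
        cutoff = threshold_frac * net_anomaly
        anomalous_fraction = sum(p < cutoff for p in probs) / len(probs)
        results[year] = (net_anomaly, anomalous_fraction)
    return results
```

Calling `yearly_anomaly(movies, threshold_frac=0.05)` instead of the default would relax the definition of ‘unusual’, which is a convenient way to check how sensitive the yearly fractions are to the threshold.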

Results:

In the following figure we have plotted the net anomaly against years.

It can be seen from the above diagram that the net anomaly does not vary significantly over the years. The graph also suggests that the spread in the anomaly decreased as the years progressed: from around 2003 onward the net probability value varied only slightly, staying more or less between 0.5 and 0.53, with the exception of the current year, 2015, for which the dataset is not complete since the year is still in progress. Before 2003 the values varied more widely, between 0.44 and 0.6. Hence it may be concluded that there was more consistency in the production of anomalous movies after 2003 than before.

Next we plot the percentage of anomalous movies as a function of years. As can be seen in the graph below, the percentage decreases as the years progress. The characterization of the anomalous movies is somewhat arbitrary: the ‘small’ probability threshold used to produce the plot below is one percent of the mean probability value. It turns out that the pattern disappears as the probability threshold is increased.

Conclusion:

In this post we studied the anomaly of movies based on their features. The two features chosen for the movies were the rating and the number of genres. For each year a non-normalized joint probability distribution of the movies was constructed. The net probability, when plotted as a function of years, stayed more or less constant. But when the ratio of the number of movies whose probability value was less than one percent of the mean probability value to the total number of movies was plotted as a function of years, there was a clearly decreasing trend. From that it can be concluded that the share of really unusual movies decreased as the years progressed. This pattern vanished as the ‘unusualness’ restriction on the movies was relaxed.


Thursday, September 17, 2015
