Introduction
In this post we study the movies listed on the IMDB website. We collected the information by web scraping. For each movie we gathered its rating, genre, runtime and year of production, with the intention of studying the relationships between some of these features. Python was used for the web scraping and Matlab for the plotting. It’s worth mentioning that a movie can have more than one genre: the same movie can be tagged as both ‘romance’ and ‘action’, for example. That makes it harder to read the data systematically from a file, and it has been handled by appropriate Matlab code. The principal challenge in visualizing the results is the large number of movies spread over 70 or so years. We overcame this by binning the time axis and averaging the entries that fall in each bin. The results of the study are not always intuitive. The correlation between the rating and the runtime of a movie, discussed shortly, is one such example, and it is the most important finding of this study.
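To make the multi-genre issue concrete, here is a minimal Python sketch of reading such records. The file layout, the column names and the ‘|’ separator are assumptions made for illustration, not the format of the scraped data, and the three sample movies are invented.

```python
import csv
import io

# Hypothetical file layout (an assumption, not the format used in this study):
# title,year,runtime_min,rating,genres  with genres separated by "|".
SAMPLE = """title,year,runtime_min,rating,genres
Movie A,1995,128,7.9,action|romance
Movie B,1995,92,6.4,comedy
Movie C,2001,141,8.1,drama|war|romance
"""

def read_movies(handle):
    """Read movie rows and split the genre field into a list of genres."""
    movies = []
    for row in csv.DictReader(handle):
        movies.append({
            "title": row["title"],
            "year": int(row["year"]),
            "runtime": float(row["runtime_min"]),
            "rating": float(row["rating"]),
            # A movie may carry several genres, so keep them as a list.
            "genres": [g.strip().lower() for g in row["genres"].split("|") if g.strip()],
        })
    return movies

movies = read_movies(io.StringIO(SAMPLE))
for m in movies:
    print(m["title"], m["genres"])
```

Keeping the genres as a list per movie means that later counts can treat ‘action’ and ‘romance’ on the same movie as two separate genre occurrences.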
The relationship between the movie runtime and the rating
This section addresses how the experience of watching a movie, expressed through the rating a user gives, is affected by the movie’s runtime. On the face of it, one would expect no correlation between the runtime and the rating that a movie receives. We have come to a different conclusion, and we will get to it shortly.
First, a few words on how the dataset is reduced. Most movies run for anywhere between 70 and 240 minutes. After the runtimes are collected, the whole range from the minimum to the maximum runtime is divided into a fixed number of bins. For each bin, the average of the runtimes and the average of the ratings that fall in the bin are calculated. This is an effective way to reduce the dataset to a handful of points while still capturing any trend present in it. In the following plot, the bin-averaged ratings are plotted against the corresponding bin-averaged runtimes, and the points are then fitted with a curve. The increasing trend is clear, and it turns out to be quadratic rather than linear.
This is a rather interesting finding, since it is somewhat counterintuitive. It is generally expected that a movie is rated on its content rather than its length, yet our results show a positive correlation between the length of a movie and the rating it receives. We can only speculate why that should be the case. One reason may be that a director would not set out to make a long movie without being confident of their skill and of the movie’s chances of success. The trend can also be explained from a psychological perspective of the audience. It may be that people feel more satisfied the longer they get to spend in the movie theatre. They may think that the experience is a solid one because they were involved with the plot of the movie for a longer period of time.
Whatever the reason is, the graph below is an interesting finding.
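The binning and fitting steps described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `bin_average` and `fit_quadratic` are ours, the bins are assumed to be of equal width, the fit is an ordinary least-squares quadratic, and the numbers are made-up toy data, not the scraped IMDB values.

```python
def bin_average(runtimes, ratings, n_bins):
    """Split the runtime range into n_bins equal-width bins and average each bin."""
    lo, hi = min(runtimes), max(runtimes)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all runtimes are equal
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]  # runtime sum, rating sum, count
    for rt, rating in zip(runtimes, ratings):
        # The maximum runtime sits on the upper edge, so clamp it into the last bin.
        i = min(int((rt - lo) / width), n_bins - 1)
        sums[i][0] += rt
        sums[i][1] += rating
        sums[i][2] += 1
    # Empty bins are skipped.
    return [(s[0] / s[2], s[1] / s[2]) for s in sums if s[2] > 0]

def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x^2 via the normal equations."""
    S = [sum(x ** k for x in xs) for k in range(5)]                  # sums of x^k
    T = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]  # sums of y*x^k
    A = [[S[i + j] for j in range(3)] + [T[i]] for i in range(3)]    # augmented 3x3 system
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(3):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [v - f * p for v, p in zip(A[r], A[col])]
    return tuple(A[i][3] / A[i][i] for i in range(3))

# Made-up toy data purely for illustration (runtime in minutes, rating out of 10).
runtimes = [75, 88, 95, 102, 110, 118, 125, 140, 155, 170, 190, 220]
ratings = [5.9, 6.1, 6.0, 6.3, 6.4, 6.3, 6.6, 6.9, 7.1, 7.4, 7.8, 8.3]

points = bin_average(runtimes, ratings, n_bins=4)
# Work in hours rather than minutes to keep the normal equations well conditioned.
xs = [rt / 60 for rt, _ in points]
ys = [rating for _, rating in points]
a, b, c = fit_quadratic(xs, ys)
print("bin averages:", points)
print("fit coefficients a, b, c:", a, b, c)
```

With real data one would normally use a library routine for the fit. The hand-written solver is used here only to keep the sketch free of external dependencies.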
The most popular genres as a function of year
In this section we study the most frequently occurring genre in a given year, and how it changes over the years. As mentioned earlier, a movie can have several genres. We extracted the genre information for each movie and counted how many times a particular genre, say ‘action’, appears in a given year. The genre that appears the most times in a year is noted. We plotted a time series showing how the most frequently occurring genre changes as the years progress. We also plotted, for each year, the number of movies in that genre as a fraction of the total number of movies produced in the year. That allows one to compare the most frequently occurring genres across years. For example, consider two different years. Suppose that in the first year action is the most frequently occurring genre and 25% of all movies are characterized by it. In the second year romance is the most frequently occurring genre, with a whopping 90% frequency of occurrence. Romance in the second year then wins as the most frequent genre by a much larger margin than action does in the first year.
In the above diagram we see that drama has been the winning genre for most of the years. We also notice that more drama movies were made before 1970 than after. A couple of other interesting observations are as follows. In the 1940s almost 40% of the movies made were biographical. Around 1960 the frequency of comedy movies increased. Around the 1980s action movies stood out compared to the other genres. Around 1938 there was an increase in the number of war movies.
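The counting behind this section can be sketched in a few lines of Python. The function name `top_genre_per_year` and the records below are hypothetical and made up for illustration. The sketch assumes each movie is given as a (year, list of genres) pair.

```python
from collections import Counter, defaultdict

# Hypothetical (year, genres) records; the values are made up for illustration.
records = [
    (1995, ["action", "romance"]),
    (1995, ["action"]),
    (1995, ["drama"]),
    (1995, ["comedy", "drama", "action"]),
    (2001, ["drama", "war"]),
    (2001, ["drama"]),
]

def top_genre_per_year(records):
    """Return {year: (most frequent genre, fraction of that year's movies)}."""
    counts = defaultdict(Counter)  # year -> genre -> number of movies
    totals = Counter()             # year -> number of movies
    for year, genres in records:
        totals[year] += 1
        # Use a set so a genre listed twice on one movie is counted once.
        counts[year].update(set(genres))
    result = {}
    for year in sorted(counts):
        # Ties between genres are broken arbitrarily in this sketch.
        genre, n = counts[year].most_common(1)[0]
        result[year] = (genre, n / totals[year])
    return result

for year, (genre, share) in top_genre_per_year(records).items():
    print(year, genre, share)
```

Because a movie can carry several genres, the fractions for different genres in the same year can add up to more than one.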
Studying how frequencies of various genres change with time
In the following we will study how the number of movies for a given genre as a fraction of the total number of movies varies from year to year. We will show plots for various genres like romance, action, horror etc.
In the above diagram, we notice that the frequency of action movies increased and then saturated as the years progressed. The share of drama movies oscillated. It peaked around 1945, when about 30% of all movies belonged to the genre, and plunged to below 10% around 1990. Recently the share of drama movies seems to have saturated at about 20%.
In the above set of figures we see that the frequency of comedy movies increased steadily over the years and then saturated at about 13% per year. The rate of production of fantasy movies has been low, but recently it has been on the rise. Adventure movies show a somewhat decreasing trend. The trend for thrillers is just the opposite: it is on the rise.
In the above figure we observe that crime, mystery and sci-fi movies have reached a steady production rate over the years. The fraction of horror movies has remained somewhat lower, but recently it seems to be on the rise.
In the above figure, the fractions of movies in the genres ‘Music’ and ‘Musical’ increased and decreased respectively. Production of romantic movies has decreased somewhat compared to the past, but it has reached a steady rate. Animation movies make up a small fraction of the total number of movies, and the recent trend seems to be an increasing one as far as that genre is concerned.
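For readers who want to reproduce this kind of curve, here is a minimal Python sketch of the per-genre fraction and of the time-axis binning mentioned in the introduction. The function names, the five-year bin width and the records are all assumptions made for illustration. The plots in this post were produced in Matlab and may use a different bin width.

```python
from collections import defaultdict

def genre_fraction_by_year(records, genre):
    """Fraction of each year's movies that carry the given genre."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for year, genres in records:
        total[year] += 1
        if genre in genres:
            hits[year] += 1
    return {year: hits[year] / total[year] for year in total}

def average_over_bins(series, bin_width):
    """Average a {year: value} series over consecutive bins of bin_width years."""
    grouped = defaultdict(list)
    for year, value in series.items():
        grouped[year - year % bin_width].append(value)  # key is the bin's first year
    return {start: sum(v) / len(v) for start, v in sorted(grouped.items())}

# Hypothetical (year, genres) records; the values are made up for illustration.
records = [
    (1990, ["drama"]),
    (1990, ["action"]),
    (1991, ["drama", "romance"]),
    (1994, ["action"]),
    (1996, ["drama"]),
    (1996, ["comedy"]),
    (1996, ["action", "war"]),
    (1996, ["action"]),
]

drama = genre_fraction_by_year(records, "drama")
binned = average_over_bins(drama, bin_width=5)
print(binned)
```

Averaging the yearly fractions within a bin gives every year equal weight. Pooling all movies in the bin before dividing would weight busy years more heavily, which is an equally defensible choice.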
Conclusion
In this post we found a notable relationship between the rating that a movie receives and its runtime. Surprising (or maybe not) as it might sound, people tend to give a movie a higher rating if it runs longer. We also inspected which genre occurred most frequently among the movies of a given year, and plotted the most frequently occurring genres as fractions of the total number of movies against the years. This helped us visualize how the preference for genres changed over time. We also studied how the share of individual genres changed over time. As a possible future direction, one can use various features of a movie, like its runtime, genre type and number of genres, together with anomaly detection techniques to spot a movie that stands out from the other movies produced in the same year. One can also calculate the fraction of such ‘anomalous’ movies and plot it as a function of time. Another possible direction is a prediction algorithm for new movies. Using the features and the ratings of a large number of movies, one can train an algorithm to predict whether a movie with given features would get a good rating. Work along these lines will be pursued in future blog posts.