Sunday, January 31, 2021

De(re)constructing Restaurant Ratings from Reviews on Yelp


Almost everyone these days uses Yelp, a web-based application that lets users rate their experience of a business or service. I often use Yelp to select a restaurant. Yelp displays the average rating of a restaurant as a number of stars, computed from the ratings given by reviewers who chose to write about their experiences at that restaurant. Each review also has a ‘useful’ button, whose count reflects how many people found that particular review helpful. The ‘useful’ votes provide important information about the quality of a review, and hence about the relevance of its rating. Yet they are summarily ignored in the average-rating calculation Yelp currently uses. I set out to design a new algorithm for calculating the effective rating of a restaurant based upon: 
  • the actual number of stars for a review, 
  • the reported usefulness of the review, and 
  • the spread of the ratings about its mean value.

Step 1: Get the data with webscraping
Given below is a flow diagram describing how webscraping was implemented to collect the raw data for this analysis:


Note that each restaurant pulled into this study had at least 100 reviews on Yelp. Some restaurants had many more reviews than others; for those, I limited myself to about 400 reviews.
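To make the flow diagram concrete, here is a minimal sketch of the scraping step, assuming the requests and BeautifulSoup libraries. The URL handling and the CSS selectors are illustrative placeholders only; Yelp's actual page markup changes over time and has to be inspected in the browser.

```python
import time
import requests
from bs4 import BeautifulSoup

def scrape_reviews(restaurant_url, max_reviews=400, page_size=20):
    """Collect (stars, useful_count, text) tuples for one restaurant's reviews."""
    reviews = []
    for start in range(0, max_reviews, page_size):
        page = requests.get(restaurant_url, params={"start": start})
        soup = BeautifulSoup(page.text, "html.parser")
        # Placeholder selectors: the real class names must be read off Yelp's HTML.
        for block in soup.select("div.review"):
            stars = float(block.select_one("div.i-stars")["aria-label"].split()[0])
            useful_tag = block.select_one("span.useful-count")
            useful = int(useful_tag.get_text(strip=True) or 0) if useful_tag else 0
            text = block.select_one("p.comment").get_text(strip=True)
            reviews.append((stars, useful, text))
        time.sleep(1)  # be polite between page requests
    return reviews
```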

Why should we object to a plain average as a candidate for the overall rating?
The wide distribution of the per-restaurant standard deviations of review ratings implies that the plain average is pretty meaningless.
The figure above shows the distribution of the standard deviation of review ratings, computed per restaurant. The peak occurs at a standard deviation larger than 1, a rather high value for ratings confined to the range 1 to 5. From the figure we notice that about 10% of the restaurants have very high standard deviations (2 or above), while less than 5% have very low standard deviations. This accentuates the weakness of the plain average ratings Yelp displays on its website today.

Step 2: Using a Weighted Mean
A review is of good quality, and reliable, if people find it useful; it should therefore count more toward the effective rating of the restaurant than a review that nobody finds useful. With this logic, I used the ‘useful’ count of a review as the weight of its rating and computed a weighted mean, rather than a plain mean, as the effective rating. An underlying assumption is that people who find a review useful also agree with the rating given by its reviewer. So the total weight of a particular review was taken to be #useful + 1, where the +1 counts the reviewer. The weights thus obtained for each rating were then used to compute the average.
Here is a worked-out example. Suppose a restaurant has 3 reviews, with ratings of 2 stars, 5 stars and 1 star respectively. Suppose further that the numbers of people finding the 2-star and 5-star reviews useful are 3 and 10 respectively, while the 1-star review received no ‘useful’ votes. The weighted average for that restaurant is calculated as:
$$\bar{r}_w \;=\; \frac{\sum_i w_i\, r_i}{\sum_i w_i} \;=\; \frac{(3+1)\cdot 2 \;+\; (10+1)\cdot 5 \;+\; (0+1)\cdot 1}{(3+1) + (10+1) + (0+1)} \;=\; \frac{8 + 55 + 1}{16} \;=\; 4,$$

where $r_i$ is the rating of review $i$ and $w_i = \#\mathrm{useful}_i + 1$ is its weight. Had one taken the average without the weights (the way it is calculated on Yelp), one would have obtained 2.66 instead of 4, a rather significant mismatch.
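As a quick check of the arithmetic, a few lines of NumPy reproduce both numbers (the variable names are mine, chosen for illustration):

```python
import numpy as np

ratings = np.array([2, 5, 1])    # stars given by the three reviews
useful  = np.array([3, 10, 0])   # 'useful' votes received by each review
weights = useful + 1             # +1 includes the reviewer in the count

print(np.average(ratings, weights=weights))  # weighted mean: 4.0
print(round(ratings.mean(), 2))              # plain mean: 2.67
```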


Is the Weighted Mean making a Difference?
The unweighted average ratings from Yelp for the restaurants under consideration were binned, and a further average was computed for each bin. The same procedure was applied to the weighted averages of the ratings and to the standard deviations. So, with 20 bins, this method produces three vectors of length 20: the binned weighted averages, the binned unweighted averages, and the binned standard deviations. The unweighted averages are plotted as solid red circles, with the standard deviations indicated as error bars in the same diagram; the dotted black line is y = x.
Means may stay the same weighted versus unweighted, but the standard deviations tell the inner story.
What is this plot telling us? Had the filled red circles fallen exactly on the y = x line, the weighted and unweighted averages would be exactly equal. Here we see that the red circles are almost always slightly below the dotted line, indicating that, more often than not, the unweighted Yelp average is somewhat overestimated. Interestingly, although the weighted and unweighted means do not differ much, the standard deviations are far from negligible. Somehow the spread in the ratings conspires to keep the weighted mean close to the unweighted one, but that is clearly not the whole story.
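For readers who want to reproduce this kind of plot, here is a rough sketch, assuming the per-restaurant weighted means, unweighted means and standard deviations have already been computed; plotting the unweighted mean against the weighted mean is my reading of the figure, and empty bins simply come out as NaN and are skipped by matplotlib.

```python
import numpy as np
import matplotlib.pyplot as plt

def binned_comparison_plot(weighted_means, unweighted_means, stds, n_bins=20):
    """Bin the per-restaurant statistics and compare weighted vs. unweighted means."""
    edges = np.linspace(1, 5, n_bins + 1)
    idx = np.digitize(weighted_means, edges[1:-1])        # bin index 0..n_bins-1
    w  = np.array([weighted_means[idx == i].mean()   for i in range(n_bins)])
    u  = np.array([unweighted_means[idx == i].mean() for i in range(n_bins)])
    sd = np.array([stds[idx == i].mean()             for i in range(n_bins)])
    plt.errorbar(w, u, yerr=sd, fmt='ro')                 # solid red circles with error bars
    plt.plot([1, 5], [1, 5], 'k:')                        # dotted y = x reference line
    plt.xlabel('binned weighted average rating')
    plt.ylabel('binned unweighted (Yelp) average rating')
    plt.show()
```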

Step 3: Incorporating the Standard Deviation
Using the weights corresponding to the ratings, one can also calculate the standard deviation. Intuitively, the standard deviation measures how the values are spread about their mean. A large standard deviation indicates that the ratings are scattered widely around the mean, whereas a small standard deviation indicates that the ratings lie close to the mean, which is then a faithful representative of the rating distribution. The standard deviation for the example given above is calculated as follows:
$$\sigma_w \;=\; \sqrt{\frac{\sum_i w_i (r_i - \bar{r}_w)^2}{\sum_i w_i}} \;=\; \sqrt{\frac{4(2-4)^2 + 11(5-4)^2 + 1(1-4)^2}{16}} \;=\; \sqrt{\frac{36}{16}} \;=\; 1.5,$$

where all the notations used have been introduced before.
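The same small NumPy example verifies this value (NumPy has no built-in weighted standard deviation, so it is written out explicitly):

```python
import numpy as np

ratings = np.array([2, 5, 1])
weights = np.array([3, 10, 0]) + 1          # 'useful' votes + 1, as before

w_mean = np.average(ratings, weights=weights)                           # 4.0
w_std  = np.sqrt(np.average((ratings - w_mean) ** 2, weights=weights))
print(w_std)                                                            # 1.5
```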


Standard deviations are not used at all in the ratings Yelp currently presents, so there is no way to know how reliable a Yelp average is. For example, a Yelp mean of 4 with a standard deviation of 3 would be much less reliable than the same mean with a much lower standard deviation, say 0.5. We should be able to define a measure of reliability that takes both the mean and the standard deviation into account. We will come back to this point later.

What is the relationship between the Average Rating and the Standard Deviation?
Here is a study of the per-restaurant standard deviation as a function of the weighted average rating, to help us visualize how the spread evolves with the weighted mean.
The higher the rating, the better the agreement among customers on the satisfaction level
In the figure above, the standard deviation clearly shows a decreasing trend when plotted as a function of the weighted average rating. This is an interesting result that warrants some discussion. For average ratings below 3.5 the standard deviations are scattered all over the place; above 3.5 the standard deviation declines steadily as the average rating goes up. This implies that for restaurants with high Yelp ratings, the consensus on the rating is nearly unanimous, whereas for restaurants with poor ratings, opinions differ considerably, which inflates the standard deviation. Keep in mind that lower ratings do not automatically imply larger standard deviations: had all the reviewers given a restaurant similarly low ratings, its standard deviation would have been small even though its average rating was low. Hence the plot above brings out a really interesting truth about the customer experience in a restaurant: the higher the rating of the restaurant, the better the agreement among its customers on the satisfaction level, reflected in a low standard deviation.

Step 4: Using both the Weighted Average & the Standard Deviation via F1 Scores
A large standard deviation undermines the reliability of a plain average, so we need a metric that penalizes high standard deviations and rewards low ones. The F1 score, the harmonic mean of two quantities, is used in the machine learning community when one needs a single number that is high only if both quantities are high, and low if either of them is low. This maps directly onto our problem: we take the weighted mean as one quantity and the inverse of the standard deviation as the other. The F1 score is thus defined in the following manner:
$$F_1 \;=\; \frac{2 \cdot M \cdot (1/\sigma)}{M + 1/\sigma} \;=\; \frac{2M}{M\sigma + 1},$$

where M is the weighted mean and σ is the weighted standard deviation. The F1 score increases if the mean increases and the standard deviation decreases, which is a desirable feature. Also, the F1 score decreases as σ increases for a given M.
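In code, the per-restaurant score is then a one-liner (a sketch; the small epsilon guarding against a zero standard deviation is my own addition, since the post does not say how that edge case is handled):

```python
def f1_score(weighted_mean, weighted_std, eps=1e-6):
    """Harmonic-mean style combination of the weighted mean and 1/std."""
    inv_std = 1.0 / (weighted_std + eps)
    return 2.0 * weighted_mean * inv_std / (weighted_mean + inv_std)

print(round(f1_score(4.0, 1.5), 2))   # the example restaurant above: ~1.14
print(round(f1_score(4.0, 0.5), 2))   # same mean, tighter spread gives a higher score: ~2.67
```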

Next, we mapped the F1 scores so obtained back to the rating range of 1 to 5 by the following mechanism. We chose 5 percentile bins between 0 and 100 to represent the five rating categories, so that a chosen fraction of the F1 scores falls in each bin. (Recall that the x-th percentile of the F1 scores is the value below which x% of all F1 scores lie.) We identified the F1 score values corresponding to the edges of our predefined bins, namely the 10th, 30th, 70th and 90th percentiles, as captured in this table:

Rating 1: F1 score below the 10th percentile.
Rating 2: F1 score between the 10th and 30th percentiles.
Rating 3: F1 score between the 30th and 70th percentiles.
Rating 4: F1 score between the 70th and 90th percentiles.
Rating 5: F1 score above the 90th percentile.
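A sketch of this percentile mapping, assuming the F1 scores of all restaurants are collected in a NumPy array (function and variable names are illustrative):

```python
import numpy as np

def f1_to_star_rating(f1_scores):
    """Map each F1 score to a 1-5 rating using the 10/30/70/90 percentile edges."""
    edges = np.percentile(f1_scores, [10, 30, 70, 90])
    # np.digitize returns 0..4 for values below/between/above the edges,
    # which maps directly onto ratings 1..5.
    return np.digitize(f1_scores, edges) + 1
```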

The above rating system can be described pictorially with the help of the following histogram plot:
From F1 Scores back to a 1 - 5 Rating
Try it out: the new rating 3 in the diagram has a normalized frequency of 0.4, i.e. 40%, which is consistent with it containing all the values between the 30th and 70th percentiles (70% − 30% = 40%). Of course the choice of the bin edges is arbitrary, but any grading system has some arbitrariness to it. The histogram shows that our mechanism is symmetric about the middle rating of 3, which contains the majority of the F1 scores. Categories 2 and 4 each hold half as many F1 scores as category 3: a rather reasonable choice, which also contributes to the nice bell shape of the distribution.

How does the F1 Score based Rating Perform?
In the following, various ranges of Yelp scores are considered. For a given Yelp score range, the F1-score-based ratings are calculated and a histogram of those new ratings is constructed. These plots are rather remarkable, since they show how strikingly different the new ratings can be from the conventional rating system. Take, for example, the category of Yelp ratings between 2 and 3: the first figure shows that about 60% of those restaurants end up with a rating of 2 in the F1-score-based system. For the Yelp score range 3 to 4, a not insignificant 15% of the restaurants end up with a rather poor rating of 1 in the new system. For the range 4 to 5, less than 25% keep a rating of 5 in the new system.
Serious re-distribution of Rating Values using the more logical F1 Score based system!

Hence the F1-score-based rating system produces results that are non-trivial. Because it is also based on intuitive reasoning involving both the mean and the standard deviation, it can provide a very useful rating system for restaurants. This rating will be much closer to a restaurant's actual performance than what can possibly be captured by the simple calculation of an unweighted average.

Does the length of the review mean anything?
In this section, I'd like to share some results I find interesting: how does the length of a review relate to its rating? I went back to using the plain average ratings at this point. Defining a good Yelp rating as 4 or 5, and a poor or bad Yelp rating as 1 or 2, I computed the ratio of the average number of words per ‘good’ review to the average number of words per ‘bad’ review, as follows:
$$R \;=\; \frac{\text{average word count of reviews with rating 4 or 5}}{\text{average word count of reviews with rating 1 or 2}}$$
We say more, when we are saying positive things!
We notice that this ratio is almost always larger than one and, furthermore, increases with the Yelp rating. This reveals something rather curious about us as reviewers: we are happy to write at length when we have good things to say!
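A minimal sketch of that calculation, assuming each restaurant's reviews are available as (rating, text) pairs (the names below are illustrative):

```python
def good_to_bad_length_ratio(reviews):
    """Ratio of the mean word count of 4-5 star reviews to that of 1-2 star reviews."""
    good = [len(text.split()) for rating, text in reviews if rating >= 4]
    bad  = [len(text.split()) for rating, text in reviews if rating <= 2]
    if not good or not bad:
        return None  # the ratio is undefined without both kinds of reviews
    return (sum(good) / len(good)) / (sum(bad) / len(bad))
```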

Code