From blog posts to books, sorting a list of items by user rating is a very common task. In this blog post, we discover how the writings of an obscure 18th-century British minister lets us implement a “smart” ranking algorithm that takes into account the “wisdom of the crowd”, and presents a fairer and more accurate representation of results.
The Rating Problem
Let’s pretend for the moment that you’ve hankerin’ to pick up a few fresh tips on ASP.NET 3.5. You visit a bookstore and type “ASP.NET 3.5” into the search box. You obtain the results shown in Figure 1.1
This site, like many others, allows users to rate products on a scale of 1 to 5. The average customer rating is displayed when browsing these products. (I’ve included both image-based and numerical displays for the purposes of this discussion). By sorting the results in order of customer rating as shown, you hope to discover which books are most popular among your peers.2
Examining Figure 1, it would appear that ASP.NET 3.5 Website Programming: Problem - Design – Solution is the most popular among these books, with a perfect 5-star rating; and that Pro ASP.NET 3.5 in C# 2008 is the least popular among the rated books, with but a single star to call its own. But, as I’m sure you already suspect, the average rating alone doesn’t tell us the whole story.
Of course, e-commerce sites understand this too. That’s why you’ll almost always find, in addition to the average customer rating, the number of customers who contributed to that rating, as shown in Figure 2.
Leveraging the Wisdom of the Crowd
Knowing the number of times an item has been rated makes the results much more meaningful to us – but have you stopped to consider exactly why this is so? Why do we lend more weight to a rating when an item has been rated by more individuals? This may appear to the casual observer as the manifestation of a “bandwagon effect” of some kind, but that’s not really the case. As it turns out, this is the result of an intuitive mental calculation that mirrors an actual real-word phenomenon - a true example of crowd intelligence in action. What’s more, as we shall soon see, this phenomenon can be modeled in software.
We see from Figure 2 that ASP.NET 3.5 Website Programming: Problem - Design – Solution has the highest average customer rating, and therefore resides at the top of the list. However, since it has only been rated once, we are less inclined to trust this result. Intuitively, we understand that with a small number of ratings, the probability of an anomalous rating (an “outlier” in statistical terms) having a significant impact on the average is greatly increased. In other words, a small number of ratings increases the chance that the overall average rating may be “inaccurate” – not arithmetically inaccurate of course, but inaccurate in the sense that the data doesn’t reflect a true consensus of typical, unbiased readers.
Now take a look at some of the other results. We see that ASP.NET 3.5 Unleashed appears lower on the list by virtue of its lower average rating of 4.8. Nevertheless, we instinctively lend much more weight to its rating data, because far more customers have rated it. If we’re laying down our hard-earned money, it would appear that ASP.NET 3.5 Unleashed is the surer bet (all other things being equal, of course). That 4.8 rating may not be the highest on the list, but a rational observer would invariably conclude that this is a more reliable figure, based on the relatively large proportion of the “crowd” which contributed to it.
Including the number of ratings lets us evaluate the average rating data of each individual item not only on its own, but in relation to all the other items. The more times an item has been rated, the more believable the data, and thus the more relative weight we place upon its average rating.
Wouldn’t it be great if we could somehow leverage the intelligence of the crowd, and reflect it in the results? What if we could rank and sort the results based not solely on the raw average rating, but on the believability of the average rating? Such a ranking algorithm would more accurately reflect how we perceive each item’s ratings in relation to each other. Using this algorithm, an item which received a single 5-star rating wouldn’t immediately jump to the top of the list; likewise, an item that received a single poor rating wouldn’t immediately be forced to the bottom. The “good stuff” (items which receive mostly high ratings from a relatively large proportion of the crowd) would appear near the top, while the “not-so-good” (again, as determined by the crowd) would be pushed lower.
As it turns out, we can develop such an algorithm with the help of an 18th-century British Presbyterian minister named Thomas Bayes.
Like many learned men of his day, Reverend Bayes harbored a keen interest in mathematics. In an essay discovered and published in 1764 – three years after his death – he presents a special case of what eventually came to be known as Bayes’ theorem. Today, Bayes theorem is widely accepted among most statisticians. Applications of it can be found throughout many fields in science and engineering, from the evaluation of drug test results, to predicting the outcome of elections, to the development of spam-filtering software.
Bayes’ theorem states in part that probabilities are rationally coherent degrees of belief, or a degree of belief in a proposition given a body of well-specified information.3 Therefore, Bayes' theorem can be thought of as a representation of how an ideally rational individual would interpret the believability of an outcome, based on a series of prior outcomes.
Bayes’ theorem can be stated mathematically in terms of how the probability of an outcome A given B, is related to the converse probability of outcome B given A.
Applying Bayes’ Theorem to the Rating Problem
A Bayesian average is a method of calculating the mean of a set of data that is consistent with Bayes’ theorem.
- C is a “weighting” constant, proportional to the typical size of the data set
- m is the arithmetic mean taken over all known data
- n is the size of the current data set
For our particular purposes, we can calculate the Bayesian average of each of our books by applying the formula above. This lets us derive a rating that is based on the “believability” of the raw data.
BA = ((AvgNumOfRatingsForAll * AvgRatingForAll) + TotalRating) / (RateCount + AvgNumOfRatingsForAll)
- RateCount is the number of ratings given to this book
- TotalRating is the sum of all ratings given to this book
- AvgNumOfRatingsForAll is the average number of ratings for all books shown (where RateCount > 0)
- AvgRatingForAll is the average unweighted rating for all books shown (where RateCount > 0)
In Figure 3, the results are now sorted by their Bayesian average. The star displays also reflect the calculated Bayesian average.
If you compare these results to those shown in Figure 2, several conclusions can be drawn:
- The Bayesian average is a truer reflection of how the books are rated in relation to each other.
- The Bayesian rank for books with a relatively small number of ratings tends to gravitate toward the average rating of all books.
- The more ratings a book has received, the more “believable” the ratings are. Therefore, when a book has received a relatively large number of ratings, its Bayesian rank gravitates toward its unweighted average rating.
Keeping it Fair
As pointed out, the Bayesian average formula uses a constant C that determines how much “correction” is applied to the raw, unweighted average. By examining the formula, you’ll note that C is a “dampening” value that determines the “sensitivity” of the calculation.
This means that the higher the value of C, the more customer ratings it will take to influence the Bayesian average. In other words, C can be thought of as the number of ratings it will take to have a significant impact on the relative ranking of an item. Higher values of C make the results less sensitive to the determination of the crowd, while lower values of C do the opposite. This is something you might want to take into account when determining what value to use for C.
In practice, this value should be proportional to the typical size of the data set. In our example, we used the average number of ratings given to all rated items in the search result. This works well in most cases in which new items are added on a regular and ongoing basis, such as an inventory of retail items, blog posts, songs, videos, downloadable applications, and the like.
However, consider the case in which new items are added relatively infrequently. When the existing items already have a high number of ratings, a new item will have to receive many ratings in order to move up or down in the list. In your particular application, if you want to give new items a chance to move more quickly, you can reduce the value of C by a given factor, like so:
BA = ((AvgNumOfRatingsForAll * 0.75 * AvgRatingForAll) + TotalRating) / (RateCount + (AvgNumOfRatingsForAll * 0.75))
When sorting a list of items by the “raw” average of the ratings it has received, we often get a distorted view. Results are easily skewed by a small number of ratings (or even a single rating) given to any particular item.
In order to provide a more accurate picture, we need a smarter rating system – one that can accurately quantify the “wisdom of the crowd.” This is something we naturally do when the size of the “crowd” is known to us. When an item has received many ratings, that data should be considered more “reliable,” and therefore its ranking in the list should more closely reflect the crowd’s decision. When an item has received very few ratings, its ranking should approximate the average rating for all the items shown.
We can use Bayes’ theorem to derive a smart rating algorithm to accomplish this task. Furthermore, the weighting applied by the Bayesian average formula can be adjusted to suit different purposes, thus making new items more sensitive to the influence of the crowd.
Download the Code
If you’re interested in playing around with this on your own, you can download the solution/database I used to create the screenshots in this post. You can mess around with the database values and/or the weighting constant in order to see how different scenarios affect the results.
1. Book cover images courtesy of Amazon.com.
2. All screenshots are for illustrative purposes only, and should not be construed as an actual reflection of the popularity or quality of these titles in “real life,” okay?