“I always pass on good advice. It is the only thing to do with it. It is never of any use to oneself.”
― Oscar Wilde
Recommendations produced by recommendation engines are mechanical advice – presented by a website to an individual about what they might like in the future.
As Oscar Wilde’s witticism observed, advice tends to suffer from a peculiar problem. It tends to be right in general, but wrong in the specific example of one’s own situation.
As the designers of recommendation systems, we spend a fair amount of time grappling with this conundrum. After all, we must supply recommendations that our users actually act on. So what do we mean by a good recommendation? What, in fact, is good advice?
If we were reflecting on good recommendations we might use terms such as accuracy, precision or sensitivity. In the field of merchandising, we might use terms such as diversity or coverage. If we were speaking more poetically: serendipity or novelty. In this article, we discuss what these terms mean in the context of recommendation systems.
While these terms may sound interchangeable, each of them has a mathematically precise definition. Once we are clear on what “good” is, it is relatively straightforward to choose the best algorithm. Our challenge, then, is to choose the right objective given the delicate balance between users, suppliers and commercial constraints.
At the core of any set of search results is our ability to predict whether a user is going to interact with (click on) a particular piece of content. The first question that comes to mind when reviewing such a model is: how accurate is our prediction?
Accuracy is simply a measure of how often our recommendation was correct. Did the user click on this article? Yes or No? Did we predict it correctly? We can understand all the permutations of predictions and actuals using the aptly named Confusion Matrix.
| | Predicted No Interaction | Predicted an Interaction |
|---|---|---|
| Didn’t Interact | True Negative | False Positive |
| Interacted | False Negative | True Positive |
This matrix shows all the possible results in a binary classification task such as predicting if a given user clicked on a particular listing when they viewed a webpage. From our website logs, we know the actual result, and from our model, we have our prediction. Accuracy is calculated as the ratio of successful predictions.
Accuracy = (True Positives + True Negatives) / All Predictions
One problem with the use of accuracy in recommendation systems is that the thing we are trying to predict (a user interacting with a particular piece of content) is often quite rare, on the order of 1% of views. The unbalanced nature of the prediction task means that apparently high levels of accuracy can be achieved with very simple models. In our example, a model that always predicted “No Interaction” would be right 99% of the time and yet would tell us nothing interesting about the problem at hand. For example, if 10 of 1,000 page views ended in an interaction, such a model would score 990 true negatives and 10 false negatives, an accuracy of 990 / 1000 = 99%.
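The arithmetic can be checked in a few lines of Python. This is a minimal sketch: the counts (10 interactions in 1,000 views) match the figures used in the sensitivity and precision example further on, and the `accuracy` helper is an illustrative name rather than part of any particular library.

```python
# Illustrative data: 1,000 page views, 10 of which ended in an interaction.
# 1 = the user interacted, 0 = the user did not.
actual = [1] * 10 + [0] * 990

# A trivial model that always predicts "No Interaction".
predicted = [0] * len(actual)


def accuracy(actual, predicted):
    """Share of predictions that match the logged outcome."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


print(accuracy(actual, predicted))  # 0.99
```

The model scores 99% without ever identifying a single interaction, which is why accuracy alone is a poor guide on unbalanced data.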
The confusion matrix throws up two more metrics that are worth understanding: Sensitivity and Precision.
Sensitivity = True Positives / (True Positives + False Negatives)
Sensitivity, also known as recall, is the percentage of actual interactions that are correctly identified. It’s how good we are at finding the things that the user is interested in.
Precision, on the other hand, is how often we are correct when we predict that there will be an interaction.
Precision = True Positives / (True Positives + False Positives)
There is an inherent tension between sensitivity and precision and we can usually trade off one for the other based on the preferences of the business. Returning to our original example, imagine this time that our model always predicted an “Interaction”. Our sensitivity (10 / 10) would be perfect, but our precision would be terrible (10 / 1000).
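The trade-off can be reproduced with a short sketch. The helper names (`confusion_counts`, `sensitivity`, `precision`) are illustrative, and the data is the same assumed set of 1,000 views with 10 interactions, this time scored against a model that always predicts an interaction.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels where 1 = interaction."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1
        elif p == 1 and a == 0:
            fp += 1
        elif p == 0 and a == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """True Positives / (True Positives + False Negatives)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """True Positives / (True Positives + False Positives)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


actual = [1] * 10 + [0] * 990      # 10 interactions in 1,000 views
predicted = [1] * len(actual)      # model that always predicts "Interaction"

tp, fp, fn, tn = confusion_counts(actual, predicted)
sens = sensitivity(tp, fn)
prec = precision(tp, fp)
print(f"sensitivity={sens}, precision={prec}")  # sensitivity=1.0, precision=0.01
```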
Typically businesses end up forming a view on their preference for accuracy, precision and sensitivity based on the downside of false predictions.
There is another problem inherent in accuracy measures. They tend to reward recommending items that are broadly popular, but of which the audience is already aware. Imagine a music recommendation system. It might recommend that people who like 90’s music will specifically like seminal Seattle grunge band Nirvana. The problem for the recommender is that, while it is accurate to say that people who like 90’s music will also express a preference for Nirvana, it’s not a particularly useful observation. Most of those people will already be aware of the band. They will have already formed their view, and so the recommendation is unlikely to change their behaviour. From the point of view of the website, there is little benefit in being “Captain Obvious” and recommending a band that the user already knows well.
So what are the alternatives to Captain Obvious’s fixation on accuracy? One approach is to focus on novelty, that is, to pick items that are unlikely to be known to a particular user. Using our music example, a novel recommendation for a lover of 90’s music would be to propose indie power-pop stalwarts Dramarama.
Novelty can be thought of as the inverse of popularity calculated either globally, or for the particular user segment. Novelty is great for adding variety into the search results, but it does tend to throw out weird, oddball recommendations.
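One simple way to turn “the inverse of popularity” into a number is to score each item as one minus the share of users who have already interacted with it. The sketch below assumes that definition (others, such as a logarithmic score, are also used in practice), and the listening histories are invented purely for illustration.

```python
from collections import Counter

# Hypothetical listening histories: user -> set of artists already played.
histories = {
    "ann": {"Nirvana", "Pearl Jam"},
    "bob": {"Nirvana", "Dramarama"},
    "cat": {"Nirvana", "Pearl Jam"},
    "dan": {"Nirvana"},
}


def novelty_scores(histories):
    """Novelty = 1 - (share of users who already interacted with the item)."""
    n_users = len(histories)
    counts = Counter(item for items in histories.values() for item in items)
    return {item: 1 - count / n_users for item, count in counts.items()}


scores = novelty_scores(histories)
for item in sorted(scores, key=scores.get, reverse=True):
    print(item, scores[item])
# Dramarama 0.75
# Pearl Jam 0.5
# Nirvana 0.0
```

Passing only the histories of one user segment to `novelty_scores` gives the segment-level variant rather than the global one.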
A different approach that doesn’t suffer from these oddball picks is Serendipity. In the context of a recommendation engine, we can think of this as the listing that is both novel and relevant to the user. Using serendipity we tend to choose content that we think is of particular interest to this user when compared to the average user. Back to our lovers of 90’s music example, a recommendation like Nirvana is punished under a serendipity measure because of their overall popularity.
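There are several competing formulations of serendipity. As a rough sketch of the idea described here, the example below scores an item by how much more relevant it is predicted to be for this user than for the average user, floored at zero. Both the formula and the relevance scores are assumptions made for illustration.

```python
def serendipity(user_score, average_score):
    """Extra predicted relevance for this user over the average user."""
    return max(user_score - average_score, 0.0)


# Hypothetical predicted relevance: item -> (this user, average user).
predicted = {
    "Nirvana": (0.90, 0.85),    # relevant, but to almost everybody
    "Dramarama": (0.60, 0.05),  # relevant mainly to this user
}

ranked = sorted(
    predicted,
    key=lambda item: serendipity(*predicted[item]),
    reverse=True,
)
print(ranked)  # ['Dramarama', 'Nirvana']
```

Under this scoring Nirvana drops down the list despite its high relevance, because the average user is predicted to like it almost as much.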
Until now, we have been thinking about recommendations through the lens of the user (which is always a good place to start) but we cannot ignore the perspective of the supply side. It is here that it is worth mentioning two other terms: diversity and coverage.
Diversity is a measure of how much variety is in our result set. A recommender that doesn’t consider diversity will tend to fixate on content that is too similar; for example, a fashion recommendation system throwing up the same dress multiple times in different colours. From a commercial perspective, we usually want to feature a variety of different suppliers to ‘spread the love’. By casting the net wider we reduce the chance that the user rejects the first item and then rejects all the subsequent items because they are basically the same as the first. Diversity is often calculated from the Intra-List Similarity score (the average similarity between all the pairs of items in the result set): the lower the similarity, the more diverse the list.
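A minimal sketch of an intra-list similarity calculation, assuming each item is described by a hypothetical feature vector and that cosine similarity is the pairwise measure (other measures would work equally well). Lower scores indicate a more varied list.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def intra_list_similarity(vectors):
    """Average cosine similarity over every pair of items in the list."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical one-hot features: [dress, shoes, jacket].
same_dress_three_colours = [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
mixed_outfit = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

ils_same = intra_list_similarity(same_dress_three_colours)
ils_mixed = intra_list_similarity(mixed_outfit)
print(ils_same, ils_mixed)  # 1.0 0.0
```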
Coverage is generally thought of as the percentage of catalogue items that the recommender ever shows. A recommender with perfect coverage would be able to recommend all of the items in the catalogue at different times. In practice, though, most recommenders fall short of this and tend to converge on a much smaller subset of products to recommend. This is usually a tell that the recommender is not doing a good job.
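Measured this way, coverage is a simple ratio of distinct items shown to catalogue size. The sketch below assumes we have logged the recommendation lists served to users; the item identifiers and the `catalogue_coverage` helper are hypothetical.

```python
def catalogue_coverage(recommendation_lists, catalogue):
    """Share of catalogue items that appear in at least one served list."""
    catalogue = set(catalogue)
    if not catalogue:
        return 0.0
    # Collect every distinct item shown, ignoring anything not in the catalogue.
    shown = set().union(*recommendation_lists) & catalogue
    return len(shown) / len(catalogue)


# Hypothetical catalogue of ten items and three logged recommendation lists.
catalogue = [f"item-{i}" for i in range(10)]
served = [
    ["item-0", "item-1"],
    ["item-0", "item-2"],
    ["item-1", "item-0"],
]

coverage = catalogue_coverage(served, catalogue)
print(coverage)  # 0.3
```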
Low coverage tends to occur when:
- There are genuinely stand-out products that are much better than, and do in fact dominate, the alternatives. (For example, it is hard to recommend older computer games given the improvements in graphics and computing power, so you would expect the recommender not to recommend these.)
- There are positive feedback loops. Successful products are recommended, giving them more exposure, resulting in more success. New products are unable to break into the cycle and so tend to be ignored by the recommender.
- Insufficient information on the user. If all users look the same, from the perspective of the recommendation system, then they will all receive the same recommendation. This is particularly a problem when trying to predict the behaviour of new users to a site or on a site which only has a small or incomplete history (for example fashion retailers who only see a small part of your overall fashion spend)
- Insufficient information on the products. If all the products look the same, then again there is no way for a recommender to discriminate. We often see this in categories like fashion, where the information that is interesting to consumers resides in the images (rather than the text descriptions or the metadata of the product)
As you can see, it is not easy to define what we mean by a “good” recommendation; there are many factors to consider. Each business approaches this challenge from its own commercial perspective, shaped by the needs of its customers and suppliers. This is why third-party, “out of the box” recommendation systems don’t tend to work well.
A business must come up with a rigorous definition of “good” before any algorithm can be implemented, and before we can say whether one algorithm is better than another.
Fortunately, the terms that we need to express ourselves exist, are well defined and provide a way for the business to state their needs in a way that data scientists can understand and implement.