A Rounded Evaluation of Recommender Systems

EvalRS: Evaluating Recommender Systems on Many Tests

Federico Bianchi
Aug 11, 2022


This article describes a novel data and code challenge that is currently running: EvalRS.

We decided to organize EvalRS with friends from Coveo, Microsoft, and NVIDIA, to better understand evaluation in Recommender Systems.

Anybody can join the challenge. There are prizes for best systems, best ideas and best student work. There is also the opportunity of presenting systems at the CIKM conference.

If you are interested in EvalRS and you want to join the challenge you can follow these links:

This blog post covers a general introduction, the motivations, and some Python code for EvalRS.


We all love RecSys.

Netflix, Spotify, LastFM, and Amazon all need recommender systems to improve the user experience and suggest items effectively. Indeed, if you are online, you have surely been in close contact with a Recommender System. And if you are reading this now, you have probably used Medium's Recommender System!

Spotify Apps. Photo by Heidi Fin on Unsplash.

Recommender Systems are probably one of the most popular machine learning systems in production; moreover, recommenders are probably the machine learning products that have the closest contact with the actual users, since they are generally used in an interactive fashion.

This is why testing is fundamental in Recommender Systems. Failing to detect low performance in some cases can bring reputational damage to a company.

However, from the point of view of development, testing recommenders is very difficult: online vs offline, retrieval vs ranking, and so on. Sure, there are evaluation metrics, as is common in the ML field.

Recommenders are often evaluated on standard point-wise metrics like HITS and MRR, which assess how accurate predictions are, on average, on held-out data points.
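To make these two metrics concrete, here is a minimal sketch in plain Python. The function names and the toy data are made up for illustration, and the sketch assumes one held-out track per user; it is not the challenge's evaluation code.

```python
def hit_rate_at_k(predictions, targets, k=100):
    """Fraction of users whose held-out item appears in their top-k list."""
    hits = sum(
        1 for user, target in targets.items()
        if target in predictions.get(user, [])[:k]
    )
    return hits / len(targets)


def mean_reciprocal_rank(predictions, targets, k=100):
    """Average of 1/rank of the held-out item (0 when it is not in the top-k)."""
    total = 0.0
    for user, target in targets.items():
        top_k = predictions.get(user, [])[:k]
        if target in top_k:
            total += 1.0 / (top_k.index(target) + 1)
    return total / len(targets)


# Toy data: ranked track ids per user, and one held-out track per user.
predictions = {"u1": [10, 20, 30], "u2": [40, 50, 60], "u3": [70, 80, 90]}
targets = {"u1": 20, "u2": 40, "u3": 99}

hr = hit_rate_at_k(predictions, targets, k=3)
mrr = mean_reciprocal_rank(predictions, targets, k=3)
```

Both numbers are averages over users, which is exactly why they can hide what happens for specific groups or specific mistakes.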

The question is, are standard point-wise metrics enough? The answer is probably no.

Don’t get me wrong, these metrics are very important: you need them to assess how your recommender works on average. However, there is more to evaluation than just averages.

For example, some recommendations are better than others. How can we account for this? Generally, evaluation requires a more fine-grained approach in which you study the data you have.

Ultimately, you need to know the data very well.

Rounded Evaluation?

“Rounded evaluation”? My friend Jacopo discussed the limitations of standard evaluation in his blog post.

Here I will briefly mention two aspects of the evaluation that are important to consider:

  1. Performance over slices. Once you train your model, does it perform equally well on UK users and US users?
  2. Being less wrong. When you predict the wrong item for a user, how wrong are you? Suggesting a horror movie to a user who wants to watch a lighthearted movie is a bad idea. However, suggesting either a comedy or a romantic movie might be appreciated!
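The first point can be sketched in a few lines: instead of one global number, we compute the hit rate separately for each slice. The helper name, the country labels and the toy data below are assumptions made for illustration.

```python
from collections import defaultdict


def hit_rate_by_slice(predictions, targets, user_slice, k=100):
    """Hit rate computed separately for each user group (e.g., country)."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for user, target in targets.items():
        group = user_slice[user]
        counts[group] += 1
        if target in predictions.get(user, [])[:k]:
            hits[group] += 1
    return {group: hits[group] / counts[group] for group in counts}


# Toy data: the overall hit rate is 0.75, but the two slices differ.
predictions = {"u1": [1, 2], "u2": [3, 4], "u3": [5, 6], "u4": [7, 8]}
targets = {"u1": 2, "u2": 3, "u3": 9, "u4": 8}
user_country = {"u1": "UK", "u2": "UK", "u3": "US", "u4": "US"}

per_country = hit_rate_by_slice(predictions, targets, user_country, k=2)
```

In this toy case the model looks fine on average while serving one group of users noticeably worse than the other.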

Our EvalRS challenge is built around this exact way of thinking: there is more to evaluation than just MRR and HITS. Thus we focus on evaluating Recommender Systems using a more fine-grained approach.

We have a dataset (see next sections) with tons of metadata that can be studied and analyzed from different perspectives.


How can we quickly introduce better testing in recommender systems? Well, we need RecList for this. RecList is a novel Python package we have designed to support the evaluation of recommender systems.

RecList allows us to focus on writing models and let automated pipelines deal with the evaluation. EvalRS scripts are built on top of RecList to simplify the challenge for participants.

The Challenge

The EvalRS dataset and general challenge focus on the LastFM dataset (Schedl, 2016), available for non-commercial purposes, which contains interactions between users and songs.

This challenge is focused on three methodological pillars that are important to describe:

Avoid public leaderboard overfitting

If the leaderboard is public and multiple experiments can be run, it might be easy to just search for the best configuration by using the test set as a guide.

To this end, we use Bootstrapped k-fold CV instead of a fixed holdout test set. This will make it more difficult to overfit on the leaderboard and will allow us to provide a better evaluation methodology.
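To give an intuition of why this helps, here is a generic sketch of bootstrapped folds. It is an illustration under our own assumptions (users sampled with replacement for each fold, last listen held out as the test item), not the exact protocol used by the challenge scripts.

```python
import random


def bootstrap_folds(interactions, n_folds=4, sample_frac=0.5, seed=0):
    """Yield (train, test) pairs; each fold draws its own random user sample.

    `interactions` maps a user id to that user's time-ordered track ids.
    """
    rng = random.Random(seed)
    users = sorted(interactions)
    n_sampled = max(1, int(len(users) * sample_frac))
    for _ in range(n_folds):
        # Sampling with replacement: folds may overlap and there is
        # no single fixed test set to tune against.
        sampled = {rng.choice(users) for _ in range(n_sampled)}
        train, test = {}, {}
        for user in sampled:
            history = interactions[user]
            train[user] = history[:-1]  # everything but the last listen
            test[user] = history[-1]    # held-out listen to predict
        yield train, test


# Toy data: ten users with three listens each.
interactions = {f"u{i}": [i, i + 1, i + 2] for i in range(10)}
folds = list(bootstrap_folds(interactions, n_folds=3))
```

Since every run can see a different sample of users, tuning a model against one particular split pays off much less.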

Avoid high-consumption ensemble solutions that cannot be deployed

Since this is a code competition, the final step of the evaluation will provide a fixed compute budget. This allows us to evaluate recommender systems that can be used and deployed in practice.

Avoid single metric chasing

We standardize a benchmark with many metrics that includes fairness and behavioral testing. This is going to make our evaluation much more complete.

Using this setup, we hope to ensure results are comparable and fair.

The Dataset

The LastFM dataset contains 37M interactions and it is rich in metadata. The task will be “suggesting to users new tracks to listen to”.

For users, we have demographic information, gender (binary), the time when they registered, and many more additional features.

User Dataframe. Image by author.

For each track we have access to artists and albums, so we can learn higher-level patterns (e.g., we can count how many albums an artist has and use this as an additional feature for our model).

Track Dataframe. Image by author.
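The album-count feature mentioned above could be computed with pandas roughly as follows. The column names (`track_id`, `artist_id`, `album_id`) and the rows are assumptions for illustration; check the actual dataframe before reusing this.

```python
import pandas as pd

# Toy stand-in for the track table; column names are assumptions.
tracks = pd.DataFrame({
    "track_id": [1, 2, 3, 4, 5],
    "artist_id": [10, 10, 10, 20, 20],
    "album_id": [100, 100, 101, 200, 200],
})

# Count the distinct albums of each artist and attach the count to every
# track, so that a model can use it as an additional feature.
tracks["n_albums"] = tracks.groupby("artist_id")["album_id"].transform("nunique")
```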

Finally, the most valuable part of the entire dataset: the interactions. This portion of the data contains the interactions between users and tracks, with the timestamp of when the user listened to the song. This will be our core element in building our recommender.

Interaction Dataframe. Image by author.

The dataset is super interesting: the amount of information available makes sure there is plenty that can be done to tackle the evaluation challenges. For example, look at the distribution of music consumption per hour of the day.

Music consumption by hour of the day. Image by author.
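A distribution like this one can be obtained by bucketing the interaction timestamps by hour. The sketch below assumes the timestamps are Unix epoch seconds interpreted in UTC, which may not match the actual column format; the values are toy data.

```python
from collections import Counter
from datetime import datetime, timezone


def listens_per_hour(timestamps):
    """Return a 24-element list with the number of listens per hour of day."""
    hours = (
        datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts in timestamps
    )
    counts = Counter(hours)
    return [counts.get(hour, 0) for hour in range(24)]


# Toy timestamps: two listens around midnight and two around 1 AM (UTC).
timestamps = [1356998400, 1357002000, 1357002600, 1357084800]
histogram = listens_per_hour(timestamps)
```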

We can also explore the dataset and look at which songs we have there. I personally like “Enter Shikari”, and with a quick pandas search, I can find all the songs that have been incorporated into this dataset.

Examples of Enter Shikari songs in the dataset.
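The search itself is a one-line boolean filter. The dataframe below is a toy stand-in: the column names are assumptions, and only the id 18581 comes from the dataset (the other ids are placeholders).

```python
import pandas as pd

# Toy rows; apart from 18581, the ids are placeholders.
tracks = pd.DataFrame({
    "track_id": [18581, 1, 2],
    "track": ["Sorry You're Not a Winner", "Mothership", "Some Other Song"],
    "artist": ["Enter Shikari", "Enter Shikari", "Another Artist"],
})

# Keep only the rows whose artist matches exactly.
enter_shikari = tracks[tracks["artist"] == "Enter Shikari"]
```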

The Challenge Approach

Thanks to RecList and our wrappers you do not have to write much code. You just need to design your model. As long as your model is consistent with our API, you are good to go. The evaluation script is going to take your model as input and handle all the processes for you.

The only thing that is needed to join the challenge: a model. The structure is very simple. Image by author.

Once you have a model, you can simply pass it to our high-level APIs. These will take care of the training and the evaluation for you and will give you the results. They will also take care of pushing the results onto the leaderboard for you!

Our evaluator will run several tests, from tests on data partitions to tests on the actual diversity and quality of your recommendations (e.g., the “being less wrong” one).
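To give a feeling for what a “being less wrong” test can measure, here is one possible formulation: for every wrong prediction, we compute the cosine distance between the predicted item and the true item in some vector space. This is a sketch under our own assumptions (toy two-dimensional vectors, top-1 predictions only), not the test implemented in the challenge.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def mean_miss_distance(top1, targets, item_vectors):
    """Average distance between wrong top-1 predictions and the true items."""
    distances = [
        cosine_distance(item_vectors[top1[user]], item_vectors[target])
        for user, target in targets.items()
        if top1[user] != target
    ]
    return sum(distances) / len(distances) if distances else 0.0


# Toy vectors: romance is close to comedy, horror is far from it.
item_vectors = {"comedy": [1.0, 0.0], "romance": [0.8, 0.6], "horror": [0.0, 1.0]}
targets = {"u1": "comedy", "u2": "comedy"}

near_miss = mean_miss_distance({"u1": "romance", "u2": "romance"}, targets, item_vectors)
far_miss = mean_miss_distance({"u1": "horror", "u2": "horror"}, targets, item_vectors)
```

Both models have a hit rate of zero here, yet the one suggesting romance is clearly “less wrong” than the one suggesting horror.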

A CBOW-Based RecSys

Let’s try to solve the challenge! We will create a very simple baseline based on Word2Vec, which we will call CBOWRecSys.

Latent Space

Let’s first start with our latent space.

Our CBOWRecSys will make internal use of the CBOW algorithm: we essentially take users’ interactions with tracks in sequential order. Then, we run Word2Vec on this “corpus”.

Hopefully, the co-occurrence of tracks will allow us to learn a reliable vector space in which similar songs are close together.

Word2vec is used to generate the latent space. Image by author.
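Building the “corpus” amounts to grouping interactions by user and sorting them by time. The sketch below uses plain tuples and made-up rows; the helper name and the (user, track, timestamp) layout are assumptions.

```python
from collections import defaultdict


def build_corpus(interactions):
    """Turn (user, track, timestamp) rows into one time-ordered 'sentence' per user."""
    by_user = defaultdict(list)
    for user_id, track_id, timestamp in interactions:
        by_user[user_id].append((timestamp, track_id))
    # Track ids are cast to strings so that they can serve as "words".
    return {
        user_id: [str(track_id) for _, track_id in sorted(listens)]
        for user_id, listens in by_user.items()
    }


# Toy rows, deliberately out of order.
interactions = [("u1", 18581, 30), ("u1", 7, 10), ("u2", 7, 5), ("u1", 9, 20)]
corpus = build_corpus(interactions)

# With gensim installed, the sentences could then be passed to Word2Vec, e.g.
# gensim.models.Word2Vec(sentences=list(corpus.values()), sg=0),
# where sg=0 selects the CBOW training algorithm.
```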

Is this space meaningful? Well, let’s see… We can ask the model which songs are the most similar to “Sorry You’re Not a Winner” (id 18581 in the dataset) by Enter Shikari.

Songs most similar to “Sorry You’re Not a Winner” (id 18581 in the dataset) by Enter Shikari. Image by author.

These all make sense, since they are all Enter Shikari songs.

One point worth discussing is that this space has low diversity: what if I wanted a song that is similar to “Sorry You’re Not a Winner”, but not from Enter Shikari? Well, this is a bit more complex, but it is also something we have tried to model in the challenge; read more about this here.


Finally, we can discuss our recommender. CBOWRecSys will be very simple, for the sake of computational time. For each user, we will:

  • randomly pick a set of tracks from their listening history;
  • generate a user vector representation by taking the mean of the track vectors;
  • search for the tracks most similar to the user’s representation.
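The three steps above can be sketched with NumPy as follows. This is an illustration that assumes the track vectors are already available as a matrix; the function name, parameters and toy vectors are made up, and the actual CBOWRecSys code may differ.

```python
import numpy as np


def recommend(history, track_ids, track_vectors, n_sample=2, top_k=3, seed=0):
    """Recommend tracks close to the mean vector of a random sample of the history."""
    rng = np.random.default_rng(seed)
    index = {track: i for i, track in enumerate(track_ids)}
    # 1. Randomly pick a few tracks from the listening history.
    positions = rng.choice(len(history), size=min(n_sample, len(history)), replace=False)
    picked = [history[i] for i in positions]
    # 2. The user vector is the mean of the picked track vectors.
    user_vec = track_vectors[[index[track] for track in picked]].mean(axis=0)
    # 3. Rank all tracks by cosine similarity to the user vector.
    norms = np.linalg.norm(track_vectors, axis=1) * np.linalg.norm(user_vec)
    scores = track_vectors @ user_vec / np.clip(norms, 1e-12, None)
    ranked = np.argsort(-scores)
    return [track_ids[i] for i in ranked[:top_k]]


# Toy space: tracks 1 and 2 point one way, tracks 3 and 4 the other.
track_ids = [1, 2, 3, 4]
track_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
recs = recommend([1, 2], track_ids, track_vectors, n_sample=2, top_k=2)
```

Note that this sketch does not filter out tracks the user has already listened to; whether to do so is a design choice left to the model.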


Ok, let’s quickly look at the implementation of this model. Code is commented and mimics the process we have just discussed. You can also look at this directly on the Kaggle Notebook.

Code for our CBOWRecSys model to solve the CIKM 2022 EvalRS Challenge.

You see that the code is very simple. The only thing we really need to take care of is the return statement in the prediction. We need to return a pandas DataFrame that has a row for each user and 100 columns with the top-k recommendations for each user.
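As an illustration of that output shape, the helper below turns a dictionary of ranked track ids into a DataFrame with one row per user and k columns. The column names, the use of the index for user ids and the padding value are assumptions; check the challenge documentation for the exact expected format.

```python
import pandas as pd


def to_prediction_frame(recommendations, k=100, fill_value=-1):
    """Build a (n_users x k) DataFrame from {user_id: ranked track ids}."""
    rows = {
        # Truncate long lists and pad short ones so every row has k entries.
        user: (list(tracks)[:k] + [fill_value] * k)[:k]
        for user, tracks in recommendations.items()
    }
    frame = pd.DataFrame.from_dict(
        rows, orient="index", columns=[str(i) for i in range(k)]
    )
    frame.index.name = "user_id"
    return frame


# Toy recommendations: one list that is too short and one that is too long.
recommendations = {"u1": [5, 6, 7], "u2": list(range(200))}
frame = to_prediction_frame(recommendations, k=100)
```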

Running The Evaluation

Once we have the model, running the evaluation is very simple. We have a wrapper class for this. You can see the results commented in the following gist.

Evaluating our new RecSys.

This simple model already gets good results on some of the tests. For example, we have a HITS@100 of 0.01816 on the first fold. This is not too bad considering 1) how big the dataset is and 2) how simple the implemented baseline is.

The submission now appears on the leaderboard.


You can find all the code in the starting Kaggle notebook! Come and join the challenge! You may get the chance to present your work at a top-tier conference, meet new people in the RecSys community, share your insights through open source and, of course, win a prize: we award models that perform well, but also novel ideas for testing and outstanding student work.

The model we have defined was pretty simple. However, our friend Gabriel from NVIDIA has provided an implementation of a two-tower system that you can use to join the challenge. This is a good starting point if you want to fire up a GPU and implement a deep recommender!


I would like to thank our sponsors, Comet, Neptune and Gantry, for their support in further developing RecList.



Federico Bianchi

Stanford University. NLP, Machine Learning and Artificial Intelligence. https://federicobianchi.io