RecList 2.0: Open-Source Systematic Testing of ML Models

A new RecList to provide more flexibility and better support for evaluation

Federico Bianchi
Towards Data Science


Introduction

Evaluation is a complex matter. It is often hard to manage the different components involved in writing evaluation pipelines: your model lives somewhere, you need to load it, then fetch the test data, then run the tests, and so on and so forth.

And then? Well, you have to save the results somewhere, and maybe log the outputs online so you can keep track of them.

As this is always a laborious procedure, we recently tried to provide a more structured way to do testing. In this blog post, we introduce RecList beta, our open-source package for evaluation, and show how to use it; RecList is a general plug-and-play approach to scale up testing, with an easy-to-extend interface for custom use cases. RecList is an open-source project freely available on GitHub.

RecList allows you to separate the evaluation portion of your code and encapsulate it in a class that handles several other things (e.g., storage and logging) automatically for you.

RecList offers a simple way to systematize testing and save all the information you need after you have trained your model.

We started working on RecList a couple of years ago and the alpha version of RecList came out a bit more than a year ago. Since then, RecList has collected over 400 GitHub stars.

We have used RecList and stress-tested it to run a RecSys challenge at CIKM in 2022, and we are currently preparing for an upcoming one at KDD 2023. RecList allowed us to systematize the evaluation for all the participants: the idea is that, once everybody is provided with the same RecList, comparing different evaluations becomes easy. A summary of our experience appears in our Nature Machine Intelligence comment piece.

RecList was originally introduced in an academic paper, and we also presented a general overview in an earlier Towards Data Science post. The original paper is:

Chia, P. J., Tagliabue, J., Bianchi, F., He, C., & Ko, B. (2022, April). Beyond nDCG: Behavioral Testing of Recommender Systems with RecList. In Companion Proceedings of the Web Conference 2022 (pp. 99–104).

While we originally designed RecList for recommender system testing, nothing prevents you from using RecList to test other machine learning models. So, why is there a new blog post? Well, after developing the first version we realized it needed some updates.

What Did We Learn: Rethinking an API

It is often only after you build something that you realize how to improve it.

For those who used RecList 1.0: we have made major updates to the RecList API. Originally, we had harder constraints on code structure and input/output pairs.

Indeed, when we implemented RecList we intended to provide a more general API for the evaluation of recommender systems that offered several out-of-the-box functionalities. However, to do this we had to create several abstract interfaces that users had to implement.

For example, the original RecList 1.0 required users to wrap their own models and datasets in pre-defined abstract classes (i.e., RecModel and RecDataset). This allowed us to implement a common set of behaviors connected by these abstractions. However, we soon realized that this often complicates flows and requires a lot of additional work that some users might not want to take on.

In RecList 2.0 we decided to make these constraints optional, which makes testing much more flexible. Users define their own evaluation use case, wrap it in a handy decorator, and get metadata storage and logging already implemented. They can then share the test interface with other people, who can run the very same experiments.

The summary: we realized how important flexibility is when we build software that other people have to use.

RecList 2.0 In Action

Now, let's explore a simple use case to see how to use RecList to write and run an evaluation pipeline. We are going to use a very simple model that outputs numbers at random, to strip away the complexity involved in a real machine learning project.

A Simple Use Case

Let's create a very simple use case with an equally simple dataset. Assume we have a target sequence of integers, each with an associated category; we are simply going to generate some random data.

from random import randint, choice

n = 10000

target = [randint(0, 1) for _ in range(n)]
metadata = {"categories": [choice(["red", "blue", "yellow"])
                           for _ in range(n)]}

Our very simple dataset should look something like this:

>>> target

[0, 1, 0, 1, 1, 0]

>>> metadata

{"categories" : ["red", "blue", "yellow", "blue", "yellow", "yellow"]}

A Simple Model

Let’s now assume we have a DummyModel that outputs integers at random. Of course, as we said, this is not a “good” model, but it’s a good abstraction we can use to see an entire evaluation pipeline.

class DummyModel:
    def __init__(self, n):
        self.n = n

    def predict(self):
        from random import randint
        return [randint(0, 1) for _ in range(self.n)]


simple_model = DummyModel(n)

# let's run some predictions
predictions = simple_model.predict()

Now, how do we run evaluations?

A Simple RecList

A RecList is a Python class that inherits functionality from our RecList abstract class and implements RecTests, simple abstractions that allow you to systematize evaluation. For example, this could be a possible accuracy test.

@rec_test(test_type="Accuracy", display_type=CHART_TYPE.SCALAR)
def accuracy(self):
"""
Compute the accuracy
"""
from sklearn.metrics import accuracy_score

return accuracy_score(self.target, self.predictions)

We are taking the sklearn accuracy metric and wrapping it in a method. What makes this different from a plain accuracy function? Well, the decorator brings over some additional features: for example, the test will now automatically store its results in a local folder. Also, declaring a chart type allows us to create visualizations for these results.

What if we wanted a more sophisticated test? For example, what if we want to see how stable our accuracy is across the different categories (e.g., is the accuracy computed on red objects higher than on yellow objects)?

@rec_test(test_type="SlicedAccuracy", display_type=CHART_TYPE.SCALAR)
def sliced_accuracy_deviation(self):
"""
Compute the accuracy by slice
"""
from reclist.metrics.standard_metrics import accuracy_per_slice

return accuracy_per_slice(
self.target, self.predictions, self.metadata["categories"])

Let’s now look at an example of a complete RecList!

from reclist.reclist import RecList, rec_test, CHART_TYPE  # module path may vary across versions


class BasicRecList(RecList):

    def __init__(self, target, metadata, predictions, model_name, **kwargs):
        super().__init__(model_name, **kwargs)
        self.target = target
        self.metadata = metadata
        self.predictions = predictions

    @rec_test(test_type="SlicedAccuracy", display_type=CHART_TYPE.SCALAR)
    def sliced_accuracy_deviation(self):
        """
        Compute the accuracy by slice
        """
        from reclist.metrics.standard_metrics import accuracy_per_slice

        return accuracy_per_slice(
            self.target, self.predictions, self.metadata["categories"]
        )

    @rec_test(test_type="Accuracy", display_type=CHART_TYPE.SCALAR)
    def accuracy(self):
        """
        Compute the accuracy
        """
        from sklearn.metrics import accuracy_score

        return accuracy_score(self.target, self.predictions)

    @rec_test(test_type="AccuracyByCountry", display_type=CHART_TYPE.BARS)
    def accuracy_by_country(self):
        """
        Compute the accuracy by country
        """
        # NOTE: this is a static test, used to showcase the bar display
        from random import randint
        return {"US": randint(0, 100),
                "CA": randint(0, 100),
                "FR": randint(0, 100)}

A few lines of code are enough to put everything we need in one place. We can reuse this code for new models, or add tests and re-run past models.

As long as your metrics return some values, you can implement them in any way you like. For example, this BasicRecList evaluates a specific model in a specific context, but nothing prevents you from writing more model-specific RecLists (e.g., a GPT-RecList) or dataset-specific RecLists (e.g., an IMDB-RecList). If you want to see an example of a deep model evaluated with RecList, you can check out this Colab.
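To make the dataset-specific idea concrete, here is a minimal sketch that reuses only the pieces shown above; the IMDBRecList name and the extra test are purely illustrative, not part of the library.

# Hypothetical example: a dataset-specific RecList that extends BasicRecList
# with one extra, made-up test, reusing the same decorator-based pattern.
class IMDBRecList(BasicRecList):

    @rec_test(test_type="PositiveRate", display_type=CHART_TYPE.SCALAR)
    def positive_rate(self):
        """
        Fraction of positive predictions (a simple sanity check)
        """
        return sum(self.predictions) / len(self.predictions)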

Running and Getting The Outputs

Let’s run the RecList. We need our target data, the metadata, and the predictions. We can also specify a logger and a metadata store.

rlist = BasicRecList(
    target=target,
    metadata=metadata,
    predictions=predictions,
    model_name="myRandomModel",
)

# run reclist
rlist(verbose=True)

What's the output of this procedure? In our command line we see the following set of results: for each test, an actual score.

The metrics are also automatically plotted. For example, the AccuracyByCountry should show something like this:

Example of plot generated by a RecTest.

In addition to this, RecList saves a JSON file that contains all the information from the experiments we have just run:

{
  "metadata": {
    "model_name": "myRandomModel",
    "reclist": "BasicRecList",
    "tests": [
      "sliced_accuracy",
      "accuracy",
      "accuracy_by_country"
    ]
  },
  "data": [
    {
      "name": "SlicedAccuracy",
      "description": "Compute the accuracy by slice",
      "result": 0.00107123176804103,
      "display_type": "CHART_TYPE.SCALAR"
    },
    ...
  ]
}

The nice thing is that with a few lines of additional code, most of the logging is taken care of for us!
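Because the report is plain JSON, a few lines of standard Python are enough to load it back and compare runs. In the sketch below, the file name is a placeholder: point it at the JSON file RecList wrote for your run.

import json

# Load a saved RecList report and print one line per test.
# "reclist_report.json" is a placeholder path for the file written by your run.
with open("reclist_report.json") as f:
    report = json.load(f)

print(report["metadata"]["model_name"], "/", report["metadata"]["reclist"])
for test in report["data"]:
    print(f'{test["name"]}: {test["result"]}')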

Using Online Loggers and Metadata Storages

By default, the RecList runner is going to use the following logger and metadata setup.

logger=LOGGER.LOCAL,
metadata_store=METADATA_STORE.LOCAL,

However, nothing prevents us from using online and cloud solutions. For example, we wrap both the CometML and Neptune APIs so that you can use them directly in your evaluation pipeline. We also offer support for S3 data storage.

For example, adding a couple of parameters to the BasicRecList will allow us to log information to Neptune (we offer similar support for Comet.ml)!

import os

rlist = BasicRecList(
    target=target,
    model_name="myRandomModel",
    predictions=predictions,
    metadata=metadata,
    logger=LOGGER.NEPTUNE,
    metadata_store=METADATA_STORE.LOCAL,
    NEPTUNE_KEY=os.environ["NEPTUNE_KEY"],
    NEPTUNE_PROJECT_NAME=os.environ["NEPTUNE_PROJECT_NAME"],
)

# run reclist
rlist(verbose=True)

In a very similar way, adding the following:

bucket=os.environ["S3_BUCKET"]

will allow us to use an S3 bucket to store metadata (of course, you will need to set some environment variables for this as well).
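Putting it together, a possible S3-backed call could look like the sketch below; this assumes the rest of the setup mirrors the examples above and that your AWS credentials are already available in the environment (the exact storage configuration may differ across versions).

import os

# Sketch: the same BasicRecList call as before, with a bucket parameter added
# so metadata can be stored on S3. Assumes AWS credentials are set in the
# environment; exact configuration may differ across reclist versions.
rlist = BasicRecList(
    target=target,
    metadata=metadata,
    predictions=predictions,
    model_name="myRandomModel",
    bucket=os.environ["S3_BUCKET"],
)

# run reclist
rlist(verbose=True)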

Conclusion

That's all! We created RecList to make evaluation in recommender systems more systematic and organized. We hope this large API refactoring helps people build more reliable evaluation pipelines!

Acknowledgments

Between June and December 2022, the development of our beta was supported by the awesome folks at Comet, Neptune, and Gantry, and carried out with the help of Unnati Patel.
