Bridging the gap between feature calculation for training and serving

How we used a feature generation Python package in our machine learning serving API to make shipping new features easier

Gabriela Melo
9 min read · Jun 1, 2021

Also available in Portuguese

Photo by Clément Hélardot on Unsplash

Legiti has been, from day one, a machine learning-based company. Unlike companies in which machine learning comes up as a possible tool for improving services and processes as the company evolves, machine learning lies at the very core of our product and is the source of the value we deliver to our customers.

What we do have in common with any other early-stage startup is that we began with a very small team. If you're trying to implement a wide variety of complex technological solutions, involving data collection, processing, storage, and modeling, with a limited number of people, you quickly learn that you need to work in a way that minimizes the effort and resources required to ship product improvements.

In this post, I will explain how we built a feature generation Python package to be used in our order evaluation API. This API is the core of our product; it provides a reliable, timely interface for customers to request our fraud detection decisions.

Our feature generation package gives us a simple and effective solution to a common problem in machine learning: having to re-implement features between training and serving. That situation slows down the entire feature development and model deployment process and usually means that different people or different teams need to be involved in a new feature release. Furthermore, discrepancies between feature development for training and feature calculation during serving can lead to different feature values being used in training and serving, which can negatively impact model performance.

I will begin by explaining the situation we had, which will provide the foundation for understanding what we had to build. Then, I will talk about what our very first solution looked like, and from there we will get to what our feature generation package is and how it works.

Feature Generation

If you’re building a machine learning model that will be used for online inference — that is, it will not be used for generating batch predictions, but rather to receive requests one by one and provide individual predictions for each request — you will need to serve your model so that it can receive those requests and respond with the predictions or decisions. In most frequent scenarios, the input data that you will receive — let’s call it raw data — is not in the format that your model expects: your model needs features that are generated from that raw data. That means that you’ll need to have some code in between the raw data and the usage of the model to transform that raw data into model features.

Usually, you will have data scientists iterating on those features, trying to come up with new ideas that may help your model. That iteration cycle happens over datasets with which you train and evaluate your model. However, as mentioned above, this mode of feature generation is quite different from the one used for serving requests: during experimentation and training, features are generated for an entire batch of data, whereas during model serving they are generated for each request separately. As a result, the environments in which model training and inference happen end up quite different from each other.
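
To illustrate the difference, here is how the hypothetical build_features function sketched above would be called in each environment (historical_orders_df, request_payload, and model are placeholder names):

    # Training: features for the entire historical dataset at once.
    training_features = build_features(historical_orders_df)

    # Serving: the same function over a one-row DataFrame built from one request.
    request_df = pd.DataFrame([request_payload])  # request_payload is a dict
    decision = model.predict(build_features(request_df))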

Additionally, model serving code usually runs from an API, while experimentation code might live inside a Jupyter Notebook and training code might be in a script that is run manually or orchestrated as a job. Often, experimentation and model training code are not even in the same repository as the model serving code, and data scientists may not be familiar with API code and production serving frameworks.

On top of that, depending on the type of model and solution you are building, developing code for those features will not be a one-off effort: you will want to iterate on your model, adding new features or changing how existing ones are calculated. Feature development, then, becomes a constant, iterative task.

All these points contribute to some common issues surrounding feature generation code. You need to iteratively develop features, quickly validate their value, and ship them to production, but writing experimental features and getting them into production may be two disconnected pieces of the feature development lifecycle.

An additional problem that surfaces when experimentation and production use different code is training-serving skew: slightly different feature values caused by differences in the code that generates them, which can affect the decisions made in production. Not only can this lead to worse decisions than your model would make had it received accurate feature values, it also means the performance you see in experimentation and training is not necessarily a good proxy for production performance.

The differences in how features get developed for training and serving open a gap between a data scientist's experimental work and the code that provides production predictions to your users. Here at Legiti we wanted, and needed, to close that gap so that we could ship feature additions or modifications smoothly, with as few people and as little work as possible. We also need very high reliability in the code we put in production, as it is the main product we provide, so we did not want to leave room for training-serving skew to creep in. The next section describes the very first system we had, how these issues showed up in it, and how we solved them.

Initial Model Serving Implementation

Our order evaluation service was required as part of our MVP (Minimum Viable Product), so this is code that we wrote very early on at Legiti. The initial code for that service was developed after a "v0" model had been generated. This model had been trained in a Jupyter Notebook, which produced a pickle file, and that serialized model then needed to get into our order evaluation application (we had decided to serve models by embedding them into the evaluation service). At that point, the feature calculation code was implemented directly inside the order evaluation service, by a different person than the one who had written the model training code, and without sharing any code with the way data had been extracted, prepared, and used to train the model. Also, to calculate features in the evaluation service we use data not only from the request body but also from our database, querying it with information from the request to fetch data for related orders. This query code was also re-implemented for the evaluation service.

As our first data scientist joined the team and started working on model improvements, lots of new features were implemented, and features that had been part of the previous model were changed. The code in our order evaluation service quickly became completely outdated. If we hadn't noticed that the implementation of some features and some training queries had changed, we could have introduced training-serving skew, one of the problems listed in the section above.

As we picked up the task of bringing that service up to date with our new models, we quickly realized that if every new feature developed by our data scientist required re-implementing code to bring it into the evaluation service, we would be stuck in a continuous loop of playing catch-up between model serving and model training, the other problem listed in the section above.

One advantage we had is that our feature generation code for experimentation and training was already shared. All of our experimentation ran from Python scripts using code in Python modules (we were not using Jupyter Notebooks for this), so we would have no trouble having our data scientists write feature generation code in Python files instead of notebooks, as they were already doing exactly that; we just needed a way to share that code with inference (our inference feature generation code lived in another repository, so it wasn't all part of the same code base). Moreover, data was already handled as DataFrames in the same format in both training and inference, albeit much smaller DataFrames for production inference requests, so we wouldn't have any issues sharing the feature generation code between these two environments either.

Therefore, we just needed a way to use that same code for both inference and training, without introducing any other changes to our development lifecycle (such as moving from Jupyter Notebooks to Python scripts). This motivated us to create a Python package, distributed via PyPI, to be shared between experimentation/training and inference. This removes the need to re-implement code: it gets implemented once and is shared between the environments. Below I present in more detail how this package and its development cycle work.

Feature Generation Package

The Python package that we built is a library that is now used at experimentation, training, and serving time at Legiti. Most of the work to create it involved recognizing which pieces of code needed to run at both training and inference and moving those pieces to a separate directory in our repository, which gets published as the package. The package mainly consists of code that accepts Pandas DataFrames containing data queried from our database and runs all the steps required to produce features that can then be passed on to our models.
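
As an illustration only (none of these names are Legiti's actual interface), the package's entry point might look something like a DataFrame-in, DataFrame-out pipeline:

    import pandas as pd

    def _order_features(orders: pd.DataFrame) -> pd.DataFrame:
        # Features computed directly from the order rows themselves.
        out = pd.DataFrame(index=orders.index)
        out["order_amount"] = orders["amount"]
        return out

    def _history_features(orders: pd.DataFrame, related: pd.DataFrame) -> pd.DataFrame:
        # Features derived from previously seen orders for the same customer.
        counts = related.groupby("customer_id").size().rename("past_order_count")
        joined = orders[["customer_id"]].join(counts, on="customer_id")
        return joined[["past_order_count"]].fillna(0)

    def generate_features(orders: pd.DataFrame, related: pd.DataFrame) -> pd.DataFrame:
        """Entry point shared by experimentation, training, and serving."""
        return _order_features(orders).join(_history_features(orders, related))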

We also use the package to share query code. Of course, the amount of data we need in each environment is different: for training, we need all the data; for inference, we need only what is required to generate features for the current request. To allow for that flexibility, we use JinjaSQL templates to differentiate the queries that run for training and serving. This way, we keep the same query structure, with additional filters at serving time to limit the amount of data returned. Sharing this code ensures that the DataFrames produced by the queries are in the correct format, ready to be passed to the next steps in the pipeline that generate the feature values.
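
As a rough sketch of that idea (the table, columns, and filters below are made up, not Legiti's actual queries), a single JinjaSQL template can render with or without the serving-time filters:

    from jinjasql import JinjaSql

    # One shared template: training renders it without filters, serving passes
    # request-specific parameters so only the relevant rows come back.
    RELATED_ORDERS_TEMPLATE = """
    SELECT *
    FROM orders
    {% if customer_id %}
    WHERE customer_id = {{ customer_id }}
      AND created_at < {{ evaluated_at }}
    {% endif %}
    """

    j = JinjaSql()

    # Training: the full table.
    training_query, training_params = j.prepare_query(RELATED_ORDERS_TEMPLATE, {})

    # Serving: only the data needed for the current request.
    serving_query, serving_params = j.prepare_query(
        RELATED_ORDERS_TEMPLATE,
        {"customer_id": "abc-123", "evaluated_at": "2021-06-01T12:00:00"},
    )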

At experimentation and training time, the package's code lives in the same repository as the rest of the code used for those steps, so it's very easy to iterate on while developing a new feature: we can simply import the package's code as Python modules, since it's on our PYTHONPATH. For model serving, the code gets packaged and published to our internal PyPI server, and our order evaluation service installs it like any other requirement.
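
In practice, this means the import line is identical in both places; only where the module resolves from changes. A tiny illustration, with a hypothetical package name:

    # Experimentation/training: resolves from the repository on PYTHONPATH.
    # Serving API: resolves from the version installed from the internal PyPI server.
    from feature_generation.pipeline import generate_features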

Our serialized models are stored as pickles, and those go into the package too. This is a simple way to guarantee feature-model compatibility: the feature generation code used in production always corresponds to the current model version, since they ship in the same package. Integration tests help guarantee this always remains true.
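
A minimal sketch of loading a model that ships inside the package, using the standard library's importlib.resources (the package and file names are assumptions, not Legiti's actual layout):

    import pickle
    from importlib import resources

    def load_model():
        # The pickle lives inside the installed package, so the feature code and
        # the model it was trained with always move together. An integration test
        # can then assert that the loaded model accepts generate_features() output.
        with resources.open_binary("feature_generation.artifacts", "model.pkl") as f:
            return pickle.load(f)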

This package is published automatically on merges to the main branch of our repository, using CircleCI. For versioning, we have a very simple script that automatically increments the version number on every publish. Our main goal was to have something that uniquely links commits to package versions, without our developers having to think about it or track it manually, and it has served us very well in that sense. For our internal PyPI server, we found that a simple solution can get you a long way: we're running PyPICloud on an AWS EC2 instance, with AWS S3 as the storage backend.
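
The version-bumping part can be as small as a few lines. This is only an illustrative sketch of the approach (reading a VERSION file and bumping the patch number), not Legiti's actual script:

    import pathlib

    def bump_patch(version: str) -> str:
        major, minor, patch = version.split(".")
        return f"{major}.{minor}.{int(patch) + 1}"

    # CI runs this before building, so every merge to main publishes a new version.
    version_file = pathlib.Path("VERSION")
    version_file.write_text(bump_patch(version_file.read_text().strip()) + "\n")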

Conclusions

Having this feature generation package was fundamental for the continuous delivery of new features as we implemented our first models and tried to validate product-market fit and the results we could bring to our customers. It allowed us, with an incredibly small team relative to the number of pieces our solution encompasses, to quickly add new features and painlessly change how features get calculated, all while preventing training-serving skew. This package became a centerpiece of the quality of the solution we deliver with our order evaluation service.

Of course, the journey was far from finished once we had the package implemented and in use. Our solution then started facing latency issues due to the time required to calculate all of the features we provide our models with. Stay tuned for part 2 of this series, in which we will describe how we used an internally built feature store to overcome those latency difficulties!
