Building our Feature Store to speed up our machine learning predictions

A detailed look into the construction of our custom Feature Store

Gabriela Melo
9 min read · Sep 28, 2021

Also available in Portuguese

In the first post in this series, I talked about how we used a Python package to improve the speed of shipping new features while also reducing the risk of introducing training-serving skew at Legiti. In this post I’m going to dive into the latency issue we started facing in our model serving service and what we did to address it.

As I mentioned in that post, our Python package contained the code for calculating all of the features required by our models to evaluate a transaction. This package would get installed as a requirement in our evaluation service, and we would run the feature calculation process, using code from the package, for each order we received. To do that, we needed to query our database for all orders “related” to the order under evaluation; related orders being those that share some piece of information with it: they were made by the same user, or use the same document, or involve the same product, etc.

As you can imagine, not only is this pretty inefficient (we were making very similar queries and calculations repeatedly, on the fly, in our service), but it also does not scale well as data grows, especially because we needed to query and calculate features over our full order history. This evaluation endpoint gets hit by our customers in the middle of their sale process (which might be synchronous), so it is a critical endpoint with strict latency expectations: any delay is observed directly by their end-users. With the queries and feature calculations being executed during the request, there was a risk that our latency would exceed what our customers would accept, and that risk would only grow as time passed and more data accumulated. It became clear to us that we would need some way to start pre-computing at least some, if not all, of these features.

What we needed in our Feature Store

That’s when the idea of using a feature store to help us with those issues came up. At the time, feature stores were starting to get noticed and mentioned by other companies, but there wasn’t much information about them or many examples available yet. There were also few open-source or enterprise solutions, and some of them seemed to offer more than we would actually need, so it felt like adopting one might add overhead that would not make sense for us at the time.

We then decided to build our own feature store. As I mentioned, in the beginning it wasn’t even clear to us what exactly a feature store was, how it worked, how we would calculate our features, and so on, so we spent a lot of time at the start just trying to understand what this would be. In doing so, we started to establish a few definitions that would become the foundation of our feature store.

Types of features we have at Legiti

This is the first thing we needed to make sure was extremely clear to everyone as we started discussing our feature store — understanding the types of features we have and how they get calculated would directly inform the development of our feature store.

Most of the features we use are what we call velocity features, which are also the slowest to calculate. These are features that are based on counts in a given time frame. To help explain them, let’s look at the two different types of velocity features that we have.

The first one is based on simple counts: “how many times has this document been associated with orders in the past 7 days?”, “how many times has this phone number been associated with orders in the full history of orders that we have?”, “how many times has this zip code been associated with chargebacks in the past 6 hours?”, and so forth.

The second type of velocity feature is a little trickier to calculate: we count how many distinct identifiers of one entity a given identifier has been associated with. To make it clearer: “how many different zip codes has this document been associated with in orders in the past 7 days?”, “how many different users has this phone number been associated with in orders in the full history of orders that we have?”, “how many different phone numbers has this zip code been associated with in chargebacks in the past 6 hours?”.

This means that computing our features involves pulling a lot of related data from our database in order to calculate those counts. We use both time ranges that look only at very recent data (the 10 minutes prior to the order date and time) and time ranges that reach back to the distant past (we have velocity features looking at “ever”, that is, as far back as we have customer data). This leads us to the next point.
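To make the two types concrete, here is a minimal sketch of how such counts could be computed for a single order, assuming a hypothetical pandas DataFrame of orders with illustrative column names (created_at, document, zip_code); this is not our actual package code.

```python
import pandas as pd

def velocity_features(orders: pd.DataFrame, order_ts: pd.Timestamp, document: str) -> dict:
    # Restrict the history to the 7 days preceding the order under evaluation.
    window = orders[
        (orders["created_at"] >= order_ts - pd.Timedelta(days=7))
        & (orders["created_at"] < order_ts)
    ]
    same_document = window[window["document"] == document]
    return {
        # First type: a simple count within the time frame.
        "orders_with_document_past_7d": len(same_document),
        # Second type: how many distinct identifiers of another entity
        # (here, zip codes) this document was associated with in the time frame.
        "distinct_zip_codes_with_document_past_7d": same_document["zip_code"].nunique(),
    }
```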

Distant past and recent past features

After clarifying the types of features we had, one of the first things we did was to distinguish between two different types of velocity features, based on the time periods used to calculate them.

  • The first type is related to the distant past — these features can only use data up until a certain point in time for their calculation. This point in time is relative to each order timestamp. For example, we can use data up until the last midnight to calculate a feature value for a distant past feature for an order happening this afternoon.
  • The second type is related to the recent past — these can only use data starting from a certain point in time, also relative to each order timestamp. For example, using data from the 24 hours before the order time to compute the feature value.

Given that distant past features don’t depend on very recent data relative to the order timestamp, they can be calculated on a fixed schedule. Their calculation is also the more expensive one (it needs more data), so we decided this type of feature would be the first to go into our feature store. What we implemented, then, is this: when an order comes in for evaluation, distant past features are consumed from the feature store, and only recent past features are calculated on the fly in our order evaluation service (we still use our Python package for the calculation of those features).
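As a rough illustration of that split, using the example cutoffs above (the actual boundaries vary per feature), here is a small sketch:

```python
from datetime import datetime, time, timedelta

def feature_windows(order_ts: datetime):
    """Illustrative time windows for one order, relative to its timestamp."""
    last_midnight = datetime.combine(order_ts.date(), time.min)
    # Distant past: data up to the last midnight before the order. Because this
    # cutoff does not depend on very recent data, these values can be
    # pre-computed by a scheduled job and served from the feature store.
    distant_past = (None, last_midnight)
    # Recent past: data from a short window just before the order (24 hours in
    # the example above). These values are still computed on the fly by the
    # evaluation service.
    recent_past = (order_ts - timedelta(hours=24), order_ts)
    return distant_past, recent_past
```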

Pre-computing feature values

In the section above I mentioned that we could start pre-computing our distant past feature values. But how can we calculate feature values for an order that hasn’t happened yet?

The reason we can do that is that our velocity features are tied to what we call entity identifiers, which are present in orders. For example, one entity we use is the user’s zip code. So what we can do is pre-calculate all feature values for every zip code we have seen before. Even if an order for that zip code hasn’t happened yet today, we will already have calculated its zip code feature values, because that zip code was present in our history of orders.

For zip codes that haven’t shown up yet, all feature values related to them will be 0 anyway, so when we try to fetch a feature value for a given identifier from the feature store and it isn’t present, we know we can just use 0 as that feature’s value.

This means that for any order happening today, what we need is the current feature value for all the entity identifiers present in that order.
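A minimal sketch of that lookup, assuming a Redis-backed online store with a hypothetical entity:identifier key schema (the key layout, host, and feature names are illustrative):

```python
import redis

# Connection details are placeholders.
online_store = redis.Redis(host="online-store.example.com", port=6379, decode_responses=True)

def get_online_feature(entity: str, identifier: str, feature_name: str) -> float:
    value = online_store.hget(f"{entity}:{identifier}", feature_name)
    # An identifier we have never seen before has no entry in the store,
    # and all of its velocity features are 0 by definition.
    return float(value) if value is not None else 0.0

# e.g. a zip code that has never appeared in an order simply yields 0.0
get_online_feature("zip_code", "12345-678", "orders_with_zip_code_ever")
```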

Access to feature values

The next thing we understood was that there are two different types of access to feature values, each with different requirements:

  • At production inference time we need extremely fast access to the current feature value;
  • At model development and training time we need to access feature values for all orders, with point-in-time consistency (being able to “time travel”) but without a strict latency requirement.

To support these two access patterns, we established that we’d have two different stores. Following the common naming in feature stores, we called these our online and offline stores. The online one would be based on storage solutions that give very quick access to data (Cassandra, Redis, etc.); it provides current feature values and is used to feed production inference requests. The offline one would be based on storage solutions built for storing large amounts of data (commonly used solutions are Hive, S3, HDFS) and is what feeds feature values for model training and development.

One common source of confusion is that, even though we have two different types of features (distant past and recent past features), and two different stores (online and offline), both stores can contain values for both types of features.

Feature computation processes

As mentioned at the end of the last section, we need to continuously feed both our online and our offline stores with features computed from up-to-date data, so we have recurring jobs for doing so.

To feed our online store with the distant past feature values that will be used for evaluations throughout the day, we have a daily job, running overnight, that calculates all those values and stores them in our online store. It also stores all of those features in our offline store, for traceability.
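Here is a heavily simplified PySpark sketch of what a job like that could do, assuming an orders table already readable by Spark and illustrative column names, bucket paths, and Redis details (none of these reflect our actual schema or infrastructure):

```python
from datetime import datetime, time, timezone

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("distant-past-features").getOrCreate()

# Cutoff: the last midnight (UTC here for simplicity).
cutoff = datetime.combine(datetime.now(timezone.utc).date(), time.min)
cutoff_str = cutoff.strftime("%Y-%m-%d %H:%M:%S")

orders = spark.read.parquet("s3://example-bucket/orders/")  # placeholder source

distant_past = (
    orders
    .filter(F.col("created_at") < cutoff_str)
    .groupBy("zip_code")
    .agg(
        F.count("*").alias("orders_with_zip_code_ever"),
        F.countDistinct("user_id").alias("distinct_users_with_zip_code_ever"),
    )
)

# Offline store: append this run's values to S3 for traceability and training.
(distant_past
    .withColumn("computed_at", F.lit(cutoff_str))
    .write.mode("append")
    .parquet("s3://example-bucket/offline-store/zip_code/"))

# Online store: push the current values to Redis, one hash per identifier.
def write_partition(rows):
    import redis
    client = redis.Redis(host="online-store.example.com", port=6379)
    pipe = client.pipeline()
    for row in rows:
        pipe.hset(
            f"zip_code:{row['zip_code']}",
            mapping={
                "orders_with_zip_code_ever": row["orders_with_zip_code_ever"],
                "distinct_users_with_zip_code_ever": row["distinct_users_with_zip_code_ever"],
            },
        )
    pipe.execute()

distant_past.foreachPartition(write_partition)
```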

For generating training datasets, something common at many companies using feature stores is that different teams re-use features. To allow those teams to generate different training datasets from the data in the feature store, what is usually provided is a way to extract features from the offline store with point-in-time consistency and join them to form the dataset each model needs. In our case, however, there is a single dataset format that we need to build from the features in the offline store. So we also run a job that stores the training datasets themselves in our offline store, which saves us from repeating point-in-time consistent joins at feature extraction time. This separate feature computation job runs periodically, following our model re-training frequency. It also spares us from writing a separate backfill job, since it already calculates feature values for all past orders, and it simplifies feature versioning.
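For context, a point-in-time consistent join is one where each order only sees feature values computed before its own timestamp, so no future information leaks into the training data. A minimal pandas sketch of that idea, with illustrative column names, is roughly what our dataset job saves us from repeating at every extraction:

```python
import pandas as pd

def point_in_time_join(orders: pd.DataFrame, feature_history: pd.DataFrame) -> pd.DataFrame:
    """orders: one row per order; feature_history: one row per (identifier, computation time)."""
    # For each order, pick the latest feature value computed strictly before
    # the order happened.
    return pd.merge_asof(
        orders.sort_values("created_at"),
        feature_history.sort_values("computed_at"),
        left_on="created_at",
        right_on="computed_at",
        by="zip_code",
        direction="backward",
        allow_exact_matches=False,
    )
```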

Tooling in our Feature Store

With those definitions clear, we can now look at the architecture we developed.

Our feature store architecture

All data that we use for feature computation comes from our DB, which is a PostgreSQL database on RDS. Both of our feature calculation jobs run on EMR clusters, with PySpark, and get triggered periodically by CircleCI. Our offline store is an S3 bucket, and our online one is a Redis cluster on ElastiCache. Our order evaluation service runs as a Python API, using Flask, on Kubernetes, and our model training processes run on EC2, triggered both manually and on a schedule. Many of our decisions were guided by the size of our team — we are still a small team and need things that don’t require too much work to set up or maintain, hence the frequent use of AWS-managed solutions in our architecture.

Conclusions

The implementation of our feature store brought great improvements in the latency of our evaluation endpoint: the average request duration decreased by more than a third. It also solved a parallel problem we had been facing, long feature computation times during experimentation and model training, since our data scientists can now access the pre-calculated values for our distant past features.

We’ve been really happy with our feature store solution so far, and now we’d like to take it even further. Some of the improvements we’re considering include making it more user-friendly for our data scientists to iterate and create new features with, and starting to include recent past features in the feature store by making use of streaming computation.

Other resources

When implementing our feature store, we relied heavily on some good resources we found online. If you want recommendations for further reading, we suggest this very comprehensive article by Uber on their machine learning platform. We also consulted this website a lot, which lists many of the open feature store solutions that companies have implemented. If you’re looking for more on why feature stores are useful, this article might help you. We also consulted this series of articles, which starts by explaining some concepts around feature stores and then goes on to explain how some of the publicly disclosed feature stores at the time had been implemented.
