How SumUp built a low-latency feature store using Amazon EMR and Amazon Keyspaces

This post was co-authored by Vadym Dolinin, Machine Learning Architect at SumUp. In their own words, SumUp is a leading financial technology company, operating across 35 markets on three continents. SumUp helps small businesses be successful by enabling them to accept card payments in-store, in-app, and online, in a simple, secure, and cost-effective way. Today, SumUp card readers and other financial products are used by more than 4 million merchants around the world.

The SumUp Engineering team is committed to developing convenient, impactful, and secure financial products for merchants. To fulfill this vision, SumUp is increasingly investing in artificial intelligence and machine learning (ML). The internal ML platform in SumUp enables teams to seamlessly build, deploy, and operate ML solutions at scale.

One of the central components of SumUp's ML platform is the online feature store. It allows multiple ML models to retrieve feature vectors with single-digit-millisecond latency, and enables the application of AI to latency-critical use cases. The platform processes hundreds of transactions every second, with volume spikes during peak hours, and has steady growth that doubles the number of transactions every year. Because of this, the ML platform requires its low-latency feature store to also be highly reliable and scalable.

In this post, we show how SumUp built a millisecond-latency feature store. We also discuss the architectural considerations when setting up this solution so it can scale to serve multiple use cases, and present results showcasing the setup's performance.

Overview of solution

To train ML models, we need historical data. During this phase, data scientists experiment with different features to test which ones produce the best model. From a platform perspective, we need to support bulk read and write operations. Read latency isn't critical at this stage because the data is read into training jobs. After the models are trained and moved to production for real-time inference, the requirements for the platform change: we need to support low-latency reads and use only the latest features data.

To fulfill these needs, SumUp built a feature store consisting of offline and online data stores. These were optimized for the requirements as described in the following table.

Data Store | History Requirements | ML Workflow Requirements | Latency Requirements | Storage Requirements | Throughput Requirements | Storage Medium
Offline | Entire history | Training | Not critical | Cost-effective for large volumes | Bulk reads and writes | Amazon S3
Online | Only the latest features | Inference | Single-digit millisecond | Not critical | Read optimized | Amazon Keyspaces

Amazon Keyspaces (for Apache Cassandra) is a serverless, scalable, and managed Apache Cassandra–compatible database service. It's built for consistent, single-digit-millisecond response times at scale. SumUp uses Amazon Keyspaces as a key-value pair store, and these features make it suitable for their online feature store. Delta Lake is an open-source storage layer that supports ACID transactions and is fully compatible with Apache Spark, making it highly performant at bulk read and write operations. You can store Delta Lake tables on Amazon Simple Storage Service (Amazon S3), which makes it a good fit for the offline feature store. Data scientists can use this stack to train models against the offline feature store (Delta Lake). When the trained models are moved to production, we switch to using the online feature store (Amazon Keyspaces), which offers the latest feature set, scalable reads, and much lower latency.

Another important consideration is that we write a single feature job to populate both feature stores. Otherwise, SumUp would have to maintain two sets of code or pipelines for each feature creation job. We use Amazon EMR and create the features using PySpark DataFrames. The same DataFrame is written to both Delta Lake and Amazon Keyspaces, which eliminates the hurdle of having separate pipelines.
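The dual-write pattern can be illustrated with a toy sketch — pure Python standing in for the actual Spark and connector calls, with made-up column names. The offline sink keeps the full history, while the online sink is a key-value upsert where the latest row per ID wins:

```python
from datetime import date

# One feature job produces a single set of rows (stand-in for a PySpark DataFrame).
features = [
    {"id": "hh-001", "day": date(2022, 6, 1), "mean_energy": 11.2},
    {"id": "hh-001", "day": date(2022, 6, 2), "mean_energy": 12.7},
    {"id": "hh-002", "day": date(2022, 6, 2), "mean_energy": 8.4},
]

offline_table = []  # stand-in for the Delta Lake table on S3
online_table = {}   # stand-in for the Amazon Keyspaces table

def write_offline(rows):
    """Offline store: append everything, keeping the entire history."""
    offline_table.extend(rows)

def write_online(rows):
    """Online store: key-value upsert, so only the latest row per ID survives."""
    for row in sorted(rows, key=lambda r: r["day"]):
        online_table[row["id"]] = row

# The same feature set feeds both sinks -- no duplicated pipeline.
write_offline(features)
write_online(features)
```

The key design point is that both writes take the identical input, so a feature is defined exactly once.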

Finally, SumUp wanted to take advantage of managed services. It was important to SumUp that data scientists and data engineers focus their efforts on building and deploying ML models. SumUp had experimented with managing their own Cassandra cluster, and found it difficult to scale because it required specialized expertise. Amazon Keyspaces offered scalability without management and maintenance overhead. For running Spark workloads, we decided to use Amazon EMR. Amazon EMR makes it easy to provision new clusters and automatically or manually add and remove capacity as needed. You can also define a custom policy for auto scaling the cluster to suit your needs. Amazon EMR version 6.0.0 and above supports Spark version 3.0.0, which is compatible with Delta Lake.

It took SumUp 3 months from testing out AWS services to building a production-grade feature store capable of serving ML models. In this post we share a simplified version of the stack, consisting of the following components:

  • S3 bucket A – Stores the raw data
  • EMR cluster – For running PySpark jobs to populate the feature store
  • Amazon Keyspaces feature_store – Stores the online features table
  • S3 bucket B – Stores the Delta Lake table for offline features
  • IAM role feature_creator – For running the feature job with the appropriate permissions
  • Notebook instance – For running the feature engineering code

We use a simplified version of the setup to make it easy to follow the code examples. SumUp data scientists use Jupyter notebooks for exploratory analysis of the data. Feature engineering jobs are deployed using an AWS Step Functions state machine, which consists of an AWS Lambda function that submits a PySpark job to the EMR cluster.
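A minimal sketch of what such a Lambda function could look like, assuming the cluster ID and script location arrive in the event payload (the `add_job_flow_steps` call is the standard EMR API; the step name and event keys are illustrative, not SumUp's actual code):

```python
def build_spark_step(script_s3_uri, step_name="feature-creation"):
    """Build an EMR step definition that runs a PySpark script via spark-submit."""
    return {
        "Name": step_name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
        },
    }

def handler(event, context):
    """Lambda entry point: submit the feature job to a running EMR cluster."""
    import boto3  # imported lazily so the module loads without the SDK present

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(
        JobFlowId=event["cluster_id"],          # e.g. "j-XXXXXXXXXXXXX"
        Steps=[build_spark_step(event["script_s3_uri"])],
    )
    return response["StepIds"][0]
```

The Step Functions state machine then only needs to invoke this function and poll the returned step ID for completion.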

The following diagram illustrates our simplified architecture.


To follow the solution, you need certain access rights and AWS Identity and Access Management (IAM) privileges:

  • An IAM user with AWS Command Line Interface (AWS CLI) access to an AWS account
  • IAM privileges to do the following:
    • Generate Amazon Keyspaces credentials
    • Create a keyspace and table
    • Create an S3 bucket
    • Create an EMR cluster
    • IAM GetRole

Set up the dataset

We start by cloning the project git repository, which contains the dataset we need to place in bucket A. We use a synthetic dataset, under Data/daily_dataset.csv. This dataset consists of energy meter readings for households. The file contains information like the number of measures, minimum, maximum, mean, median, sum, and std for each household each day. To create an S3 bucket (if you don't already have one) and upload the data file, follow these steps:

  1. Clone the project repository locally by running the shell command:
    git clone

  2. On the Amazon S3 console, choose Create bucket.
  3. Give the bucket a name. For this post, we use featurestore-blogpost-bucket-xxxxxxxxxx (it's helpful to append the account number to the bucket name to ensure the name is unique for common prefixes).
  4. Choose the Region you're working in.
    It's important that you create all resources in the same Region for this post.
  5. Public access is blocked by default, and we recommend that you keep it that way.
  6. Disable bucket versioning and encryption (we don't need them for this post).
  7. Choose Create bucket.
  8. After the bucket is created, choose the bucket name and drag the folders Dataset and EMR into the bucket.
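To get a feel for the shape of daily_dataset.csv, the per-household daily statistics it contains can be reproduced with a small pandas sketch (the raw readings and column names here are invented for illustration, not taken from the actual file):

```python
import pandas as pd

# Toy raw meter readings: several measurements per household per day.
raw = pd.DataFrame({
    "id": ["hh-001"] * 4 + ["hh-002"] * 3,
    "day": ["2022-06-01"] * 4 + ["2022-06-01"] * 3,
    "energy": [0.5, 1.2, 0.8, 1.5, 2.0, 1.0, 3.0],
})

# One row per household per day, with the same statistics the dataset ships:
# count, min, max, mean, median, sum, and std.
daily = (
    raw.groupby(["id", "day"])["energy"]
    .agg(["count", "min", "max", "mean", "median", "sum", "std"])
    .reset_index()
)
print(daily)
```

Each output row corresponds to one household-day, which is the granularity the feature job works at.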

Set up Amazon Keyspaces

We need to generate credentials for Amazon Keyspaces, which we use to connect to the service. The steps for generating the credentials are as follows:

  1. On the IAM console, choose Users in the navigation pane.
  2. Choose the IAM user you want to generate credentials for.
  3. On the Security credentials tab, under Credentials for Amazon Keyspaces (for Apache Cassandra), choose Generate Credentials.
    A pop-up appears with the credentials, and an option to download them. We recommend downloading a copy because you won't be able to view the credentials again. We also need to create a table in Amazon Keyspaces to store our feature data. We have shared the schema for the keyspace and table in the GitHub project files Keyspaces/keyspace.cql and Keyspaces/Table_Schema.cql.
  4. On the Amazon Keyspaces console, choose CQL editor in the navigation pane.
  5. Enter the contents of the file Keyspaces/keyspace.cql in the editor and choose Run command.
  6. Clear the editor, enter the contents of Keyspaces/Table_Schema.cql, and choose Run command.

Table creation is an asynchronous process, and you're notified if the table is successfully created. You can also view it by choosing Tables in the navigation pane.

Set up an EMR cluster

Next, we set up an EMR cluster so we can run PySpark code to generate features. First, we need to set up a truststore password. A truststore file contains the application server's trusted certificates, including public keys for other entities; this file is generated by the provided script, and we need to supply a password for protecting it. Amazon Keyspaces provides encryption in transit and at rest to protect and secure data transmission and storage, and uses Transport Layer Security (TLS) to help secure connections with clients. To connect to Amazon Keyspaces using TLS, we need to download an Amazon digital certificate and configure the Python driver to use TLS. This certificate is stored in a truststore; when we retrieve it, we need to provide the correct password.

  1. In the file EMR/, update the following line to a password you want to use:
    # Create a JKS keystore from the certificate

  2. To point the bootstrap script to the one we uploaded to Amazon S3, update the following line to reflect the S3 bucket we created earlier:
    # Copy the Cassandra Connector config
    aws s3 cp s3://{your-s3-bucket}/EMR/app.config /home/hadoop/app.config

  3. To update the app.config file with the correct truststore password, in the file EMR/app.config, update the value for truststore-password to the value you set earlier:
        ssl-engine-factory {
          class = DefaultSslEngineFactory
          truststore-path = "/home/hadoop/.certs/cassandra_keystore.jks"
          truststore-password = "{your_password_here}"

  4. In the file EMR/app.config, update the following lines to reflect the Region and the user name and password generated earlier:
    contact-points = ["cassandra.<your-region>"]
    load-balancing-policy.local-datacenter = <your-region>
    auth-provider {
        class = PlainTextAuthProvider
        username = "{your-keyspace-username}"
        password = "{your-keyspace-password}"

    We need to create default instance roles, which are required to run the EMR cluster.

  5. Update the contents of the S3 bucket created in the prerequisites section by dragging the EMR folder into the bucket again.
  6. To create the default roles, run the create-default-roles command:
    aws emr create-default-roles

    Next, we create an EMR cluster. The following code snippet is an AWS CLI command that has Hadoop, Spark 3.0, Hive, Livy, and JupyterHub installed. It also runs the bootstrapping script on the cluster to set up the connection to Amazon Keyspaces.

  7. Create the cluster with the following code. Provide the subnet ID to start a Jupyter notebook instance associated with this cluster, the S3 bucket you created earlier, and the Region you're working in. You can provide the default subnet; to find it, navigate to VPC > Subnets and copy the default subnet ID.
    aws emr create-cluster --termination-protected --applications Name=Hadoop Name=Spark Name=Livy Name=Hive Name=JupyterHub --tags 'creator=feature-store-blogpost' --ec2-attributes '{"InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"your-subnet-id"}' --service-role EMR_DefaultRole --release-label emr-6.1.0 --log-uri 's3n://{your-s3-bucket}/elasticmapreduce/' --name 'emr_feature_store' --instance-groups '[{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":2}]},"InstanceGroupType":"CORE","InstanceType":"m5.xlarge","Name":"Core - 2"},{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":2}]},"InstanceGroupType":"MASTER","InstanceType":"m5.xlarge","Name":"Master - 1"}]' --bootstrap-actions '[{"Path":"s3://{your-s3-bucket}/EMR/","Name":"Execute_bootstrap_script"}]' --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region your-region

    Finally, we create an EMR notebook instance to run the PySpark notebook Feature Creation and loading-notebook.ipynb (included in the repo).

  8. On the Amazon EMR console, choose Notebooks in the navigation pane.
  9. Choose Create notebook.
  10. Give the notebook a name and choose the cluster emr_feature_store.
  11. Optionally, configure additional settings.
  12. Choose Create notebook. It may take a few minutes before the notebook instance is up and running.
  13. When the notebook is ready, select the notebook and choose either Open JupyterLab or Open Jupyter.
  14. In the notebook instance, open the notebook Feature Creation and loading-notebook.ipynb (included in the repo) and change the kernel to PySpark.
  15. Follow the instructions in the notebook and run the cells one by one to read the data from Amazon S3, create features, and write them to Delta Lake and Amazon Keyspaces.

Performance testing

To test throughput for our online feature store, we run a simulation on the features we created. We simulate approximately 40,000 requests per second. Each request queries data for a specific key (an ID in our feature table). The process tasks do the following:

  • Initialize a connection to Amazon Keyspaces
  • Generate a random ID to query the data
  • Generate a CQL statement:
    SELECT * FROM feature_store.energy_data_features WHERE id=[list_of_ids[random_index between 0-5559]];

  • Start a timer
  • Send the request to Amazon Keyspaces
  • Stop the timer when the response from Amazon Keyspaces is received
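The steps above can be sketched as a small timing harness in pure Python. Here `execute_query` is a stand-in for the actual call to Amazon Keyspaces through the Cassandra driver, so the loop is runnable on its own:

```python
import random
import time

def run_simulation(execute_query, ids, n_requests):
    """Time n_requests random-key lookups and return per-request latencies in ms."""
    latencies = []
    for _ in range(n_requests):
        key = ids[random.randrange(len(ids))]    # generate a random ID
        cql = f"SELECT * FROM feature_store.energy_data_features WHERE id='{key}';"
        start = time.perf_counter()              # start the timer
        execute_query(cql)                       # send the request
        latencies.append((time.perf_counter() - start) * 1000.0)  # stop on response
    return latencies

# Stub standing in for session.execute(...) against Amazon Keyspaces.
def fake_execute(cql):
    return [{"id": "stub"}]

lat = run_simulation(fake_execute, ids=["hh-001", "hh-002"], n_requests=100)
```

In the real simulation, `execute_query` wraps the driver session created from the app.config credentials above.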

To run the simulation, we start 245 parallel AWS Fargate tasks running on Amazon Elastic Container Service (Amazon ECS). Each task runs a Python script that makes 1 million requests to Amazon Keyspaces. Because our dataset only contains 5,560 unique IDs, we generate 1 million random numbers between 0–5559 at the beginning of the simulation and query the corresponding ID for each request. To run the simulation, we included the code in the folder Simulation. You can run the simulation in a SageMaker notebook instance by completing the following steps:

  1. On the Amazon SageMaker console, create a SageMaker notebook instance (or use an existing one). You can choose an ml.t3.large instance.
  2. Let SageMaker create an execution role for you if you don't have one.
  3. Open the SageMaker notebook and choose Upload.
  4. Upload the Simulation folder from the repository. Alternatively, open a terminal window on the notebook instance and clone the repository.
  5. Follow the instructions and run the steps and cells in the Simulation/ECS_Simulation.ipynb notebook.
  6. On the Amazon ECS console, choose the cluster you provisioned with the notebook and choose the Tasks tab to monitor the tasks.

Each task writes the latency figures to a file and moves it to an S3 location. When the simulation ends, we collect all the data to get aggregated stats and plot charts.
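Aggregating the collected figures into the quantiles reported below can be sketched with the standard library (here on a synthetic in-memory list; in practice the latencies come from the files the workers uploaded to S3):

```python
import statistics

def latency_quantiles(latencies_ms):
    """Summarize latencies into the 50th/90th/99th/99.9th percentiles."""
    qs = statistics.quantiles(latencies_ms, n=1000)  # 999 cut points
    return {
        "50%": qs[499],    # median
        "90%": qs[899],
        "99%": qs[989],
        "99.9%": qs[998],
    }

# Example with synthetic latency figures (ms).
sample = [3.0] * 900 + [5.0] * 90 + [25.0] * 10
summary = latency_quantiles(sample)
```

The same function applied to the real 245 million measurements produces the table below.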

In our setup, we set the capacity mode for Amazon Keyspaces to provisioned read capacity units (RCUs) at 40,000 (fixed). When we start the simulation, the RCUs rise close to 40,000, and the simulation takes around an hour to finish, as illustrated in the following visualization.

The first analysis we present is the latency distribution for the 245 million requests made during the simulation. Here the 99th percentile falls within single-digit-millisecond latency, as we would expect.

Quantile Latency (ms)
50% 3.11
90% 3.61
99% 5.56
99.9% 25.63

For the second analysis, we present the following time series charts for latency. The chart at the bottom shows the raw latency figures from all 245 workers. The chart above that plots the average and minimum latency across all workers grouped over 1-second intervals. Here we can see that both the minimum and the average latency stay below 10 milliseconds throughout the simulation. The third chart from the bottom plots maximum latency across all workers grouped over 1-second intervals. This chart shows occasional spikes in latency but nothing consistent to worry about. The top two charts are latency distributions; the one on the left plots all the data, and the one on the right plots the 99.9th percentile. Due to the presence of some outliers, the chart on the left shows a peak close to zero and a long-tailed distribution. When we remove these outliers, we can see in the chart on the right that 99.9% of requests are completed in less than 5.5 milliseconds. This is a great result, considering we sent 245 million requests.


Clean up

Some of the resources we created in this post incur costs if left running. Remember to terminate the EMR cluster, empty and delete the S3 bucket, and delete the Amazon Keyspaces table. Also delete the SageMaker and Amazon EMR notebooks. The Amazon ECS cluster is billed on tasks and won't incur additional costs once the tasks finish.


Conclusion

Amazon EMR, Amazon S3, and Amazon Keyspaces provide a flexible and scalable development experience for feature engineering. EMR clusters are easy to manage, and teams can share environments without compromising compute and storage capabilities. EMR bootstrapping makes it easy to install and try out new tools and quickly spin up environments to test new ideas. Having the feature store split into an offline and an online store simplifies model training and deployment, and provides performance benefits.

In our testing, Amazon Keyspaces was able to handle peak throughput read requests within our desired requirement of single-digit-millisecond latency. It's also worth mentioning that we found the on-demand mode adapted to the usage pattern, with an improvement in read/write latency a few days after it was switched on.

Another important consideration for latency-sensitive queries is row length. In our testing, tables with lower row length had lower read latency. Therefore, it's more efficient to split the data into multiple tables and make asynchronous calls to retrieve it from those tables.
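The pattern of fanning out to several narrow tables concurrently can be sketched with asyncio. The stub fetchers here stand in for the Cassandra driver's asynchronous reads (e.g., wrapping `execute_async` futures), and the table names are purely illustrative:

```python
import asyncio

async def fetch_from_table(table, key):
    """Stub for an async read of one narrow table; a real implementation
    would await the driver's asynchronous query for this table."""
    await asyncio.sleep(0)  # yield control, as a real I/O call would
    return {table: f"row-for-{key}"}

async def fetch_features(key, tables):
    """Query all narrow tables concurrently and merge the partial rows."""
    parts = await asyncio.gather(*(fetch_from_table(t, key) for t in tables))
    merged = {}
    for part in parts:
        merged.update(part)
    return merged

row = asyncio.run(fetch_features("hh-001", ["energy_daily", "energy_weekly"]))
```

Because the per-table reads overlap, total retrieval time is governed by the slowest single table rather than the sum of all reads.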

We encourage you to explore adding security features and adopting security best practices according to your needs and your company's standards.

If you found this post useful, check out Loading data into Amazon Keyspaces with cqlsh for tips on how to tune Amazon Keyspaces, and Orchestrate Apache Spark applications using AWS Step Functions and Apache Livy on how to build and deploy PySpark jobs.

About the authors

Shaheer Mansoor is a Data Scientist at AWS. His focus is on building machine learning platforms that can host AI solutions at scale. His interest areas are MLOps, feature stores, model hosting, and model monitoring.

Vadym Dolinin is a Machine Learning Architect at SumUp. He works with several teams on crafting the ML platform, which enables data scientists to build, deploy, and operate machine learning solutions in SumUp. Vadym has 13 years of experience in the domains of data engineering, analytics, BI, and ML.

Oliver Zollikofer is a Data Scientist at AWS. He enables global enterprise customers to build and deploy machine learning models, as well as architect related cloud solutions.
