Success - Reducing ML Training Costs
The story of how I reduced our ML training costs by upwards of 78%

How it started

It was March 2020, when the COVID-19 pandemic hit us. We at Fynd, like any other e-commerce platform, had seen our main revenue take a significant hit. We had to reduce our fixed monthly costs, and our #AWS bill turned out to be the biggest line item in the monthly billing cycle. On analysing our #AWS spend reports, we found that #Sagemaker was eating up a huge chunk of the bill: close to 4 times what our entire Kubernetes infrastructure cost. This was mainly because many instances had been spun up for experimentation and were left running constantly.

There were times when our data science team needed just 2 GPUs, but they ended up taking a 4-GPU notebook instance simply because AWS didn't offer a 2-GPU configuration.

At the time, there was no easy fix for the GPUs being underutilised.

Shared Instances

We thought of using shared instances to prevent unnecessary instances from being spun up. But everyone had their own way of working and their own method of managing the code and data in their notebook instances, so sharing a common instance to reduce costs didn't seem like the best option.

We still tried to push for shared instances, but it was often met with resentment and reluctance, which was a completely understandable and normal reaction: not everyone is open to sharing their workspace, let alone confining their work to a specific folder. On top of that, there was a chance of package version mismatches, which could be mitigated with virtual environments; but again, everyone had become so used to having their own workspace with full control over it.

Considering this, I started looking for other options that would let us reduce costs without taking a hit on our work.

Spot Instance Savings

The first thing suggested to me was using spot instances for all ML training/experimentation workloads.

For those who are not familiar, AWS offers its unused capacity at really cheap rates (close to 50-80% off the OnDemand price). We had recently added spot instances to our Kubernetes cluster, and it did help reduce our bills significantly. The only catch is that AWS reserves the right to reclaim the instances with an excruciatingly short notice period of 2 minutes. Read more about spot instances.
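If you want to eyeball current spot prices before committing, the EC2 API exposes the spot price history. Here is a minimal sketch using boto3; the region and instance types are just examples, not part of the original setup:

```python
import boto3

# Query recent spot prices for the GPU instance types used in this post.
ec2 = boto3.client("ec2", region_name="us-east-1")  # N. Virginia

response = ec2.describe_spot_price_history(
    InstanceTypes=["p3.2xlarge", "p3.8xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    MaxResults=10,
)

for entry in response["SpotPriceHistory"]:
    print(entry["AvailabilityZone"], entry["InstanceType"], entry["SpotPrice"])
```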

Since everyone had become familiar with the Sagemaker environment, I looked into spot purchase options for our notebook instances. But at the time of writing this post, there was no option to spin up a Sagemaker Notebook Instance in a spot configuration. There was an option for spot training by spinning up Sagemaker training jobs, but considering that experimentation took up more of our time than training, we needed a cheaper alternative for GPU nodes with Jupyter notebooks running on them.

Projected Savings

Assuming that you run 2x p3.2xlarge and 1x p3.8xlarge constantly for a whole month in N. Virginia (the cheapest region available), this is how much you could save by just switching to spot instances:

| Instance Type | GPU Count | Sagemaker OnDemand Hourly | OnDemand Hourly | Spot Price (at time of writing) | Calculated 65% Discount | Calculated CEIL MaxPrice |
|---------------|-----------|---------------------------|-----------------|---------------------------------|-------------------------|--------------------------|
| p3.2xlarge    | 1         | $4.28                     | $3.06           | $0.92                           | $1.07                   | $1.20                    |
| p3.8xlarge    | 4         | $17.14                    | $12.24          | $3.67                           | $4.28                   | $4.40                    |

| Instance Type | Hours Run | Instance Count | Sagemaker Monthly | OnDemand Monthly | Spot Monthly (at CEIL max price) |
|---------------|-----------|----------------|-------------------|------------------|----------------------------------|
| p3.2xlarge    | 720       | 2              | $6,168.96         | $4,406.40        | $1,728.00                        |
| p3.8xlarge    | 720       | 1              | $12,337.92        | $8,812.80        | $3,168.00                        |
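To make the arithmetic explicit, here is a small back-of-the-envelope script that reproduces the monthly figures above. The only assumption is the unrounded Sagemaker hourly rates ($4.284 and $17.136, roughly a 40% premium over plain EC2 OnDemand), which the first table shows rounded to two decimals:

```python
HOURS_PER_MONTH = 720  # 30 days of continuous usage

# instance type: (count, Sagemaker hourly, OnDemand hourly, spot CEIL max price)
instances = {
    "p3.2xlarge": (2, 4.284, 3.06, 1.20),
    "p3.8xlarge": (1, 17.136, 12.24, 4.40),
}

for name, (count, sagemaker, on_demand, spot) in instances.items():
    sagemaker_monthly = sagemaker * HOURS_PER_MONTH * count
    on_demand_monthly = on_demand * HOURS_PER_MONTH * count
    spot_monthly = spot * HOURS_PER_MONTH * count
    savings = 1 - spot_monthly / sagemaker_monthly
    print(f"{name}: Sagemaker ${sagemaker_monthly:,.2f}, "
          f"OnDemand ${on_demand_monthly:,.2f}, "
          f"Spot ${spot_monthly:,.2f} ({savings:.0%} cheaper than Sagemaker)")
```

Even at the deliberately pessimistic CEIL max price, spot works out to roughly 72-74% cheaper than the equivalent Sagemaker notebooks.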

Alternatives

With our past experience adding spot instances to Kubernetes, I was fairly confident that this would bring our costs down significantly. So I decided to find Kubernetes-native options that would let us use GPU instances at spot pricing.

After some research, I narrowed the list down to two tools: Kubeflow and Polyaxon.

Polyaxon

Polyaxon is an open-source tool that could do experiment tracking across multiple projects, was Kubernetes-native, and had features that would be quite useful. It even had integrations that let us plug in data stores, notification alerts, etc. It seemed like the perfect end-to-end tool, but it was still slightly premature. We deployed it and got it up and running, but in terms of user experience, having to write YAML just to spin up a notebook server seemed really cumbersome.

Kubeflow

Kubeflow’s aim is short and simple.

The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.

It was built by Google, modelled on how Tensorflow ran internally, and then open-sourced. It was also Kubernetes-native, and the whole experience of running a notebook server seemed highly polished compared to Polyaxon.

Considering the easier onboarding with this alternative, we decided to go with Kubeflow.

Perks

There were many perks of running Kubeflow, some of which are highlighted below:

Resource sharing/Reduced compute waste

If an instance with 4 GPUs is already running and only one GPU is being consumed, other users' notebooks get scheduled onto the same instance, utilising its resources to their maximum potential, as sketched below.
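The mechanism behind this is plain Kubernetes scheduling: each notebook pod declares how many GPUs it needs, and the scheduler bin-packs pods onto nodes with free GPUs. Here is a minimal sketch using the official Kubernetes Python client; the names and image are illustrative, and it assumes the NVIDIA device plugin is installed so GPUs show up as the nvidia.com/gpu resource:

```python
from kubernetes import client

# A notebook pod that requests exactly one GPU. Several such pods can be
# bin-packed onto a single 4-GPU node instead of each user holding a
# dedicated multi-GPU instance.
notebook = client.V1Pod(
    metadata=client.V1ObjectMeta(name="user-notebook", namespace="kubeflow"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="notebook",
                image="jupyter/tensorflow-notebook",  # illustrative image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},  # one GPU, not the whole node
                ),
            )
        ],
    ),
)
```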

Spot instance support

Kubernetes is natively robust at handling spot instances and moves workloads off an instance when it is terminated, which made using spot instances super easy; the usual pattern is sketched below.
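The common pattern (the taint key and value here are hypothetical, not our exact cluster config) is to taint spot nodes and have interruptible workloads tolerate that taint, so the scheduler keeps them on spot capacity and reschedules them when AWS reclaims a node:

```python
from kubernetes import client

# Spot nodes carry a taint; workloads that can survive interruption
# tolerate it explicitly. The key/value below are a common convention,
# not a Kubernetes built-in.
spot_toleration = client.V1Toleration(
    key="lifecycle",
    operator="Equal",
    value="Ec2Spot",
    effect="NoSchedule",
)

pod_spec = client.V1PodSpec(
    containers=[client.V1Container(name="trainer", image="tensorflow/tensorflow")],
    tolerations=[spot_toleration],
)
# When AWS reclaims the node, Kubernetes sees it go away and reschedules
# the pod onto another node whose taints it tolerates.
```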

Isolated work environments

No more fuss of having to create virtual environments or confine work to specific workspace folders.

Getting it up and running

I set up an EKS cluster with spot instances and deployed Kubeflow on it.

To know more about how we got it up and running, check out How we reduced our training costs by 78%.

Here’s the supporting config files repo, which you could tweak or use right out of the box!

Signing off

This was definitely an amazing experience, and I'm glad I got to play a significant part in our cost reduction drive.

Check out more of my work at Fynd.


Last modified on 2020-09-13
