Scaling TezLab: Tuning Kubernetes

Aug 1, 2022
TezLab is a companion platform on iOS and Android for Tesla owners to better understand their battery behavior, past trips, and comparative performance.
A bit over two years ago we migrated TezLab from Heroku to Kubernetes on Google Cloud. Overall, we're quite happy with the move: it has given us much better control over cost and plenty of room to experiment, which was crucial while scaling TezLab.
With the continued success of TezLab we've had some recent growing pains. TezLab lets users track their trips and logs a ton of data about their battery usage and where they've been, both to provide insights about their driving patterns and to gamify the experience. As the platform grew, with more drivers added each month, we eventually ran into database performance issues that spurred us to make some significant changes.
Approach
To buy ourselves some time, we initially threw more hardware at the problem. This reset the clock (sort of), but it was not a long-term solution; it did, however, give us the time we needed to home in on the root cause. When the issue cropped up, our DB memory was configured at 40GB. During our experimentation we scaled it up to 60GB, then 104GB (the maximum for the number of CPUs we had provisioned), and then, once the issue was resolved, back down to 85GB (still a bit higher than it needs to be; it is good to leave yourself some headroom). Because we were on GCP, making these tweaks was trivial and incurred minimal cost.
Once the DB was stabilized we needed to further isolate the problem. The built-in facilities of GCP were a pretty good starting point for insight into the health of the DB, Kubernetes hosts, and pods, but we needed to go deeper. At this point we focused on making things configurable: memory settings / limits, number of replicas, worker concurrency, database connection count, etc. Each customization was a simple YAML tweak accompanied by a quick Capistrano task execution. Had we still been on Heroku, control at this granularity, and with this speed, would likely not have been possible.
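To give a flavor of what those YAML tweaks look like, here is a hypothetical worker Deployment fragment; the names, env vars, and values are illustrative, not our actual configuration:

```yaml
# Hypothetical fragment only; names and values are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker
spec:
  replicas: 4                          # number of worker pods
  template:
    spec:
      containers:
        - name: worker
          env:
            - name: SIDEKIQ_CONCURRENCY  # worker concurrency
              value: "10"
            - name: DB_POOL              # database connections per process
              value: "12"
          resources:
            requests:
              memory: "512Mi"
            limits:
              memory: "1Gi"            # hard memory limit per pod
```

Having each knob in version-controlled YAML meant a change was a one-line diff plus a deploy, which made it cheap to iterate while hunting the problem.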
Eventually we turned our attention to the DB directly, and to its clients. We adjusted settings like work_mem, maintenance_work_mem, max_wal_size, max_connections, and random_page_cost (tuning ourselves for SSDs). All of these changes helped, but they were somewhat orthogonal to a lingering issue that we had not yet been able to pinpoint.
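For reference, these are the kinds of postgresql.conf lines involved; the values below are illustrative, since the post does not record the exact numbers we settled on:

```
# Illustrative postgresql.conf values only, not our actual settings.
work_mem = '16MB'                 # per-sort / per-hash memory budget
maintenance_work_mem = '512MB'    # VACUUM, CREATE INDEX, etc.
max_wal_size = '4GB'              # wider checkpoint spacing
max_connections = 200             # hard cap on client connections
random_page_cost = 1.1            # SSDs make random reads nearly as cheap as sequential
```

Lowering random_page_cost in particular nudges the planner toward index scans, which is usually the right trade-off on SSD-backed storage.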
Lastly, before going back to the drawing board, we experimented with host autoscaling. In short, it was a bust. The number of hosts immediately went up to the maximum capacity and never looked back. From a cost-management standpoint, this was exactly the direction we did not want to head in.
Solution
At this point we had learned a tremendous amount about where the issue was not, and we were still seeing less-than-ideal behavior from our app stack. After some Googling, it turned out that there is a known issue with Sidekiq, our background job processing framework of choice, that could result in unbounded memory growth on DB connections to Postgres / PostGIS databases. Eureka! Finally something concrete we could work with! Immediately an adage from a former colleague of mine came to mind. He asserted that "PHP must die;" as it turns out, Sidekiq workers connected to PostGIS must [also] die.
During the week, with frequent deploys, we were able to keep the DB memory utilization under control, but this was not a sustainable approach. To combat this problem I created sidekiq-max-jobs, a Ruby gem that lets you configure the maximum number of jobs a Sidekiq worker will process, or a maximum runtime, before it terminates itself. For an environment running on Kubernetes it is a perfect addition: once the affected pod dies it is automatically restarted, gracefully resetting memory, database connections, etc. with minimal interruption to throughput. This was exactly what we needed.
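The core idea can be sketched as a Sidekiq server middleware that counts jobs and asks the process to shut down once a threshold is hit. This is an illustration of the approach, not the gem's actual implementation; the class name `MaxJobs` and its parameters are hypothetical.

```ruby
# Sketch of the idea behind sidekiq-max-jobs: count the jobs this worker
# process has handled and trigger a clean shutdown at a threshold.
# Illustrative only; not the gem's actual code.
class MaxJobs
  def initialize(max_jobs:, terminate: -> { Process.kill("TERM", Process.pid) })
    @max_jobs  = max_jobs
    @processed = 0
    @mutex     = Mutex.new   # job counter is shared across Sidekiq's threads
    @terminate = terminate
  end

  # Sidekiq server middleware interface: call(worker, job, queue, &block)
  def call(_worker, _job, _queue)
    yield
  ensure
    done = @mutex.synchronize { (@processed += 1) >= @max_jobs }
    # TERM lets Sidekiq finish in-flight jobs and exit cleanly; Kubernetes
    # then restarts the pod with fresh memory and fresh DB connections.
    @terminate.call if done
  end
end
```

In a real app this would be registered in the server middleware chain from a Sidekiq initializer; the actual gem is configured rather than hand-rolled, so see its README for the real knobs.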
In addition to integrating sidekiq-max-jobs, we also took the opportunity to split our cluster into multiple pools. We had been operating under the assumption that there was cross-talk between our API and worker nodes and wanted to test this theory. Breaking things out allows us to scale the different layers of our infrastructure independently. Rather than one pool, we broke things out into three: app (our web and API servers), async (our background workers), and infrastructure (everything else: caching, load-balancing, ingress / egress). We used instance tagging / labeling to control host affinity.
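In Kubernetes terms, that label-based affinity boils down to something like the following fragment; the label key and value are hypothetical, not our actual names:

```yaml
# Hypothetical example of pinning a workload to a node pool via labels.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker
spec:
  template:
    spec:
      nodeSelector:
        pool: async    # schedule background workers onto the async pool only
```

With each Deployment pinned to its pool this way, resizing one layer (say, the workers) no longer disturbs the machines serving web and API traffic.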
Conclusion / Results
After breaking out the pools, users noticed and reported a significant improvement in the speed of the mobile app, particularly on initial load, which in the worst case had been unreasonably long and is now consistently under 5 seconds. The multi-pronged approach to chasing down the root cause worked, as it typically does. As a result of our tweaks we were ultimately able to scale our DB back down and right-size our nodes, per pool, which will save us money in the long run. Along the way we hardened our alerting / monitoring, and overall we feel much better poised for horizontal scalability, and happy customers, going forward.
Until next time…