It has been a long time coming, but the hard work has finally paid off, and the last 7 months feel like only a few weeks. Couchbase is now our primary NoSQL (key-value) store in production and we are impressed with the results. This article covers our hands-on experience, our benchmarking results and the challenges we met along the way.
We work in the online advertising market, and for today’s internet user speed is everything. Latency is therefore paramount in our application design, so we needed something fast to store various user information, including targeting data.
A few years ago we chose Voldemort for its latency and speed. Unfortunately, the product was vulnerable to cluster changes and disasters, and its small user group made support difficult. memcached always looked promising, but its lack of clustering and disk persistence was too high a price for our production suite.
Then Couchbase (Membase with a persistence backend) came along. It was pretty new on the market and had gone through a couple of re-brandings in a short period, but it used memcached under the hood and added seamless clustering, auto-recovery and disk persistence. Sounds like a dream? Well, it was, but we had to wake up quickly in the middle of our migration because it was just not going the way we wanted.
As usual, one of our development teams grabbed the Java SDK and started implementing a client in our production suite. In the meantime we carved a new cluster out of 4 servers with the following specs:
- HP DL 360G7 Server
- 12 cores
- 256GB RAM
- 200GB SSD for disk persistence
- CentOS 6.1
- Couchbase Server 1.8.0
It appeared to be a capable beast, and after a short period we started populating it with data in parallel with the existing Voldemort stores. We loaded approximately 160 million keys into one of our high-throughput buckets within a matter of weeks.
Initial tests showed poor latencies of 5ms, and our application was throwing zillions of errors within seconds, so we started refactoring.
We added MBeans for visibility, went over the SDK documentation a number of times and added debugging metrics, yet whatever we did, it just did not work out for us. We began monitoring the servers themselves and noticed that their utilization (context switching, interrupts) did not seem to be affected by our application tests, so I started to believe that we were tripping up so early that we did not even make it to the servers.
This went on for some time. We even engaged commercial support, which helped pinpoint potential issues in our setup, but the results barely changed and our team had exhausted all of its options by that point. A couple of months later we felt it was time to look for something else, and my team, with the exception of me, began looking at Aerospike. I simply refused to give up and was unable to accept failure.
Without much motivation I kept looking, and after many hours of troubleshooting and Google searching I bumped into something promising that we should (perhaps) have looked at much earlier: the Java Load Test Generator from the community wiki. Disregarding our “sprint” plans, I grabbed a couple of developers along with the code and started a (somewhat) secret project to build a “real load tester” application. The code was a bit hard to read, but we managed to modify it so that it worked with keys from our pre-populated (160 million key) data bucket rather than with randomly generated garbage. We fired it from a single node and immediately squeezed more juice out of the cluster than ever before.
This is when the real development commenced, with real production data on our production suite, although we carried out the exercises off peak for our customers’ safety. We had to play with the setup to find the sweet spot, and after a day of testing we finally made a breakthrough. It became obvious that our client implementation was the problem, not the product, but without another implementation alongside and enough hands-on experience this had been impossible to prove. Within a couple of weeks we completely refactored our client implementation based on the load tester, and I am happy to say that it has now been in production for a good couple of months without a single glitch.
- Extracted 2 million keys out of our production bucket into a flat file; the load test client reads this file into memory during startup
- Ensured that all of our buckets are 100% memory resident (in-memory store only); even with SSDs, reading from disk is just too costly for us
- Upgraded our cluster to version 1.8.1 and also patched it with Hotfix MB-6550
- Vacuumed all of our persisted databases on disk (SQLite) to ensure maximum performance (our write-only production data population was running constantly during our benchmark tests)
- Copied the modified ycsb.jar along with the rest of the load test code to our NFS share
- Adjusted the JVM to use a 1G max heap along with the recommended JVM options; however, the difference was negligible
- Executed the load test from 20 production application servers, all at once
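The approach in the list above (real keys from a flat file, held in memory, replayed from several threads) can be sketched roughly as follows. This is not the modified Java load tester itself, only a minimal illustration in Python. The names `load_keys` and `run_load_test` are hypothetical, and `get` stands in for whatever client call fetches a single key.

```python
import random
import threading
import time


def load_keys(path):
    """Read one key per line from a flat file into memory, skipping blanks."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def run_load_test(get, keys, threads=16, requests_per_thread=1000, seed=0):
    """Replay GETs for randomly chosen real keys from several threads.

    `get` is a placeholder for the client call under test (an assumption,
    not a real SDK method). Returns (latencies_in_seconds, requests_per_second).
    """
    latencies = []
    lock = threading.Lock()

    def worker(worker_id):
        rng = random.Random(seed + worker_id)  # per-thread RNG, reproducible
        local = []
        for _ in range(requests_per_thread):
            key = rng.choice(keys)
            start = time.perf_counter()
            get(key)
            local.append(time.perf_counter() - start)
        with lock:  # merge once per thread to keep the lock off the hot path
            latencies.extend(local)

    pool = [threading.Thread(target=worker, args=(i,)) for i in range(threads)]
    started = time.perf_counter()
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    elapsed = time.perf_counter() - started
    return latencies, len(latencies) / elapsed
```

With an in-memory dict as a stand-in store, `run_load_test(store.get, keys)` exercises the harness without a cluster. Swapping in a real client call is the only change needed to point it at a bucket.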
Results and Conclusion
- Couchbase scaled very well during our test: we achieved 400,000 GET requests / second with an average latency of 400us (microseconds) and a 99th percentile of 4ms, without using Multi-GET
- CPU usage was not too heavy, although we observed 50K to 120K interrupts / second per cluster node during load testing
- Increasing the client-side threads beyond 16 doubled the latency for only a 20% increase in throughput, although this is likely caused by a limitation of the client application (or its hardware)
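To put the figures above in context, here is a small sketch of how a 99th percentile can be derived from raw latency samples, and of what 400,000 GETs / second at 400us implies for concurrency. The nearest-rank percentile method is an assumption (the benchmark tooling may have computed it differently), and both function names are hypothetical. The concurrency estimate uses Little's law, which states that average requests in flight equal throughput multiplied by average latency.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def in_flight_requests(throughput_per_s, avg_latency_us):
    """Little's law: average concurrency = throughput x average latency."""
    return throughput_per_s * avg_latency_us / 1_000_000


# 400,000 GETs/s at an average of 400us keeps about 160 requests in flight
# across the whole cluster, i.e. about 8 per load-generating server with 20.
cluster_concurrency = in_flight_requests(400_000, 400)
per_client_server = cluster_concurrency / 20
```

The low per-server concurrency is consistent with the observation that adding client threads beyond a certain point mostly adds queueing, and therefore latency, rather than throughput.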
It seemed that Couchbase really shines between 15% (~75K) and 80% (400K) of its maximum throughput (520K). Low-volume buckets (1-2 GETs / second or less) only reach 600 - 700us regardless of disk write queue, resident ratio, etc. This suggests that there is perhaps some kind of prefetch mechanism for actively requested data sets, so if you request something frequently, you get better results. Adding more nodes with less memory each would increase throughput, improve bucket sharding and reduce network-generated interrupts.
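As a rough illustration of that band, the sketch below computes the 15% and 80% bounds for a measured maximum and checks whether a given load falls inside them. The function names are hypothetical and the percentages are simply the ones observed on our cluster, not a general rule. Note that the exact bounds for 520K are 78K and 416K; the ~75K and 400K quoted above are rounded.

```python
def sweet_spot_band(max_throughput, low_pct=15, high_pct=80):
    """Return the (low, high) requests/second bounds of the band.

    Integer arithmetic keeps the bounds exact for whole-number inputs.
    """
    return max_throughput * low_pct // 100, max_throughput * high_pct // 100


def in_sweet_spot(current_throughput, max_throughput):
    """True when the current load sits inside the band for this maximum."""
    low, high = sweet_spot_band(max_throughput)
    return low <= current_throughput <= high


low, high = sweet_spot_band(520_000)  # bounds for our measured 520K maximum
```

A check like this could feed a capacity alert: a bucket at 1-2 GETs / second sits far below the band, while a sustained load above the upper bound suggests it is time to add nodes.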
In a nutshell: Couchbase performs better under pressure, but you have to keep the pressure within 80% of your maximum throughput to maintain the lowest latency.