Excitements, confusions and frustrations

Knowledge without action is wastefulness and action without knowledge is foolishness. - Al-Ghazali

High I/O Redis server with persistence

Posted at — Jun 9, 2018

TL;DR version

When using Redis as primary data storage, or in situations where maximum persistence is required, consider:

  • scaling out (more cluster nodes, each holding less data) rather than scaling up with high-spec servers;
  • the impact of background RDB saves on AOF write latency once the data on a single node grows large.

Long version

We have been using Redis since its 2.X release for multiple purposes.

Well, we also use it as the primary data store for our subscription platform.

It is arguable whether this is good practice, but this post is not about that. In this post I will share our experience with a scaling issue we ran into when using a high I/O Redis server with persistence.

Background

We store user information in Redis hashes, and run a Redis Cluster with 4 master nodes and 4 slave nodes. The user information includes something similar to a point system that can be updated by multiple other services, so the platform has to provide real-time user information to those services with maximum persistence.
Therefore, both AOF and RDB persistence are enabled on our Redis servers.

  • AOF stands for Append Only File: every write operation is appended to a log file. This makes Redis slower, because every write waits for a successful disk write (how often the file is actually fsynced depends on the appendfsync policy; see the snippet below).
  • RDB stands for Redis DataBase: a compressed point-in-time snapshot of the Redis data. It doesn’t impact performance as much as AOF persistence does. However, it may not contain all the data as a backup (unless BGSAVE is triggered for each write, which is very inefficient).
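
Both are configured in redis.conf. A minimal sketch of the AOF side with illustrative values, not our exact production settings (the RDB save rules from our actual config are shown further below):

appendonly yes           # enable AOF persistence
appendfsync everysec     # always = fsync on every write, everysec = fsync once per second, no = let the OS decide
# RDB snapshots are configured separately with "save <seconds> <changes>" rules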

In the very beginning, the platform was in its alpha release phase and held about 40K~50K users’ information. Everything was stable and fast.

But about a year later, we planned a mass insertion into the platform, increasing the user count by about 1000 times.
As anyone would, we began our performance testing. We ran two types of tests: a redis-benchmark test and an API server load test using Gatling, each for different numbers of users: 5 million, 25 million and 50 million.
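
The redis-benchmark part looked roughly like the following; the host, request count, concurrency and payload size here are placeholders rather than our exact test parameters:

redis-benchmark -h 10.0.0.1 -p 6379 -t set,get -n 1000000 -c 200 -d 256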

Issue

There was no major problem when the user count was 5 million or 25 million. However, when we tested at the highest required QPS with 50 million users’ information, about 5% of the requests in the API load test failed. From the logs we concluded that Redis was the bottleneck. Yet the redis-benchmark tests showed that Redis could handle I/O at even higher QPS while holding 50 million users’ information.

So the situation was that the Redis server could handle the required writes for a short time, but under continuous writes over a long period, significant latency occurred.

We noticed that the API responses had errors every 60 seconds. Also, while watching redis-cli -c INFO every second, we confirmed that the errors happened whenever rdb_bgsave_in_progress was 1.
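
A simple way to watch this correlation is to poll the persistence section of INFO every second; a minimal sketch, assuming redis-cli can reach the node directly:

while true; do
    redis-cli -p 6379 INFO persistence | tr -d '\r' | grep -E 'rdb_bgsave_in_progress|rdb_last_bgsave_time_sec|aof_delayed_fsync'
    sleep 1
done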

The configuration in redis.conf looked like this:

#   save <seconds> <changes>
#
#   Will save the DB if both the given number of seconds and the given
#   number of write operations against the DB occurred.

save 900 1
save 300 10
save 60 10000

So Redis was taking a snapshot every 60 seconds, because the number of changes within 60 seconds was far more than 10K.

But why was this causing latency? Snapshotting is supposed to be done in a forked child process, with minimal impact on the Redis server. And why didn’t it happen in the other cases, when the user count was smaller?

It turned out the issue was happening because the background save was taking too long, due to the large amount of data. This kept the disk I/O busy, and all writes slowed down as well, because with AOF enabled each update waits to be written to disk. Meanwhile, requests to the API were failing due to timeouts.
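
As an aside, redis.conf also has a knob that skips the AOF fsync while a background save or rewrite is running, trading durability during that window for latency. It is not the route taken in the resolution below, just a related setting worth knowing about:

no-appendfsync-on-rewrite yes   # do not fsync the AOF while BGSAVE or BGREWRITEAOF is in progress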

The logs from one of the Redis nodes were as follows:

23:42:47.211 * 10000 changes in 60 seconds. Saving...
23:42:47.314 * Background saving started by pid 10210
23:43:04.044 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
23:43:10.361 * DB saved on disk

Resolution

We initially thought of implementing a rolling BGSAVE to take control of snapshotting (see the sketch below). However, the issue was not happening because we lacked control over snapshot timing; it was happening because the amount of data on each node was too large.
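
For reference, a rough sketch of what that rolling BGSAVE could look like, with hypothetical node names and automatic saves disabled; we did not end up adopting it:

# disable automatic snapshotting (repeat on every master node)
redis-cli -h node1 -p 6379 CONFIG SET save ""

# then, from a scheduled job, snapshot one node at a time
for node in node1 node2 node3 node4; do
    redis-cli -h "$node" -p 6379 BGSAVE
    # wait for this node to finish before moving on to the next
    while [ "$(redis-cli -h "$node" -p 6379 INFO persistence | tr -d '\r' | awk -F: '/rdb_bgsave_in_progress/ {print $2}')" = "1" ]; do
        sleep 5
    done
done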

From Redis Cluster specification page:

in a Redis Cluster with N master nodes you can expect the same performance as a single Redis instance multiplied by N as the design scales linearly.

So this meant we could reduce the data on each node by increasing the cluster size. We resolved the issue by increasing the master node count from 4 to 8, which halved the data on each node.
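
On Redis 5 and later this kind of scale-out can be driven with redis-cli's built-in cluster support (older releases used redis-trib.rb); the addresses below are placeholders:

# add a new empty master to the cluster
redis-cli --cluster add-node 10.0.0.5:6379 10.0.0.1:6379

# move a share of the hash slots onto the new master
redis-cli --cluster reshard 10.0.0.1:6379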

Additionally, the server spec of each Redis node was too high: 2 cores and 32 GB of memory. This was truly inefficient, because the snapshot process was slow even when the data size was less than 1 GB, so the remaining 31 GB of memory was never going to be used. It would have been better to have lower-spec servers (1 core and 2 GB of memory, ideally) in the beginning, which would have avoided this inefficiency.
