How to design a stable storage system


Riak CS editor's selection

We went through every link in the evolutionary chain of storage management: filesystem -> network filesystem -> object storage. Let me quickly share what I learned.


Filesystems

XFS and Ext2/3/4 are all fast and mature filesystems. But when you add disks to increase the space of your LVM volume, you discover that the risk of downtime increases as well. See the article on scalability, where I mention the Dirichlet principle.
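The growth of that risk can be sketched with simple arithmetic. The snippet below is only an illustration under stated assumptions: the volume spans all disks without redundancy (so losing any single disk takes the volume down), disks fail independently, and the 3% per-disk failure probability is a made-up number, not a measurement.

```python
def p_any_disk_fails(p_single: float, n_disks: int) -> float:
    """Probability that at least one of n independent disks fails.

    Assumes a volume that goes down when any single member disk is lost.
    """
    return 1.0 - (1.0 - p_single) ** n_disks


# Illustrative per-disk failure probability (an assumption, not real data).
P_SINGLE = 0.03

for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} disks -> risk of losing the volume: {p_any_disk_fails(P_SINGLE, n):.3f}")
```

Every added disk raises the chance that the whole volume goes down, which is why extra capacity on a single server comes at the price of reliability.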

To minimize that risk, you would have to use a network filesystem with additional servers.


Network Filesystems

We tried most of the network filesystems.

The most annoying issues with this type of filesystem are poor scalability and poor fault tolerance. They are even more annoying if you store application assets on a network filesystem and keep links to them in an RDBMS. In our case we had 40 physical application servers with data stored on NFS. Once in a while NFS lost some files, so the app UI looked broken.


Object storage is the next logical step in the evolution of storage systems. It addresses the issues of fault tolerance and scalability, provides an application interface, and allows you to store metadata as attributes of objects.

An application can talk directly to the object storage, avoiding the redundant logic that an RDBMS would require. If you do use an RDBMS alongside object storage, primary keys can now be stored in object metadata.
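To make the metadata idea concrete, here is a minimal in-memory sketch. The `ToyObjectStore` class, the key layout, and the metadata field names are all hypothetical and stand in for a real object storage client; the point is only that the RDBMS primary key travels with the object instead of living in a separate lookup table.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StoredObject:
    data: bytes
    metadata: Dict[str, str] = field(default_factory=dict)


class ToyObjectStore:
    """In-memory stand-in for an object storage bucket (illustration only)."""

    def __init__(self) -> None:
        self._objects: Dict[str, StoredObject] = {}

    def put(self, key: str, data: bytes, metadata: Dict[str, str]) -> None:
        # Copy the metadata so later changes by the caller do not leak in.
        self._objects[key] = StoredObject(data, dict(metadata))

    def get(self, key: str) -> StoredObject:
        return self._objects[key]


store = ToyObjectStore()

# The application keeps the RDBMS primary key as an attribute of the object.
store.put(
    "avatars/42.png",
    b"\x89PNG...",  # placeholder bytes, not a real image
    {"user-id": "42", "content-type": "image/png"},
)

obj = store.get("avatars/42.png")
print(obj.metadata["user-id"])
```

With the key stored next to the data, the application can go from an object back to its database row without an extra table of file paths.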


Object Storage

We have tested most of the object storage systems.

CouchDB stores everything in JSON, so binary objects become very large.
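The size overhead is easy to estimate. The sketch below assumes the binary payload is embedded in a JSON document as a base64 string, which is the usual way to carry bytes in JSON; the document layout and field names are made up for the example and are not CouchDB's actual format.

```python
import base64
import json
import os


def json_embedded_size(payload: bytes) -> int:
    """Size in bytes of a JSON document that embeds the payload as base64."""
    doc = {"name": "blob", "data": base64.b64encode(payload).decode("ascii")}
    return len(json.dumps(doc).encode("utf-8"))


raw = os.urandom(300_000)  # random bytes standing in for a binary object
encoded_size = json_embedded_size(raw)
ratio = encoded_size / len(raw)

print(f"raw: {len(raw)} bytes, as JSON: {encoded_size} bytes, ratio: {ratio:.2f}")
```

Base64 turns every 3 bytes into 4 characters, so the stored document is about a third larger than the original object before any other overhead is counted.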

Cassandra has unpredictable performance, like any other bloated system with that many background tasks.

OpenStack Swift. While its quality is acceptable, unlike other parts of OpenStack, it stops serving objects when one of the three assumed nodes fails. From my point of view, that is a wrong implementation of eventual consistency. Still, we used OpenStack Swift for about two years, until the load became too big and maintenance became too difficult.

LeoFS has a poor implementation, as our studies have shown. It uses Erlang's "ets" for storing metadata, which makes resource consumption unpredictable.

ZODB with ZEO is too slow. It also uses BerkeleyDB as a backend. It would be very difficult to restore BerkeleyDB in case of failure, as there are no tools for maintenance.

MongoDB. The first time I checked, it was too unstable: it just segfaulted when we stored 9 million small objects. Later I had to use MongoDB anyway, and it turned out to have a master-slave type of cluster. This means changes from the master do not appear instantly on the slaves, which is not good for highly loaded systems.
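The effect of master-slave replication on reads can be shown with a toy model. This is not MongoDB's actual replication protocol; the `Master` and `Slave` classes are hypothetical and only demonstrate why a read from a slave can return stale data until the replication log has been applied.

```python
class Master:
    """Accepts writes and records them in a replication log."""

    def __init__(self) -> None:
        self.data = {}
        self.log = []

    def write(self, key, value) -> None:
        self.data[key] = value
        self.log.append((key, value))


class Slave:
    """Serves reads from its own copy, which lags behind the master."""

    def __init__(self, master: Master) -> None:
        self.master = master
        self.data = {}
        self.applied = 0  # number of log entries already applied

    def replicate(self) -> None:
        # Apply only the log entries that have not been seen yet.
        for key, value in self.master.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.master.log)

    def read(self, key):
        return self.data.get(key)


master = Master()
slave = Slave(master)

master.write("object-1", "v1")
stale = slave.read("object-1")   # replication has not run yet
slave.replicate()
fresh = slave.read("object-1")   # now the slave has caught up

print(stale, fresh)
```

Under heavy write load the window between the write and the replication step grows, and more reads land inside it.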

Redis is good for caching, but not on highly loaded systems. When you turn on persistence, it becomes too slow for high load.

Amazon S3 is expensive. Also see the article Dangers of the Clouds.

Riak CS is currently the best object storage system on the market. It has predictable memory and performance characteristics, as well as predictable latencies. Applications written using the Erlang Open Telecom Platform are known to work for decades without human intervention. Riak is implemented using the latest studies in distributed systems and is meant for installations that consist of three or more physical servers. Nevertheless, it is stable and fast.


This knowledge should help you avoid mistakes and redundant parts in software design.


28 November, 2017