Network And Distributed Filesystems On Linux

network file systems

Before choosing Riak CS, we considered using a network filesystem. This is the result of our investigation.

GlusterFS (1.3.12)

Using Hurd-like translators (modules), it is possible to build scalable storage of practically arbitrary size.

  • Entirely in userspace


  • Synchronisation with an NTP server is required for AFR
  • Doesn't deal well with a large number of small files. For example, it takes up to 20 seconds to `ls` a directory with 3500 files.
  • Nameserver is a single point of failure
  • The ALU scheduler doesn't work properly, which exhausts disk space on some nodes
  • It is very difficult to create a configuration of performance translators on the client that does not sacrifice either read or write speed. For example, optimising for reads in my case caused an archive to be unpacked 22 times slower than on NFS (~11 hours vs ~30 minutes) in a configuration of 1 AFR on 2 unified nodes.
  • Gluster stores its metadata in extended attributes of the underlying filesystem, so on ext2, for example, 'user_xattr' must be enabled.
  • Inotify doesn't work yet
  • No shared mmap support in libfuse. This means that applications like ctorrent will not work with Gluster.
  • Mandatory locking, provided by the 'posix-locks' translator, does not prevent file deletion.
  • Sometimes it is impossible to umount the fs
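
The extended-attribute requirement above can be checked before deploying a brick. The following is a minimal sketch, assuming a Linux host and assuming that a probe in the `user.*` namespace is a good enough indicator; the function name `supports_user_xattr` and the attribute name `user.xattr_probe` are made up for illustration and are not part of Gluster.

```python
import os
import tempfile


def supports_user_xattr(directory):
    """Return True if a file created in `directory` accepts a user.* xattr.

    Hypothetical helper: it only probes the user.* namespace and says
    nothing about which attributes Gluster itself reads or writes.
    """
    if not hasattr(os, "setxattr"):
        # os.setxattr / os.getxattr are only available on Linux
        return False
    with tempfile.NamedTemporaryFile(dir=directory) as probe:
        try:
            os.setxattr(probe.name, "user.xattr_probe", b"1")
            return os.getxattr(probe.name, "user.xattr_probe") == b"1"
        except OSError:
            # Usually "operation not supported", e.g. when the filesystem
            # is mounted without user_xattr
            return False
```

Calling `supports_user_xattr("/path/to/export")` on the directory you plan to export (the path here is a placeholder) returns `False` when the mount options need fixing.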


  • Necessity to patch FUSE for flock()
  • Striping is very slow. Striping was invented to distribute the load when accessing very big files (100Gb - 2Tb)


  • Poor scalability. One of the most frequent issues with NFS is a partition filling up, which makes it necessary to move volumes between partitions or servers to give them more space, and then to remount them once moved
  • Best only in small to mid-size installations
  • NTP synchronisation is required.


  • Limit on partition size (2Tb as far as I remember)
  • Limit on file size, 2Gb
  • Poor support


A collection of servers is grouped together administratively into a 'cell'. Each cell consists of volumes (kind of like partitions). In the AFS root, directories such as 'etc', 'home' and 'tmp' are mounted volumes.

There are RW and RO volumes. A server can hold RO copies of volumes from other servers, so if a server dies and an RO copy of its volume exists on a live server, clients will use that RO copy. A volume can have 0 to 6 distributed RO copies.

Suppose you have three servers: fs1, fs2 and fs3.
fs1 has the RW volume abc and fs2 contains an RO copy of abc; fs2 has the RW volume cde and fs3 contains an RO copy of cde.

AFS automatically tracks volume location, so it is possible to move volumes from one server to another without users noticing anything.
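
The fs1/fs2/fs3 example can be sketched as a small lookup table. This is a simplified model for illustration only, not the real AFS volume location database or its API: the `VOLUMES` table and the `locate` function are hypothetical, and the "prefer RW, fall back to RO" policy is an assumption made to keep the sketch short.

```python
# Hypothetical model of the fs1/fs2/fs3 layout; not the real AFS API.
VOLUMES = {
    "abc": {"rw": "fs1", "ro": ["fs2"]},
    "cde": {"rw": "fs2", "ro": ["fs3"]},
}


def locate(volume, alive, want_write=False):
    """Return (server, mode) for a reachable copy of `volume`, or None.

    `alive` is the set of servers that currently respond. Assumption:
    the RW copy is preferred and RO copies are used only as a fallback.
    """
    entry = VOLUMES[volume]
    if entry["rw"] in alive:
        return entry["rw"], "rw"
    if not want_write:
        # Writes cannot be served from a read-only copy
        for server in entry["ro"]:
            if server in alive:
                return server, "ro"
    return None
```

With fs1 down, `locate("abc", {"fs2", "fs3"})` resolves to the RO copy on fs2, while a write request for the same volume finds no usable copy. The client only ever names the volume, never the server.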

  • Client cache to reduce network load. It is very important in configurations with a large number of hosts. The Cache Manager can be configured to use RAM
  • Location independence. It means that a user doesn't need to know which fileserver contains the file, only its pathname
  • Replicated volumes. A volume that contains frequently accessed data can be replicated read-only to several servers
  • It is possible to autobalance disk usage


  • Difficult to set up '/' on AFS, which cannot store special files.
  • Mandatory authentication and some implications on security, related to automounting
  • Limit of 256 fileservers per cell
  • Small volume sizes are recommended for ease of administration (e.g. upgrades, load balancing)
  • Necessity to perform 'vos release' for 'releasing' data to RO volumes



  • Changes made to the cluster on one machine show up immediately on all other machines in the cluster
  • Requires a dedicated shared storage server. It is possible to combine several of them using a volume manager, but the complexity and implications of such an approach are unclear
  • GFS2 is highly experimental and must not be used in a production environment yet

According to materials from "Proceedings of the Linux Symposium" (June 27th-30th, 2007, Ottawa, Ontario, Canada):

When using the fcntl(2) command F_GETLK note that although the PID of the process will be returned in the l_pid field of the struct flock, the process blocking the lock may not be on the local node. There is currently no way to find out which node the lock blocking process is actually running on, unless the application defines its own method. GFS2 does not support inotify nor do we have any plans to support this feature. We would like to support dnotify if we are able to design a scheme which is both scalable and cluster coherent.


  • Has support, but it is Oracle
  • Maximum number of nodes is 256
  • No exclusive write lock
  • No ACLs
  • Requires entire HDD as block device. OCFS2 on top of a software RAID is not supported


PVFS2 is intended for high-performance computing and therefore has no protection against failures.

  • Arbitrary cluster size
  • High performance
  • When a data server or metadata server fails, the filesystem goes down. When a RAID is in degraded mode, the fs is still operable, but the speed of the entire fs is limited by the slowest device. For example, if you have 4 nodes and one of them is running in degraded mode, all nodes will run at that same speed, but the FS will still be fine.
  • Data migration between servers is not possible
  • Hardlinks do not work
  • Does not have locks
  • Patches must be applied to the kernel from Linus.
  • Seems can be built only for kernels >
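
The degraded-mode behaviour described above can be put into numbers with a very rough model. The sketch assumes that every striped request has to wait for the slowest server, so each node effectively runs at the minimum speed. The function name and the MB/s figures are made up for illustration and are not measurements.

```python
def effective_throughput(server_speeds_mb_s):
    """Aggregate throughput when all servers are held to the slowest one.

    Simplified model (an assumption, not a PVFS2 formula): each of the
    n servers contributes only as much as the slowest server can deliver.
    """
    if not server_speeds_mb_s:
        raise ValueError("at least one server is required")
    slowest = min(server_speeds_mb_s)
    return slowest * len(server_speeds_mb_s)


healthy = effective_throughput([100, 100, 100, 100])   # all four nodes fine
degraded = effective_throughput([100, 100, 100, 25])   # one node degraded
```

In this model a single node dropping from 100 to 25 MB/s cuts the aggregate from 400 to 100 MB/s, which is why one degraded RAID hurts the whole filesystem even though it stays online.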


  • Lustre. Like Gluster, but sometimes causes kernel oopses. Also it is in RHEL and SUSE, but not in Debian
  • Parallel NFS. pNFS is an extension to the NFSv4 protocol, not compatible with NFSv4
  • Plan 9 fs. Strictly dependent on the OS
  • Circle. Written in Python
  • CIFS. Microsoft
  • CXFS (clustered XFS). SGI, not open source
  • GFarm. Too many unresolved problems, such as unimplemented group permissions or lack of support for RO metadata servers. Also it is not in Debian etch and its scripts are for an older version of slapd. Too difficult to deploy
  • Ceph. The first time I checked it was ready for production. The second time I checked, it was used by big providers, such as Aixit Gmbh for providing "cloud" services in test mode. But you cannot choose the right block size of the Ceph filesystem for optimal performance. As a result it underperforms in speed tests.

15 June, 2012