GooFS: the filesystem we thought we do not need
———————————————–

Legacy applications call for a compromise

The recurring idea of Erlang on Xen is to keep an interface and change its implementation according to the vision of a truly scalable cloud. One such interface is a ‘filesystem’. It was fully reimplemented to expose mostly synthetic files and files accessible over the network. Contrary to the industry custom, our current implementation does not embrace locally-attached block devices. For such devices, a key/value datastore is being designed, that should expose a different interface – ‘ets/dets’.

We strive to support all important Erlang applications with our LING VM. Adaptation of ejabberd proved that we need to support at least a mockery of a filesystem over a writable block device. Such mockery is the GooFS filesystem described here.

A typical usage scenario

A GooFS formatted block device usually serves as a backend for a mnesia database that keeps data in several dets tables mapped one-to-one to files. For example, the mnesia database of ejabberd uses 29 files all created in a single directory. The longest file name is 23 characters. This directly translates into design constraints of the new filesystem.

A simplest filesystem that fills the need

  • A flat filesystem, no directories

  • A limited number of files, at least 256

  • A file name may be long, at least 256 chars

  • Must not get corrupted easily, should survive domain_poweroff()

  • Must be easy to implement, in a few days

Note that GooFS is not the only modern filesystem, which is flat. See Amazon S3.

Inside a GooFS volume

Every GooFS volume has 6 areas:

  • superblock

  • root record

  • file directory

  • transaction journal

  • allocation table

  • data extents

The superblock is a single sector (sector 0) that contains the general information about the volume.

The root record contains information about the root directory. The information is similar to what is kept for ordinary files. The root record occupies a single sector (sector 1).

The file directory contains information about all files that exist on the volume. Each directory entry occupies a single sector starting with sector 1. The length of the file directory – nr_files – is set upon formatting of the volume. It must be at least 255. Each file is addressed using an integer value, which is actually an index into the file directory. nr_files must be less then 65536.

The transaction journal contains information about the current outstanding transaction. It occupies exactly 32 blocks. Each journal entry is 16 bytes long limiting the transaction to 1024 blocks. The number of entries actually used – trans_len – is recorded in the superblock. Note that updating the entire allocation table requires writing to no more than 512 blocks. The transaction journal starts at sector 1 + nr_files.

The allocation table keeps track of chunks of disk space – extents – allocated to files. The size of a data extent – extent_size – is fixed upon formatting. A minimum value of extent_size is 128 sectors. The number of data extents – nr_extents – must be less than or equal to 65536. The allocation table has an entry for each extent. It starts at sector 1 + nr_files + 32. Each allocation table entry is 4 bytes long. Thus each sector holds 512/4 = 128 such entries. Hence the size of the allocation table – atab_size – is (nr_extents+127)/128.

The data extents area is where all file contents is stored. It immediately follows the allocation table. Thus the area starts at sector 1 + nr_files + 32 + atab_size.

Note that by increasing the extent_size value the filesystem may occupy volumes of almost any practical size. On average, a half of extent_size will be ‘wasted’ for each file.

The superblock
Starts at Size
0 1

The byte ordering here are later in the document is big-endian.

Offset Type Description
0 uint16_t nr_files
2 uint16_t nr_extents
4 uint32_t extent_size
8 uint64_t 0x2f476f6f2f46532f, magic
16 uint16_t version, must be 0x1
18 uint16_t trans_len, 0 – no outstanding transaction
20 UNUSED

The root record

Starts at Size
1 1
Offset Type Description
0 uint32_t version, incremented upon each modification
4 uint64_t object unique id
12 uint32_t mode, opaque
16 uint32_t atime, access time, Unix timestamp
20 uint32_t mtime, modification time, Unix timestamp
The file directory
Starts at Size
2 nr_files

Each directory entry occupies a whole sector. The layout of the entry is as follows:

Offset Type Description
0 uint16_t 0 – free entry, 0xffff – occupied entry
2 uint64_t file size
10 uint32_t version, incremented upon each modification
14 uint64_t file unique id
22 uint32_t mode, opaque
26 uint32_t atime, access time, Unix timestamp
30 uint32_t mtime, modification time, Unix timestamp
34 uint16_t name len
36 char[len] name

Note that the maximum file name length is 476.

The transaction journal
Starts at Size
2 + nr_files 32

Each transaction log entry has the following layout:

Offset Type Description
0 uint64_t source sector
8 uint64_t target sector

The number of active elements in the transaction log – trans_len – is recorded in the superblock.

The allocation table
Starts at Size
2 + nr_files + 32 atab_size

Each allocation table entry has the following layout:

Offset Type Description
0 uint16_t file index, 0xffff – unallocated extent
2 uint16_t extent offset

A file that occupies N extents will have exactly N entries in the allocation table. The entries will have offset fields set to all numbers from 0 to N-1. All entries will have their file index field set accordingly.

Note that the maximum allowed file index is 65534 (0xfffe).

Formatting a volume

Let us assume we need to format a volume that has a certain number of sectors – nr_sectors. nr_files is another input parameter that defaults to 255.

The number of sectors left for the data extent area is nr_data_sectors_g = nr_sectors – 1 (superblock) – 1 (root record) – nr_files (file directory) – 32 (transaction journal) – 512 (maximum allocation table).

Now we need to select extent_size that lets map all nr_data_sectors_g given the limitation that nr_extents must be no more than 65536.

The minimum extent_size is 128 sectors. We start with this number and double it until extent_size * 65536 is at least nr_data_sectors_g. At this point, we fix the extent_size value, and calculate nr_extents_g as nr_data_sectors_g / extent_size and atab_size_g as (nr_extents_g+127)/128.

We may try another round of reshuffling to improve the estimate for nr_extents. For simplicity we just use nr_extents_g for nr_extents (and atab_size_g for atab_size). Most probably, some sectors will be left inaccessible at the end of the volume.

Now we proceed to writing a superblock using nr_files, nr_extents, and extent_size. The formatting operation must be atomic. To provide for atomicity we set trans_len field in the superblock to 0xffff for the duration of formatting. Then we initialize the root record. After that we write nr_files zeroed-out sectors that constitute a file directory. Then we skip 32 blocks for the transaction journal and write atab_size sectors with all bytes set to 0xff as the allocation table.

After rewriting superblock and setting trans_len to 0, the volume is ready for use.

How transactions work

Transactions provide atomicity to all filesystem metadata updates. Failure to complete the sequence of modifications to the file directory and the allocation table will in most cases leave the entire filesystem unusable.

To insure atomicity filesystem metadata is updated in two stages.

  1. Write blocks to unallocated space and populate the transaction journal

  2. Write blocks to intended places and reset the journal

Suppose we need to update a directory entry at sector 10 and write the allocation information to sector 33. First we find two unallocated sectors. Let’s assume their numbers are 1200 and 1201. Then we write a copy of the data intended for sector 10 to sector 1200, for sector 33 – to sector 1201. Then we write two entries to the transaction journal: 10 → 1200, 33 → 1201. Everything is ready to start the transaction. It is started by setting the trans_len field in the superblock to 2.

Now we proceed to writing data to their true locations, sectors 1200 and 1201. When we are done, we reset the trans_len field to zero to end the transaction.

In case a catastophic event happens during the first stage, before we set trans_len to 2 in the superblock. The data written to sectors 1200 and 1201 (and the transaction journal) is disregarded. If the event interrupts the second stage, the transaction journal is rerun restoring the filesystem to fully consistent state.

The filesystem should issue a ‘write barrier’ after each stage of the transaction.

File operations dissected

In-memory structures

The software layer that exposes a filesystem interface using GooFS may maintain in-memory representation of the file directory of the allocation table. Yet these representations must be fully synchronised. The layer must be ready for a ‘power off’ at any time.

Such approach assumes that the actual disk driver underneath the virtual block device performs caching and the database that uses GooFS does its own caching.

The filesystem should flush caches periodically (every second by default) to increase durability of writes.

List a directory

The directory listing may use in-memory metadata only. Access time should be updated in the root record.

Create a file
  • Check that the number of files is less than nr_files

  • Find an unoccupied slot in the file directory

  • Update the file directory and the root record, both on disk and in memory

Delete a file
  • Retrieve the list of extents occupied by the file

  • Mark the directory entry as unoccupied

  • Mark all file extents as unoccupied

  • Update both the file directory, the root record, and the allocation table, in memory and on disk

Read a file
  • Resolve read offset and size into a list of chunks, so that each chunk is within a single extent

  • Read the chunks and concatenate them

  • Update the access time of the file, both in memory and on disk

Change a file size
  • Determine if the file shrinks or grows

  • If the file shrinks, determine the extents that are no longer needed and mark them as unallocated

  • If the file grows, determine the number of additional extents it needs

  • If there is enough free space available, find unallocated extents and mark them as occupied by the file in the allocation table

  • Update the file modification time and the allocation table, both in memory and on disk

Write a file
  • Check if the write operation changes the size of the file, perform the change size procedure if needed

  • Resolve write offset and size into a list of chunks, so that each chunk belongs to a single extent

  • Write data to the chunks

  • Update the modification time of the file, both in memory and on disk

Known limitations

Writes may not be durable

Due to caching on various levels writes reach the disk media before the operation completes. This may lead to corruption of the file data in case of a ‘power off’ event.