The recurring idea of Erlang on Xen is to keep an interface and change its implementation according to the vision of a truly scalable cloud. One such interface is a ‘filesystem’. It was fully reimplemented to expose mostly synthetic files and files accessible over the network. Contrary to the industry custom, our current implementation does not embrace locally-attached block devices. For such devices, a key/value datastore is being designed, that should expose a different interface – ‘ets/dets’.
We strive to support all important Erlang applications with our LING VM. Adaptation of ejabberd proved that we need to support at least a mockery of a filesystem over a writable block device. Such mockery is the GooFS filesystem described here.
A GooFS formatted block device usually serves as a backend for a mnesia database that keeps data in several dets tables mapped one-to-one to files. For example, the mnesia database of ejabberd uses 29 files all created in a single directory. The longest file name is 23 characters. This directly translates into design constraints of the new filesystem.
A flat filesystem, no directories
A limited number of files, at least 256
A file name may be long, at least 256 chars
Must not get corrupted easily, should survive domain_poweroff()
Must be easy to implement, in a few days
Note that GooFS is not the only modern filesystem, which is flat. See Amazon S3.
Every GooFS volume has 6 areas:
superblock
root record
file directory
transaction journal
allocation table
data extents
The superblock is a single sector (sector 0) that contains the general information about the volume.
The root record contains information about the root directory. The information is similar to what is kept for ordinary files. The root record occupies a single sector (sector 1).
The file directory contains information about all files that exist on the volume. Each directory entry occupies a single sector starting with sector 1. The length of the file directory – nr_files – is set upon formatting of the volume. It must be at least 255. Each file is addressed using an integer value, which is actually an index into the file directory. nr_files must be less then 65536.
The transaction journal contains information about the current outstanding transaction. It occupies exactly 32 blocks. Each journal entry is 16 bytes long limiting the transaction to 1024 blocks. The number of entries actually used – trans_len – is recorded in the superblock. Note that updating the entire allocation table requires writing to no more than 512 blocks. The transaction journal starts at sector 1 + nr_files.
The allocation table keeps track of chunks of disk space – extents – allocated to files. The size of a data extent – extent_size – is fixed upon formatting. A minimum value of extent_size is 128 sectors. The number of data extents – nr_extents – must be less than or equal to 65536. The allocation table has an entry for each extent. It starts at sector 1 + nr_files + 32. Each allocation table entry is 4 bytes long. Thus each sector holds 512/4 = 128 such entries. Hence the size of the allocation table – atab_size – is (nr_extents+127)/128.
The data extents area is where all file contents is stored. It immediately follows the allocation table. Thus the area starts at sector 1 + nr_files + 32 + atab_size.
Note that by increasing the extent_size value the filesystem may occupy volumes of almost any practical size. On average, a half of extent_size will be ‘wasted’ for each file.
Starts at | Size |
---|---|
0 | 1 |
The byte ordering here are later in the document is big-endian.
Offset | Type | Description |
---|---|---|
0 | uint16_t | nr_files |
2 | uint16_t | nr_extents |
4 | uint32_t | extent_size |
8 | uint64_t | 0x2f476f6f2f46532f, magic |
16 | uint16_t | version, must be 0x1 |
18 | uint16_t | trans_len, 0 – no outstanding transaction |
20 | – | UNUSED |
Starts at | Size |
---|---|
1 | 1 |
Offset | Type | Description |
---|---|---|
0 | uint32_t | version, incremented upon each modification |
4 | uint64_t | object unique id |
12 | uint32_t | mode, opaque |
16 | uint32_t | atime, access time, Unix timestamp |
20 | uint32_t | mtime, modification time, Unix timestamp |
Starts at | Size |
---|---|
2 | nr_files |
Each directory entry occupies a whole sector. The layout of the entry is as follows:
Offset | Type | Description |
---|---|---|
0 | uint16_t | 0 – free entry, 0xffff – occupied entry |
2 | uint64_t | file size |
10 | uint32_t | version, incremented upon each modification |
14 | uint64_t | file unique id |
22 | uint32_t | mode, opaque |
26 | uint32_t | atime, access time, Unix timestamp |
30 | uint32_t | mtime, modification time, Unix timestamp |
34 | uint16_t | name len |
36 | char[len] | name |
Note that the maximum file name length is 476.
Starts at | Size |
---|---|
2 + nr_files | 32 |
Each transaction log entry has the following layout:
Offset | Type | Description |
---|---|---|
0 | uint64_t | source sector |
8 | uint64_t | target sector |
The number of active elements in the transaction log – trans_len – is recorded in the superblock.
Starts at | Size |
---|---|
2 + nr_files + 32 | atab_size |
Each allocation table entry has the following layout:
Offset | Type | Description |
---|---|---|
0 | uint16_t | file index, 0xffff – unallocated extent |
2 | uint16_t | extent offset |
A file that occupies N extents will have exactly N entries in the allocation table. The entries will have offset fields set to all numbers from 0 to N-1. All entries will have their file index field set accordingly.
Note that the maximum allowed file index is 65534 (0xfffe).
Let us assume we need to format a volume that has a certain number of sectors – nr_sectors. nr_files is another input parameter that defaults to 255.
The number of sectors left for the data extent area is nr_data_sectors_g = nr_sectors – 1 (superblock) – 1 (root record) – nr_files (file directory) – 32 (transaction journal) – 512 (maximum allocation table).
Now we need to select extent_size that lets map all nr_data_sectors_g given the limitation that nr_extents must be no more than 65536.
The minimum extent_size is 128 sectors. We start with this number and double it until extent_size * 65536 is at least nr_data_sectors_g. At this point, we fix the extent_size value, and calculate nr_extents_g as nr_data_sectors_g / extent_size and atab_size_g as (nr_extents_g+127)/128.
We may try another round of reshuffling to improve the estimate for nr_extents. For simplicity we just use nr_extents_g for nr_extents (and atab_size_g for atab_size). Most probably, some sectors will be left inaccessible at the end of the volume.
Now we proceed to writing a superblock using nr_files, nr_extents, and extent_size. The formatting operation must be atomic. To provide for atomicity we set trans_len field in the superblock to 0xffff for the duration of formatting. Then we initialize the root record. After that we write nr_files zeroed-out sectors that constitute a file directory. Then we skip 32 blocks for the transaction journal and write atab_size sectors with all bytes set to 0xff as the allocation table.
After rewriting superblock and setting trans_len to 0, the volume is ready for use.
Transactions provide atomicity to all filesystem metadata updates. Failure to complete the sequence of modifications to the file directory and the allocation table will in most cases leave the entire filesystem unusable.
To insure atomicity filesystem metadata is updated in two stages.
Write blocks to unallocated space and populate the transaction journal
Write blocks to intended places and reset the journal
Suppose we need to update a directory entry at sector 10 and write the allocation information to sector 33. First we find two unallocated sectors. Let’s assume their numbers are 1200 and 1201. Then we write a copy of the data intended for sector 10 to sector 1200, for sector 33 – to sector 1201. Then we write two entries to the transaction journal: 10 → 1200, 33 → 1201. Everything is ready to start the transaction. It is started by setting the trans_len field in the superblock to 2.
Now we proceed to writing data to their true locations, sectors 1200 and 1201. When we are done, we reset the trans_len field to zero to end the transaction.
In case a catastophic event happens during the first stage, before we set trans_len to 2 in the superblock. The data written to sectors 1200 and 1201 (and the transaction journal) is disregarded. If the event interrupts the second stage, the transaction journal is rerun restoring the filesystem to fully consistent state.
The filesystem should issue a ‘write barrier’ after each stage of the transaction.
The software layer that exposes a filesystem interface using GooFS may maintain in-memory representation of the file directory of the allocation table. Yet these representations must be fully synchronised. The layer must be ready for a ‘power off’ at any time.
Such approach assumes that the actual disk driver underneath the virtual block device performs caching and the database that uses GooFS does its own caching.
The filesystem should flush caches periodically (every second by default) to increase durability of writes.
The directory listing may use in-memory metadata only. Access time should be updated in the root record.
Check that the number of files is less than nr_files
Find an unoccupied slot in the file directory
Update the file directory and the root record, both on disk and in memory
Retrieve the list of extents occupied by the file
Mark the directory entry as unoccupied
Mark all file extents as unoccupied
Update both the file directory, the root record, and the allocation table, in memory and on disk
Resolve read offset and size into a list of chunks, so that each chunk is within a single extent
Read the chunks and concatenate them
Update the access time of the file, both in memory and on disk
Determine if the file shrinks or grows
If the file shrinks, determine the extents that are no longer needed and mark them as unallocated
If the file grows, determine the number of additional extents it needs
If there is enough free space available, find unallocated extents and mark them as occupied by the file in the allocation table
Update the file modification time and the allocation table, both in memory and on disk
Check if the write operation changes the size of the file, perform the change size procedure if needed
Resolve write offset and size into a list of chunks, so that each chunk belongs to a single extent
Write data to the chunks
Update the modification time of the file, both in memory and on disk
Due to caching on various levels writes reach the disk media before the operation completes. This may lead to corruption of the file data in case of a ‘power off’ event.