Seafile data structure

Roman Marakulin
Jul 20, 2022

Introduction

In my previous story, Your cloud storage at home, I compared different cloud solutions and chose the fastest one: seafile. It still works great, but it bothered me that I didn't understand the structure of its storage. If you take a look at the seafile-data/storage folder, you won't find your files, but they are there. So I was curious how seafile stores files and folders, and whether there is a way to recover files in case the cloud gets damaged and I'm left with just this seafile-data folder alone.

There are two goals that I wanted to achieve:

  1. Understand the structure of the storage
  2. Write a .py script to restore files (to get back the files with their names and the directory tree structure)

So, let's start our investigation! A disclaimer: I investigated version 9.0.2 of seafile. The structure may change in the future.

TLDR: You can check the algorithm here: https://github.com/awant/seafile_data_recovery

An overview

First, I decided to take a closer look at the directories that I have:

Judging by the names (I’m just trying to guess here):

  1. the commits folder should be for all changes, such as uploading new files, removing them, and renaming them
  2. fs may be responsible for storing the directory tree
  3. blocks is probably in charge of storing the actual data, the blocks of files

Also, it is hard not to notice that the folders in commits, fs, and blocks have the same names (with one exception: the commits folder contains two additional folders). So these folders have to be connected.

Dive deeper

I started googling for recovery tools for seafile folders and found that there is a script, seaf-fsck.sh, in the main seafile-server folder that is responsible for recovery. This was a great entry point for me. It led me to the sources on GitHub (I'm so glad that seafile is open source!), and from there I was able to trace the whole recovery process.

Repos

Let’s start with an example. My seafile cloud storage directory tree (from web ui):

There are 4 top-level directories: Documents, My Library, MyNewLibrary, and one_another_library. These libraries (that is what they are called in the web UI) are actually repos internally, and every library has a repo_id. The folders that I listed earlier correspond to them exactly:

Each of them can be recovered independently.

Every repo folder (whether under commits, fs, or blocks) contains subfolders with 2-character names, and inside each of them there are files. For example:

These two names (the folder name and the file name) together form a single object_id (more on that later):

Every single file in these 3 folders is addressed by a repo_id and an object_id.
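To make the addressing concrete, here is a small Python helper (my own sketch; the storage root path is an assumption based on my installation) that turns a repo_id and an object_id into the path of the underlying file:

import os

STORAGE_ROOT = "seafile-data/storage"  # adjust to your installation

def object_path(folder, repo_id, object_id):
    # folder is "commits", "fs" or "blocks"; the first 2 characters
    # of the object_id name the subfolder, the rest name the file
    return os.path.join(STORAGE_ROOT, folder, repo_id,
                        object_id[:2], object_id[2:])

print(object_path("fs", "52ad7c5f-403b-4949-a12a-ed1a5a537d69",
                  "fafcea93a3343f3db91adc212f58c03fec2e437c"))
# seafile-data/storage/fs/52ad7c5f-403b-4949-a12a-ed1a5a537d69/fa/fcea93a3343f3db91adc212f58c03fec2e437c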

So let’s dive deeper into each folder.

Commits

The commits folder serves to log actions performed on the storage.

Every file (remember, they all follow the path pattern commits/<repo_id>/<object_id[:2]>/<object_id[2:]>) contains JSON, and you can easily preview it with this one-liner from the command line:

python -c "import json; print(json.dumps(json.load(open('<filepath>', 'r')), indent=4))"

For example, from one of my files I got:

  1. The commit_id matches the object_id (from the file path)
  2. root_id is an object_id, which references the root folder in fs
  3. description records the corresponding action. A couple more examples:
    - Added directory "folder2"
    - "my_avatar.png"
    - Renamed "9.jpg"
  4. parent_id refers to the previous commit (it is also an object_id)
  5. repo_name is the name of the top-level folder
  6. There is a creator_name field to separate folders by user
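As an illustration, here is a short sketch that loads a commit object and prints the fields above. It reuses the object_path helper from the earlier sketch; <commit_id> is a placeholder for one of your commit object_ids:

import json

def read_commit(repo_id, commit_id):
    # commit objects are stored as plain, uncompressed JSON
    with open(object_path("commits", repo_id, commit_id), "r") as f:
        return json.load(f)

commit = read_commit("52ad7c5f-403b-4949-a12a-ed1a5a537d69", "<commit_id>")
for field in ("commit_id", "root_id", "description", "parent_id",
              "repo_name", "creator_name", "ctime"):
    print(field, "=", commit.get(field))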

FS

The fs folder stores the tree structure of the folders and the files inside them.

The content of these files is compressed. You can decompress it using the zlib library. A Python one-liner for this could be:

python3 -c "import zlib; import json; print(json.dumps(json.loads(zlib.decompress(open('<filepath>', 'rb').read())), indent=4))"

I have these outputs:

  1. b8/8ab96740ef53249b9d21fb3fa28050842266ba
  2. fa/fcea93a3343f3db91adc212f58c03fec2e437c

As we saw earlier, the commit pointed to the object_id = fafcea93a3343f3db91adc212f58c03fec2e437c.

  1. dirents defines the object_ids of the folders and files inside this directory
  2. mode is used to check whether an entry is a file or a folder
  3. block_ids lists object_ids from the blocks folder
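Here is a sketch of reading an fs object with these fields, again reusing object_path. Interpreting the mode field with the standard Unix mode bits via stat.S_ISDIR is my assumption, but it matched my data:

import json
import stat
import zlib

def read_fs_obj(repo_id, object_id):
    # fs objects are zlib-compressed JSON
    with open(object_path("fs", repo_id, object_id), "rb") as f:
        return json.loads(zlib.decompress(f.read()))

obj = read_fs_obj("52ad7c5f-403b-4949-a12a-ed1a5a537d69",
                  "fafcea93a3343f3db91adc212f58c03fec2e437c")
for dirent in obj.get("dirents", []):
    kind = "folder" if stat.S_ISDIR(dirent["mode"]) else "file"
    print(kind, dirent["name"], dirent["id"])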

Blocks

The blocks folder is just storage for the bytes of files. One file can be divided into several blocks (files in the blocks folder). To restore a file, you merge the blocks of the desired repo according to the block_ids field of the corresponding file object in the fs folder.

Let's take an example. For the repo 52ad7c5f-403b-4949-a12a-ed1a5a537d69 I have only one block, 86/63a70ef30a5987b440a621483af2044bae1e0a, which is the content of the seafile-tutorial.doc file.
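In code, restoring a single file is just concatenating its blocks in order (a minimal sketch; block_ids comes from the file's fs object):

def restore_file(repo_id, block_ids, out_path):
    # a file's content is the concatenation of its blocks, in block_ids order
    with open(out_path, "wb") as out:
        for block_id in block_ids:
            with open(object_path("blocks", repo_id, block_id), "rb") as block:
                out.write(block.read())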

Recovering seafile-data

Now that we have covered all the folders, it is time to discuss the data recovery algorithm:

  1. Extract all repo_ids (the folder names in the commits folder). These are our top-level, independent folders: each corresponds to a library and a user
  2. Perform the following steps for every repo_id separately
  3. Find the last commit (by ctime) that has a non-zero root_id field. It is the latest change, and it points to the current root folder in the fs folder
  4. Determine the user from this commit (creator_name); it is needed to separate the folders of different users
  5. Recursively traverse (next steps) the fs folder, starting from object_id = root_id of the last commit
  6. Get dirents from the file content addressed by object_id, and for each of them check whether it is a file or a folder (from the mode field)
  7. If it is a folder: create a folder named dirent['name'] in the current path, go into it, and make a recursive call with object_id = dirent['id']. If it is a file: get block_ids from the content of the fs object with object_id = dirent['id'], create a file named dirent['name'] under the current path, and fill it with the content of the blocks corresponding to block_ids (see the sketch after this list)
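Putting steps 5-7 together, here is a minimal sketch of the recursive traversal, built on the read_fs_obj and restore_file helpers from the sketches above (validations and edge cases are omitted, just as in my scripts):

import os
import stat

def restore_tree(repo_id, object_id, path):
    # object_id addresses an fs directory object; recreate its subtree under path
    os.makedirs(path, exist_ok=True)
    obj = read_fs_obj(repo_id, object_id)
    for dirent in obj.get("dirents", []):
        target = os.path.join(path, dirent["name"])
        if stat.S_ISDIR(dirent["mode"]):
            # a folder: create it and recurse into it
            restore_tree(repo_id, dirent["id"], target)
        else:
            # a file: its fs object holds the list of block_ids
            file_obj = read_fs_obj(repo_id, dirent["id"])
            restore_file(repo_id, file_obj.get("block_ids", []), target)

# restore_tree(repo_id, last_commit["root_id"], last_commit["repo_name"])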

I’ve created a couple of python scripts to sum up this research:

  1. seafile_data_recovery.py recovers the files: basically, it collects and merges all the block_ids of the relevant files
  2. seafile_data_profiler.py inspects the content of folders and files. It helped me write the previous script

My goal was to grasp the structure of the folders and write a simple script for my own purposes, so I skipped various validations and checks to keep the code readable. You can easily extend it if you want.

Overall, the structure turned out to be quite simple, and it mimics how files are actually stored in a file system. I liked that there is easy access to filesystem changes (commits).
