Fileserver implementation

8 years ago · 63762cab74
parent ebb1892c74
commit 63762cab74
2 changed files with 34 additions and 36 deletions
--- a/Fileserver.md
+++ b/Fileserver.md
@@ -1,35 +0,0 @@
-# Fileserver
-
-The fileserver is a simple frontend to a disk storage space that contains auxiliary files for assignments, archives with job configurations, files submitted by users, and evaluation results. These files are the only ones the backend requires to run, so a dedicated fileserver makes it possible to test the backend separately. Also, one fileserver instance can be shared among multiple API instances (with the same broker), so common files do not need to be duplicated in each API instance.
-
-One exception is that important files that have the character of a database entry (but are not stored in the database because of their size) are stored directly in the filesystem of the API server. This does not diminish the benefit of a separate fileserver. From a security point of view, the fileserver should be completely isolated from the public internet to keep the data safe, while the API server must by its nature be public.
-
-For a description of the communication protocol used by the frontend 
-and workers, see the [Communication](#communication) chapter.
-
-
-## Description
-
-The storage is implemented in Python, using the Flask web framework. This
-particular implementation evolved from a simple mock fileserver we used in early
-stages of development. It proved to be very reliable, so we decided to keep the fileserver
-as a separate component instead of integrating this functionality into the main API.
-
-### Internal storage structure
-
-Fileserver stores its data in a configurable filesystem folder. This folder has 
-the following subfolders:
-
- `./submissions/<id>` -- folders that contain files submitted by users 
-  (student's solutions to assignments). `<id>` is an identifier received from 
-  the ReCodEx API.
- `./submission_archives/<id>.zip` -- ZIP archives of all submissions. These are 
-  created automatically when a submission is uploaded. `<id>` is an identifier 
-  of the corresponding submission.
- `./tasks/<subkey>/<key>` -- supplementary task files (e.g. test inputs and 
-  outputs). `<key>` is a hash of the file content (sha-1 is used) and `<subkey>` 
-  is its first letter (this is an attempt to prevent creating a flat directory 
-  structure).
-
-
-
--- a/Rewritten-docs.md
+++ b/Rewritten-docs.md
@@ -2629,7 +2629,40 @@ used.

 ## Fileserver

-@todo: stores particular data from frontend and backend, hashing, HTTP API
+The fileserver component provides shared storage between the frontend and the
+backend. It is written in Python 3 using the Flask web framework. The fileserver
+stores files in a configurable filesystem directory and provides file deduplication and HTTP access.
+To keep the stored data safe, the fileserver is not reachable from the public internet.
+
+### File deduplication
+
+File deduplication is implemented by storing files under the hashes of their
+content. This is done entirely inside the fileserver: plain files are
+uploaded to the fileserver, hashed, and saved, and the new filename is returned to
+the uploader.
+
+SHA-1 is used as the hash function because it is fast to compute and offers
+better collision resistance than MD5. Files with the same hash are
+treated as identical; no additional collision checks are performed. An accidental
+collision is, however, extremely unlikely.
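
The deduplication scheme described above can be sketched as follows. This is only a minimal illustration, not the fileserver's actual code: the function name `store_deduplicated` and the flat target directory are assumptions made for the example, and the real component performs this step behind its HTTP interface.

```python
import hashlib
import tempfile
from pathlib import Path


def store_deduplicated(content: bytes, storage_dir: Path) -> str:
    """Save content under the SHA-1 hex digest of its bytes and return that name.

    Hypothetical helper for illustration; not part of the real fileserver.
    """
    name = hashlib.sha1(content).hexdigest()
    target = storage_dir / name
    # Files with the same hash are treated as identical, so an existing
    # file is left untouched and no collision check is performed.
    if not target.exists():
        target.write_bytes(content)
    return name


# Uploading the same content twice yields one stored file and the same name.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = store_deduplicated(b"test input\n", root)
    second = store_deduplicated(b"test input\n", root)
    stored_files = sorted(p.name for p in root.iterdir())
```

Returning the hash as the new filename is what lets the uploader refer to the file later without the fileserver keeping a separate name index.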
+
+### Storage structure
+
+The fileserver stores its data in the following structure:
+
+- `./submissions/<id>/` -- folder that contains files submitted by users
+  (students' solutions to assignments). `<id>` is an identifier received from
+  the REST API.
+- `./submission_archives/<id>.zip` -- ZIP archives of all submissions. These are 
+  created automatically when a submission is uploaded. `<id>` is an identifier 
+  of the corresponding submission.
+- `./tasks/<subkey>/<key>` -- supplementary task files (e.g. test inputs and
+  outputs). `<key>` is the SHA-1 hash of the file content and `<subkey>` is the
+  first character of that hash (this avoids placing all task files in a single
+  flat directory).
+- `./results/<id>.zip` -- ZIP archive of results for the submission with
+  identifier `<id>`.
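
As a rough illustration of the layout listed above, the relative paths could be built like this. The helper names are hypothetical and chosen only for this sketch; the paths are relative to the configured storage directory, and the hex-digest form of the SHA-1 key is an assumption.

```python
import hashlib
from pathlib import PurePosixPath


def submission_dir(submission_id: str) -> PurePosixPath:
    # Folder with the files submitted by a user.
    return PurePosixPath("submissions") / submission_id


def submission_archive(submission_id: str) -> PurePosixPath:
    # ZIP archive of the submission with the same identifier.
    return PurePosixPath("submission_archives") / f"{submission_id}.zip"


def task_file(content: bytes) -> PurePosixPath:
    # <key> is the SHA-1 hex digest of the content (assumed encoding),
    # <subkey> is the first character of that digest.
    key = hashlib.sha1(content).hexdigest()
    return PurePosixPath("tasks") / key[0] / key


def result_archive(submission_id: str) -> PurePosixPath:
    # ZIP archive with the evaluation results of a submission.
    return PurePosixPath("results") / f"{submission_id}.zip"
```

With hexadecimal digests, the `<subkey>` level splits task files into at most 16 subdirectories.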
+

 ## Worker