Update formatscaper according to identified requirements

Merged Moser, Maximilian requested to merge mm/rework into master
1 file changed (+35, -79)
@@ -16,21 +16,24 @@ The aim here is to assist us with the task of digital preservation; specifically
Formatscaper is designed to build up the required context over its lifetime of use, so no particular setup is required.
To get started, simply feed it a list of files to analyze in the [expected format](#input) and let it run.
The most relevant file is [`formats.yaml`](#formats-file), which is intended to receive limited manual tweaks over time (basically just setting the `endangered` flag for outdated formats).
Every time formatscaper is run, it will update this file with formats that haven't been listed before.
This means that the knowledge base about file formats gets extended over time.
With every run, the database will be extended with information about file formats that haven't been encountered before.
This effectively builds up a knowledge base of the file formats that are actually present in the system (i.e. the file format landscape).
By default, formatscaper will create a summary of the endangered files it encountered and print it to standard output.
A more comprehensive summary for all encountered formats will be stored in a [results file](#results).
*Note* that whenever a new format is encountered, formatscaper will not assign a rating for the *risk of obsolescence* of this format.
This task is instead left to the operator, e.g. via the `resultman` command.
Since file format detection is effectively still based on heuristics, no identification procedure is infallible - sometimes, even the best guess is wrong.
For such cases, we added a mechanism to override the result detected by siegfried on a per-file basis.
More information on this can be found [further down](#known-results).
However, the **risk** of datasets becoming unusable in the future isn't purely based on the formats in question.
Some of the files in a dataset may be stored in a very outdated or problematic format, but the overall dataset can still be reused without them.
One example would be temporary files that have been added to the dataset by accident.
Thus, each scanned file also has an *impact* assessment attached to it, which again needs to be provided by the operator.
This *impact* assessment can be used together with the *risk of obsolescence* for its format to calculate the overall *risk assessment*.
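The text above does not state how the two ratings are combined. As a minimal sketch, assuming both the *risk of obsolescence* and the *impact* are integers from 1 to 5 and that the overall *risk assessment* is simply their product, the calculation could look like this. The function name `overall_risk`, the value range, and the multiplication rule are illustrative assumptions, not formatscaper's actual implementation.

```python
def overall_risk(format_risk: int, impact: int) -> int:
    """Combine a format's risk of obsolescence with a file's impact.

    Assumption: both ratings range from 1 to 5 and are multiplied.
    formatscaper itself may use a different rule.
    """
    for name, value in (("format_risk", format_risk), ("impact", impact)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return format_risk * impact


# A low-risk format in a moderately important file.
print(overall_risk(format_risk=1, impact=2))
# A high-risk format in a crucial file.
print(overall_risk(format_risk=5, impact=5))
```

Under this assumed rule, a temporary file in an obsolete format (high risk, low impact) scores lower than a central data file in the same format.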
*Note*: Since file format detection is effectively still based on heuristics, no identification procedure is infallible - sometimes, even the best guess is wrong.
For such cases, we added a mechanism to override the result detected by siegfried on a per-file basis in the database.
Example call, with a custom path for the `sf` binary:
```sh
-pipenv run ./formatscaper.py --sf-binary "${GOPATH}/bin/sf" -i record-files.yaml
+formatscaper --sf-binary "${GOPATH}/bin/sf" -i record-files.yaml
```
@@ -101,125 +104,94 @@ This includes the URI of the file, its original file name (which gets discarded
### Formats file
The formats file (e.g. `formats.yaml`) contains information about previously encountered file formats and their endangerment status.
Information about the encountered file formats can be exported in YAML format (e.g. as `formats.yml`).
This primarily comprises a unique identifier (PUID) and the *risk of obsolescence* (i.e. the *probability of the format dying out*).
Some context (like a human-readable name and MIME type) per format is also provided here, primarily to make it more understandable for operators.
```yaml
-- endangered: false
+- risk: 1
   mime: text/plain
   name: Plain Text File
   puid: x-fmt/111
-- endangered: false
+- risk: 1
   mime: application/zip
   name: ZIP Format
   puid: x-fmt/263
-- endangered: true
+- risk: 2
   mime: null
   name: Adobe Illustrator CC 2020 Artwork
   puid: fmt/1864
-- endangered: false
+- risk: 3
   mime: application/postscript
   name: Encapsulated PostScript File Format
   puid: fmt/124
-- endangered: false
+- risk: 2
   mime: application/pdf
   name: Acrobat PDF 1.4 - Portable Document Format
   puid: fmt/18
-- endangered: false
+- risk: 1
   mime: image/svg+xml
   name: Scalable Vector Graphics
   puid: fmt/92
-- endangered: true
+- risk: 5
   mime: null
   name: null
   puid: UNKNOWN
```
The "primary key" for file formats is the PRONOM Persistent Unique Identifier (PUID).
Note that formatscaper bases its detection of format endangerment purely on this file.
Whenever a new (previously unlisted) format is encountered, it will be added to this list.
The file is rewritten every time formatscaper is run, so extra information like comments will be discarded.
The PRONOM Persistent Unique Identifier (PUID) can be used to uniquely identify formats.
It can also be used to construct a URL (of the shape `https://www.nationalarchives.gov.uk/PRONOM/${puid}`) pointing to additional information about the format.
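As a small illustration of the URL shape given above, the following Python sketch builds the PRONOM link for a PUID. The helper name `pronom_url` is hypothetical, and returning `None` for the `UNKNOWN` placeholder is an assumption made here, since such an entry has no PRONOM record to point to.

```python
from typing import Optional

# Base URL taken from the URL shape described in the text.
PRONOM_BASE_URL = "https://www.nationalarchives.gov.uk/PRONOM"


def pronom_url(puid: str) -> Optional[str]:
    """Return the PRONOM URL for a PUID, or None for unidentified formats.

    Assumption: the placeholder PUID "UNKNOWN" has no PRONOM page.
    """
    if puid == "UNKNOWN":
        return None
    return f"{PRONOM_BASE_URL}/{puid}"


print(pronom_url("fmt/18"))
print(pronom_url("x-fmt/263"))
```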
### Results
The results file (e.g. `results.yaml`) contains information about each investigated file and its identified format, along with notes about its endangerment status.
The file format identification results for each file (along with their risk assessment) can also be exported into a YAML file (e.g. `results.yml`).
The resulting file contains information about each investigated file and its identified format, along with notes about its risk assessment.
```yaml
 - filename: /etc/hosts
   format:
-    endangered: false
+    risk: 1
     mime: text/plain
     name: Plain Text File
     puid: x-fmt/111
   record: 1234-abcd
+  impact: 2
 - filename: /etc/environment
   format:
-    endangered: false
+    risk: 1
     mime: text/plain
     name: Plain Text File
     puid: x-fmt/111
   record: 1234-abcd
+  impact: 3
 - filename: /mnt/data/de/ad/be/ef/data
   format:
-    endangered: false
+    risk: 1
     mime: application/zip
     name: ZIP Format
     puid: x-fmt/263
   record: abcd-1234
+  impact: 4
 - filename: /mnt/data/de/ad/be/ef/data#README.txt
   format:
-    endangered: false
+    risk: 1
     mime: text/plain
     name: Plain Text File
     puid: x-fmt/111
   record: abcd-1234
+  impact: 3
 - filename: /mnt/data/de/ad/be/ef/data#results.csv
   format:
-    endangered: false
+    risk: 1
     mime: text/csv
     name: Comma Separated Values
     puid: x-fmt/18
   record: abcd-1234
+  impact: 5
```
Note that the contents of the ZIP archive are inspected as well, with `#` as the delimiter between the archive's filename and the contained file's name.
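To make the `#` convention concrete, here is a minimal Python sketch that splits such a result filename into the archive path and the contained file's name. The helper `split_archive_path` is hypothetical, and it assumes that the first `#` in the string is the delimiter; a path that itself contains `#` would be ambiguous under this assumption.

```python
from typing import Optional, Tuple


def split_archive_path(filename: str, delimiter: str = "#") -> Tuple[str, Optional[str]]:
    """Split 'archive#member' into (archive, member).

    Assumption: the first occurrence of the delimiter separates the
    archive's filename from the contained file's name. For plain files
    (no delimiter), the member part is None.
    """
    archive, sep, member = filename.partition(delimiter)
    return archive, (member if sep else None)


print(split_archive_path("/mnt/data/de/ad/be/ef/data#README.txt"))
print(split_archive_path("/etc/hosts"))
```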
### Known results
The "known results" file can be used to override the detected file format information from siegfried per file (by filename).
The structure of this file is very similar to that of the usual results file described above, with a few minor differences.
Each entry can specify whether or not the format is `safe`, which will be taken into consideration when reporting "endangered" files.
Further, it can optionally provide information about the actual file `format`.
If present, this information will override the format information as reported by siegfried.
Example:
```yaml
- filename: /mnt/data/de/ad/be/ef/data#something.idk
  format:
    puid: fmt/729
    name: SQLite Database File Format
    mime: application/x-sqlite3
    endangered: false
- filename: /mnt/data/de/ad/be/ef/data#anotherthing.bin
  format:
    puid: fmt/899
    name: Windows Portable Executable
    mime: application/vnd.microsoft.portable-executable
    endangered: true
  safe: true
- filename: /mnt/data/de/ad/be/ef/data#program.exe
  safe: true
```
This example will override the file format detected by siegfried with the supplied values for the first two files.
For the second and third files, it will also mark them as explicitly "safe", which will prevent formatscaper from reporting them as "endangered" even if their format is otherwise known to be.
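The override behaviour described in this section can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not formatscaper's actual code: the function names `apply_known_results` and `is_reported_endangered` are hypothetical, entries are plain dictionaries as they would appear after loading the YAML files, and a missing `safe` key is assumed to mean "not safe".

```python
def apply_known_results(detected, known):
    """Merge "known results" over the detected results, matched by filename."""
    known_by_name = {entry["filename"]: entry for entry in known}
    merged = []
    for result in detected:
        override = known_by_name.get(result["filename"], {})
        entry = dict(result)
        # A supplied format replaces the one reported by siegfried.
        if "format" in override:
            entry["format"] = override["format"]
        # Assumption: files are not "safe" unless explicitly marked.
        entry["safe"] = override.get("safe", False)
        merged.append(entry)
    return merged


def is_reported_endangered(entry):
    """A file is reported only if its format is endangered and it is not safe."""
    return entry["format"]["endangered"] and not entry["safe"]


detected = [
    {
        "filename": "/mnt/data/de/ad/be/ef/data#program.exe",
        "format": {"puid": "fmt/899", "endangered": True},
        "record": "abcd-1234",
    },
]
known = [{"filename": "/mnt/data/de/ad/be/ef/data#program.exe", "safe": True}]

for entry in apply_known_results(detected, known):
    print(entry["filename"], is_reported_endangered(entry))
```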
## Generating an input file from Invenio
The required information is relatively straightforward to generate using `invenio shell`:
@@ -246,19 +218,3 @@ with open("record-files.yaml", "w") as f:
The above script simply lists all files associated with any published record, but does not consider any unpublished drafts.
Such changes are very straightforward to implement though.
## Filtering results
To filter results in the shell, you can use the command [`yq`](https://github.com/mikefarah/yq) (a standalone YAML processor with syntax similar to that of [`jq`](https://github.com/jqlang/jq)).
Show only the endangered files:
```sh
yq 'map(select(.format.endangered == true))' results.yaml
```
Filtering results per record:
```sh
yq 'map(select(.record == "<RECORD-ID>"))' results.yaml
```