Formatscaper

Formatscaper is a tool for generating an overview of the file format landscape composed of the uploads in our research data repository.

The aim here is to assist us with the task of digital preservation; specifically with the task of identifying uploaded files which are in danger of becoming "extinct".

Dependencies

Usage

Formatscaper is designed to build up the required context over its lifetime of use, so no particular setup is required. To get started, simply feed it a list of files to analyze in the expected format and let it run.

The most relevant file is formats.yaml, which is intended to receive limited manual tweaks over time (basically just setting the endangered flag for outdated formats). Every time formatscaper is run, it will update this file with formats that haven't been listed before. This means that the knowledge base about file formats gets extended over time.

By default, formatscaper will create a summary of the endangered files it encountered and print it to standard output. A more comprehensive summary of all encountered formats will be stored in a results file.

Since file format detection is effectively still based on heuristics, no identification procedure is infallible: sometimes, even the best guess is wrong. For such cases, we added a mechanism to override the result detected by siegfried on a per-file basis. More information on this can be found further down.

Example call, with a custom path for the sf binary:

pipenv run ./formatscaper.py --sf-binary "${GOPATH}/bin/sf" -i record-files.yaml

Rationale

Building file format landscapes

Tools for creating an overview of file format landscapes already exist, like c3po and fitsinn. However, they tend to come with more bells and whistles than we actually require, or are not quite ready for production use yet.

Using a single source of truth

Often, the file format identification is based on the output of several utilities. For instance, FITS wraps a number of individual tools and combines their output into a unified XML structure. Unfortunately, their results are often in disagreement with each other, which necessitates a de-conflicting strategy.

Instead of following this multi-tool approach, we've decided to rely on a single source of truth only - namely a tool called Siegfried. It seems to be competitive in its file format identification capabilities (under some definition of "competitive"). Also, it is being used as the source of truth by other software solutions in the space of digital preservation such as Archivematica and RODA. That's certainly good enough for us.

Detecting endangered formats

It seems to be generally agreed upon that a centrally managed "list of endangered file formats" would be a desirable thing to have. There have been attempts at creating centralized registries for file formats, for example PRONOM, GDFR and UDFR, and the "Just Solve the File Format Problem" wiki. Out of these, PRONOM is the most promising candidate. It even offers a field for the "risk" per format. However, this field does not seem to be populated for any of the registered formats. Unfortunately, this lack of available data leaves us little choice but to do it ourselves.

There is no way for us to know all the formats that exist in the world - but luckily, this is also not required! We only have to know about the formats that are part of our format landscape, which is a much more manageable task. Thus, we use a "local" list of file formats which is extended every time a new format is encountered. We manually review this list periodically and annotate formats with a hint about their endangerment status.

Storing the information outside of Invenio

Invenio provides fields for storing information about the format for each file. However, we often receive archives such as ZIP files which of course contain a series of other files. Storing information about the formats of these nested files is not a standard use case and thus there is no obvious or generally agreed-upon way to do it (yet).

Because we try not to extend the semantics of available constructs with non-standard custom meaning, we decided to keep this information external instead.

Example files

Input

The input for formatscaper needs to be a list of objects (in YAML format) describing the context of each file to investigate. This includes the URI of the file, its original file name (which gets discarded by Invenio), and the record which the file is a part of.

- filename: hosts
  uri: /etc/hosts
  record: 1234-abcd

- filename: FS Table
  uri: /etc/fstab
  record: abcd-1234

- filename: researchdata.zip
  uri: /mnt/data/de/ad/be/ef/data
  record: abcd-1234
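
As a rough illustration of the structure above, the following sketch checks that each entry carries the three expected keys. The function name `check_input_entries` is hypothetical and not part of formatscaper; the entries are given as plain Python data, as they would look after loading the YAML file.

```python
# Keys that every input entry is expected to provide (taken from the example above).
REQUIRED_KEYS = ("filename", "uri", "record")


def check_input_entries(entries):
    """Return a list of (index, missing_keys) tuples for incomplete entries."""
    problems = []
    for index, entry in enumerate(entries):
        # A key counts as missing if it is absent or has an empty value.
        missing = [key for key in REQUIRED_KEYS if not entry.get(key)]
        if missing:
            problems.append((index, missing))
    return problems


entries = [
    {"filename": "hosts", "uri": "/etc/hosts", "record": "1234-abcd"},
    {"filename": "FS Table", "uri": "/etc/fstab"},  # "record" is missing
]
print(check_input_entries(entries))  # [(1, ['record'])]
```

Running a check like this before a long scan may help to catch malformed input early.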

Formats file

The formats file (e.g. formats.yaml) contains information about previously encountered file formats and their endangerment status. Some context (like a human-readable name and MIME type) per format is also provided here, primarily to make it more understandable for operators.

- endangered: false
  mime: text/plain
  name: Plain Text File
  puid: x-fmt/111
- endangered: false
  mime: application/zip
  name: ZIP Format
  puid: x-fmt/263
- endangered: true
  mime: null
  name: Adobe Illustrator CC 2020 Artwork
  puid: fmt/1864
- endangered: false
  mime: application/postscript
  name: Encapsulated PostScript File Format
  puid: fmt/124
- endangered: false
  mime: application/pdf
  name: Acrobat PDF 1.4 - Portable Document Format
  puid: fmt/18
- endangered: false
  mime: image/svg+xml
  name: Scalable Vector Graphics
  puid: fmt/92
- endangered: true
  mime: null
  name: null
  puid: UNKNOWN

The "primary key" for file formats is the PRONOM Persistent Unique Identifier (PUID).

Note that formatscaper bases its detection of format endangerment purely on this file. Whenever a new (previously unlisted) format is encountered, it will be added to this list. The file is rewritten every time formatscaper is run, so extra information like comments will be discarded.
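
The update behaviour described above can be sketched roughly as follows: formats are keyed by PUID, known entries are left untouched, and previously unseen ones are appended. This is an illustration only, not formatscaper's actual code; the name `merge_formats` and the default of `endangered: false` for new formats are assumptions.

```python
def merge_formats(known, detected):
    """Append formats with unseen PUIDs to the known list, keeping existing entries."""
    by_puid = {fmt["puid"]: fmt for fmt in known}
    for fmt in detected:
        if fmt["puid"] not in by_puid:
            # Assumption: new formats start out as "not endangered" until reviewed.
            by_puid[fmt["puid"]] = {"endangered": False, **fmt}
    # Dicts preserve insertion order, so known formats come first.
    return list(by_puid.values())


known = [
    {"puid": "x-fmt/111", "name": "Plain Text File", "mime": "text/plain", "endangered": False},
    {"puid": "fmt/1864", "name": "Adobe Illustrator CC 2020 Artwork", "mime": None, "endangered": True},
]
detected = [
    {"puid": "fmt/1864", "name": "Adobe Illustrator CC 2020 Artwork", "mime": None},
    {"puid": "fmt/18", "name": "Acrobat PDF 1.4 - Portable Document Format", "mime": "application/pdf"},
]
merged = merge_formats(known, detected)
```

In this sketch, the manually set flag on `fmt/1864` survives the merge, while `fmt/18` is added as a new entry awaiting review.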

Results

The results file (e.g. results.yaml) contains information about each investigated file and its identified format, along with a note about its endangerment status.

- filename: /etc/hosts
  format:
    endangered: false
    mime: text/plain
    name: Plain Text File
    puid: x-fmt/111
  record: 1234-abcd
- filename: /etc/environment
  format:
    endangered: false
    mime: text/plain
    name: Plain Text File
    puid: x-fmt/111
  record: 1234-abcd
- filename: /mnt/data/de/ad/be/ef/data
  format:
    endangered: false
    mime: application/zip
    name: ZIP Format
    puid: x-fmt/263
  record: abcd-1234
- filename: /mnt/data/de/ad/be/ef/data#README.txt
  format:
    endangered: false
    mime: text/plain
    name: Plain Text File
    puid: x-fmt/111
  record: abcd-1234
- filename: /mnt/data/de/ad/be/ef/data#results.csv
  format:
    endangered: false
    mime: text/csv
    name: Comma Separated Values
    puid: x-fmt/18
  record: abcd-1234

Note that the contents of the ZIP archive are inspected as well, with # as the delimiter between the archive's filename and the contained file's name.
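
A small sketch of how such a combined name could be taken apart again. It assumes that the first `#` separates the archive from its member and that the archive path itself contains no `#`; `split_archive_member` is a hypothetical helper, not part of formatscaper.

```python
def split_archive_member(filename):
    """Split 'archive#member' into (archive, member); member is None for plain files."""
    archive, delimiter, member = filename.partition("#")
    # partition() returns an empty delimiter if no "#" was found.
    return (archive, member if delimiter else None)


print(split_archive_member("/mnt/data/de/ad/be/ef/data#README.txt"))
print(split_archive_member("/etc/hosts"))
```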

Known results

The "known results" file can be used to override the detected file format information from siegfried per file (by filename). The structure of this file is very similar to that of the usual results file described above, with a few minor differences.

Each entry can specify whether or not the format is safe, which will be taken into consideration when reporting "endangered" files. Further, it can optionally provide information about the actual file format. If present, this information will override the format information as reported by siegfried.

Example:


- filename: /mnt/data/de/ad/be/ef/data#something.idk
  format:
    puid: fmt/729
    name: SQLite Database File Format
    mime: application/x-sqlite3
    endangered: false

- filename: /mnt/data/de/ad/be/ef/data#anotherthing.bin
  format:
    puid: fmt/899
    name: Windows Portable Executable
    mime: application/vnd.microsoft.portable-executable
    endangered: true
  safe: true

- filename: /mnt/data/de/ad/be/ef/data#program.exe
  safe: true

This example will override the file format detected by siegfried with the supplied values for the first two files. For the second and third files, it will also mark the files as explicitly "safe", which will prevent formatscaper from reporting them as "endangered", even if their format is otherwise known to be.
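
The override semantics can be sketched as follows. This is a simplified illustration under assumptions, not formatscaper's implementation: the function `apply_known_results` and the `report` key in its output are hypothetical, and the data is given as plain Python structures.

```python
def apply_known_results(results, known_results):
    """Overlay per-file overrides on detected results and decide what to report."""
    overrides = {entry["filename"]: entry for entry in known_results}
    merged = []
    for result in results:
        override = overrides.get(result["filename"], {})
        # A supplied format replaces the detected one entirely.
        fmt = override.get("format", result["format"])
        safe = override.get("safe", False)
        merged.append({
            **result,
            "format": fmt,
            "safe": safe,
            # Report a file only if its format is endangered and it is not marked safe.
            "report": bool(fmt["endangered"]) and not safe,
        })
    return merged


detected = [
    {"filename": "data#something.idk", "record": "abcd-1234",
     "format": {"puid": "UNKNOWN", "name": None, "mime": None, "endangered": True}},
    {"filename": "data#program.exe", "record": "abcd-1234",
     "format": {"puid": "fmt/899", "name": "Windows Portable Executable",
                "mime": "application/vnd.microsoft.portable-executable", "endangered": True}},
    {"filename": "data#drawing.ai", "record": "abcd-1234",
     "format": {"puid": "fmt/1864", "name": "Adobe Illustrator CC 2020 Artwork",
                "mime": None, "endangered": True}},
]
known = [
    {"filename": "data#something.idk",
     "format": {"puid": "fmt/729", "name": "SQLite Database File Format",
                "mime": "application/x-sqlite3", "endangered": False}},
    {"filename": "data#program.exe", "safe": True},
]
final = apply_known_results(detected, known)
```

Here only `data#drawing.ai` would still be reported: the first file has its format replaced by a non-endangered one, and the second is explicitly marked as safe.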

Generating an input file from Invenio

The required information is relatively straightforward to generate using invenio shell:

import yaml
from invenio_rdm_records.proxies import current_rdm_records_service as svc

# get all (published) records in the system
rc = svc.record_cls
recs = [rc(rm.data, model=rm) for rm in rc.model_cls.query.all()]

# get the expected structure from the records
record_files = [
    {"record": r["id"], "filename": fn, "uri": entry.file.file.uri}
    for r in recs
    for fn, entry in r.files.entries.items()
    if r.files.entries
]

# serialize the information as YAML file
with open("record-files.yaml", "w") as f:
    yaml.dump(record_files, f)

The above script simply lists all files associated with any published record, but does not consider any unpublished drafts. Such changes are straightforward to implement, though.

Filtering results

To filter results in the shell, you can use the command yq (which is a jq wrapper for YAML documents).

Show only the endangered files:

yq 'map(select(.format.endangered == true))' results.yaml

Filtering results per record:

yq 'map(select(.record == "<RECORD-ID>"))' results.yaml
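
For anything beyond simple filters, the same selection can be done in Python once the results are loaded (for instance with `yaml.safe_load`). The sketch below works on plain Python data and groups endangered files by record; `endangered_by_record` is a hypothetical helper, not part of formatscaper.

```python
from collections import defaultdict


def endangered_by_record(results):
    """Map each record ID to the filenames of its endangered files."""
    grouped = defaultdict(list)
    for entry in results:
        if entry["format"]["endangered"]:
            grouped[entry["record"]].append(entry["filename"])
    return dict(grouped)


results = [
    {"filename": "/etc/hosts", "record": "1234-abcd",
     "format": {"puid": "x-fmt/111", "endangered": False}},
    {"filename": "/mnt/data/de/ad/be/ef/data#logo.ai", "record": "abcd-1234",
     "format": {"puid": "fmt/1864", "endangered": True}},
]
print(endangered_by_record(results))
```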