From f0b74436a22438d5d66dce8db3b7748d2816483a Mon Sep 17 00:00:00 2001
From: Maximilian Moser <maximilian.moser@tuwien.ac.at>
Date: Mon, 19 Feb 2024 17:17:36 +0100
Subject: [PATCH] Add new section in the README about the "known results" mechanism

---
 formatscaper/README.md | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/formatscaper/README.md b/formatscaper/README.md
index 2428cb3..718f8f8 100644
--- a/formatscaper/README.md
+++ b/formatscaper/README.md
@@ -23,6 +23,10 @@ This means that the knowledge base about file formats gets extended over time.
 Per default, formatscaper will create a summary of endangered files it encountered and print it to standard out.
 A more comprehensive summary for all encountered formats will be stored in a [results file](#results).
 
+Since file format detection is effectively still based on heuristics, no identification procedure is infallible - sometimes, even the best guess is wrong.
+For such cases, we added a mechanism to override the result detected by siegfried on a per-file basis.
+More information on this can be found [further down](#known-results).
+
 Example call, with a custom path for the `sf` binary:
 
 ```sh
@@ -181,6 +185,41 @@ The results file (e.g. `results.yml`) contains information about each investigat
 
 Note that the contents of the ZIP archive are inspected as well, with `#` as the delimiter between the archive's filename and the contained file's name.
 
+### Known results
+
+The "known results" file can be used to override the file format information detected by siegfried on a per-file basis (matched by filename).
+The structure of this file is very similar to that of the usual results file described above, with a few minor differences.
+
+Each entry can specify whether or not the format is `safe`, which will be taken into consideration when reporting "endangered" files.
+Further, it can optionally provide information about the actual file `format`.
+If present, this information will override the format information reported by siegfried.
+
+Example:
+```yml
+
+- filename: /mnt/data/de/ad/be/ef/data#something.idk
+  format:
+    puid: fmt/729
+    name: SQLite Database File Format
+    mime: application/x-sqlite3
+    endangered: false
+
+- filename: /mnt/data/de/ad/be/ef/data#anotherthing.bin
+  format:
+    puid: fmt/899
+    name: Windows Portable Executable
+    mime: application/vnd.microsoft.portable-executable
+    endangered: true
+  safe: true
+
+- filename: /mnt/data/de/ad/be/ef/data#program.exe
+  safe: true
+```
+
+This example will override the file format detected by siegfried with the supplied values for the first two files.
+For the second and third files, it will also mark them as explicitly "safe", which will prevent formatscaper from reporting them as "endangered", even if their format is otherwise known to be endangered.
+
+
 ## Generating an input file from Invenio
 
 The required information is relatively straight-forward to generate using `invenio shell`:
-- 
GitLab
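
As a rough illustration of the override semantics described in the new "Known results" section, the following Python sketch merges known-results entries over detected results by filename. It is not formatscaper's actual implementation: the function names `apply_known_results` and `is_reported_endangered`, the in-memory list-of-dicts representation, and the "detected" values for the sample files are all assumptions made for this example.

```python
def apply_known_results(detected, known):
    """Overlay hypothetical 'known results' entries on detected results.

    Both arguments are lists of dicts shaped like the YAML entries in the
    README example; entries are matched by their 'filename' value.
    """
    overrides = {entry["filename"]: entry for entry in known}
    merged = []
    for result in detected:
        override = overrides.get(result["filename"], {})
        entry = dict(result)  # shallow copy, the detected input stays untouched
        if "format" in override:
            # A supplied format replaces the detected one entirely.
            entry["format"] = override["format"]
        if "safe" in override:
            entry["safe"] = override["safe"]
        merged.append(entry)
    return merged


def is_reported_endangered(entry):
    """An entry marked safe is never reported, whatever its format says."""
    if entry.get("safe", False):
        return False
    return entry.get("format", {}).get("endangered", False)


# Assumed detection output; the 'UNKNOWN' values are placeholders.
detected = [
    {
        "filename": "/mnt/data/de/ad/be/ef/data#something.idk",
        "format": {"puid": "UNKNOWN", "name": "UNKNOWN", "endangered": True},
    },
    {
        "filename": "/mnt/data/de/ad/be/ef/data#program.exe",
        "format": {
            "puid": "fmt/899",
            "name": "Windows Portable Executable",
            "endangered": True,
        },
    },
]

# Entries taken from the README example.
known = [
    {
        "filename": "/mnt/data/de/ad/be/ef/data#something.idk",
        "format": {
            "puid": "fmt/729",
            "name": "SQLite Database File Format",
            "mime": "application/x-sqlite3",
            "endangered": False,
        },
    },
    {"filename": "/mnt/data/de/ad/be/ef/data#program.exe", "safe": True},
]

merged = apply_known_results(detected, known)
for entry in merged:
    print(entry["filename"], is_reported_endangered(entry))
```

In this sketch the first file stops being reported because its format is replaced, while the second keeps its detected (endangered) format but is suppressed by `safe: true`.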