From 1faacf6201af5d6f0f40d3f6c054adee3bab51f9 Mon Sep 17 00:00:00 2001
From: Maximilian Moser <maximilian.moser@tuwien.ac.at>
Date: Wed, 17 Jan 2024 17:52:22 +0100
Subject: [PATCH] Update formatscaper README with section about generating the input data

---
 formatscaper/README.md | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/formatscaper/README.md b/formatscaper/README.md
index d584315..2428cb3 100644
--- a/formatscaper/README.md
+++ b/formatscaper/README.md
@@ -181,6 +181,34 @@ The results file (e.g. `results.yml`) contains information about each investigat
 Note that the contents of the ZIP archive are inspected as well, with `#` as the delimiter between the archive's filename and the contained file's name.
 
 
+## Generating an input file from Invenio
+
+The required information is relatively straight-forward to generate using `invenio shell`:
+```python
+import yaml
+from invenio_rdm_records.proxies import current_rdm_records_service as svc
+
+# get all (published) records in the system
+rc = svc.record_cls
+recs = [rc(rm.data, model=rm) for rm in rc.model_cls.query.all()]
+
+# get the expected structure from the records
+record_files = [
+    {"record": r["id"], "filename": fn, "uri": entry.file.file.uri}
+    for r in recs
+    for fn, entry in r.files.entries.items()
+    if r.files.entries
+]
+
+# serialize the information as YAML file
+with open("record-files.yml", "w") as f:
+    yaml.dump(record_files, f)
+```
+
+The above script simply lists all files associated with any published record, but does not consider any unpublished drafts.
+Such changes are very straight-forward to implement though.
+
+
 ## Filtering results
 
 To filter results in the shell, you can use the command [`yq`](https://github.com/mikefarah/yq) (which is a [`jq`](https://github.com/jqlang/jq) wrapper for YAML documents).
-- 
GitLab