diff --git a/formatscaper/README.md b/formatscaper/README.md
index d5843151f93923a8ab6f369e6dcd5565fe4c0df0..2428cb3185502ef320dc5bd8d0459068327a7ebf 100644
--- a/formatscaper/README.md
+++ b/formatscaper/README.md
@@ -181,6 +181,34 @@ The results file (e.g. `results.yml`) contains information about each investigat
 Note that the contents of the ZIP archive are inspected as well, with `#` as the delimiter between the archive's filename and the contained file's name.
 
 
+## Generating an input file from Invenio
+
+The required information is relatively straightforward to generate using `invenio shell`:
+```python
+import yaml
+from invenio_rdm_records.proxies import current_rdm_records_service as svc
+
+# get all (published) records in the system
+rc = svc.record_cls
+recs = [rc(rm.data, model=rm) for rm in rc.model_cls.query.all()]
+
+# get the expected structure from the records
+record_files = [
+    {"record": r["id"], "filename": fn, "uri": entry.file.file.uri}
+    for r in recs
+    for fn, entry in r.files.entries.items()
+    if r.files.entries
+]
+
+# serialize the information as a YAML file
+with open("record-files.yml", "w") as f:
+    yaml.dump(record_files, f)
+```
+
+The script above lists all files associated with published records, but does not consider unpublished drafts.
+Extending it to cover drafts is straightforward, though.
+
+
 ## Filtering results
 
 To filter results in the shell, you can use the command [`yq`](https://github.com/mikefarah/yq) (which is a [`jq`](https://github.com/jqlang/jq) wrapper for YAML documents).