From 1faacf6201af5d6f0f40d3f6c054adee3bab51f9 Mon Sep 17 00:00:00 2001
From: Maximilian Moser <maximilian.moser@tuwien.ac.at>
Date: Wed, 17 Jan 2024 17:52:22 +0100
Subject: [PATCH] Update formatscaper README with section about generating the
 input data

---
 formatscaper/README.md | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/formatscaper/README.md b/formatscaper/README.md
index d584315..2428cb3 100644
--- a/formatscaper/README.md
+++ b/formatscaper/README.md
@@ -181,6 +181,34 @@ The results file (e.g. `results.yml`) contains information about each investigat
 Note that the contents of the ZIP archive are inspected as well, with `#` as the delimiter between the archive's filename and the contained file's name.
 
 
+## Generating an input file from Invenio
+
+The required information is straightforward to generate using `invenio shell`:
+```python
+import yaml
+from invenio_rdm_records.proxies import current_rdm_records_service as svc
+
+# get all (published) records in the system
+rc = svc.record_cls
+recs = [rc(rm.data, model=rm) for rm in rc.model_cls.query.all()]
+
+# build one entry (record ID, filename, file URI) per file, as expected in the input
+record_files = [
+    {"record": r["id"], "filename": fn, "uri": entry.file.file.uri}
+    for r in recs
+    for fn, entry in r.files.entries.items()
+    if r.files.entries
+]
+
+# serialize the information as a YAML file
+with open("record-files.yml", "w") as f:
+    yaml.dump(record_files, f)
+```
+
+The script above lists every file associated with a published record and does not consider unpublished drafts.
+Extending it to include drafts is straightforward, though.
+
+
 ## Filtering results
 
 To filter results in the shell, you can use the command [`yq`](https://github.com/mikefarah/yq) (which is a [`jq`](https://github.com/jqlang/jq) wrapper for YAML documents).
-- 
GitLab