*** Maintenance window: every first Wednesday morning of the month ***

# Tethys-database documentation
This documentation should help to set up and use the concentration database developed in the Tethys project.
[[_TOC_]]
## Concept
To allow for an intuitive user interface for persons working in water management on the one hand, and to allow several advanced technical functionalities on the other, the database is built with two layers, implemented in different database schemata:
1. The technical layer which holds the actual data and is only visible to administrators. This layer consists of tables all placed in the schema "hidden". Data from different environmental compartments are combined in a few tables based on common data structures. Data which were deleted by the user are still available here and can be restored in case of unintentional deletion.
2. The user-accessible layer which presents the data in a meaningful way to users. Here a separate schema for each environmental compartment presents the data for that compartment. All data presentations are technically only views on the data in the schema "hidden", presenting those columns and rows relevant for the respective environmental compartment. Deleted data are not shown here any more.
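As an illustration, a compartment view in this two-layer setup could be defined roughly as follows. This is only a sketch; all table and column names here are assumptions for illustration, not the actual schema.

```sql
-- Sketch (assumed names): the surface-water measurements view exposes only
-- those rows of the central hidden table that belong to this compartment
-- and hides soft-deleted records.
CREATE VIEW surfw.measurements AS
SELECT measurement_id, sample_id, determinant_id, measured_value, unit
FROM hidden.measurements
WHERE compartment = 'surfw'
  AND deleted_at IS NULL;
```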
## Included data
The aim of the database is to be able to host all concentration monitoring data from different environmental compartments which are needed to generate input data for emission modelling with the regionalized pathway approach (e.g. MoRE model).
To make the database also useful for managing data which need to be reported under the EU WISE system, some additional attributes were added to also allow storing concentrations in biota and in sediments. Information about the WISE data format was derived from [WISE-SoE_WaterQualityICM.pdf](https://gitlab.tuwien.ac.at/e226-1-working-group-river-basin-management/tethys/tethys-db-schema/-/blob/a8626906b46a229b8d2fb60cc36c605850c57615/documentation/WISE6_material/WISE-SoE_WaterQualityICM.pdf). Special views for WISE-compatible data export might be developed. Only disaggregated data (single values) can be stored in the database.
The environmental compartments foreseen at the moment (extension is possible) and depicted in different schemata are:
- Surface water: Data from rivers, lakes and reservoirs including concentrations in the liquid phase, suspended particulate matter and biota (for WISE compatibility).
- Groundwater: Concentrations measured in springs and wells.
- Wastewater (ww): Concentrations in municipal and industrial wastewater, treated and untreated.
- Sewage sludge (swsl): Concentrations in different kinds of sewage sludge.
- Stormwater runoff (stw): Concentrations in stormwater runoff from urban surfaces, sampled from storm sewers in the separate sewer system and from combined sewer overflows (CSO) in the combined sewer system. Untreated wastewater sampled during storm events from combined sewers might be stored here or in the wastewater section. Untreated wastewater from dry weather periods should be stored in the wastewater section.
- Atmospheric deposition (ad): Deposition from the atmosphere onto surfaces. Given either as deposition rates (mass per area and time) or as concentrations in a collection container which was collecting deposition for a certain period. To calculate deposition rates from these concentrations, the exposed collection area, the exposure time and the total sample volume need to be known.
- Soil: Substance content in soil samples (usually topsoil), given in mass of a substance per dry matter mass of soil.
- Sediment (sed): Substance content in bottom sediment samples.
## Structure
The database is structured into different schemata, which hold the tables for the different environmental compartments, and some schemata for general information:
| schema name | description |
| ------ | ------ |
| atdep | Atmospheric deposition: Measurements of substance transfer from atmosphere to surfaces. |
| data | Data sources: Documentation of the data sources, owners, suppliers and licenses. |
| determ | Determinants: Definition of determinants (substances & other parameters like pH), their classification and their environmental quality standards (EQS).|
| groundw | Groundwater: Concentrations in water below the earth's surface.|
| hidden | Schema containing all the data tables themselves. Only accessible to administrators.|
| public | The default schema created by PostgreSQL where all users have certain rights. Not used for content of the Tethys database. If objects are created without assigning them a dedicated schema, they will be stored here. |
| sed | Sediments: Concentrations in solid layers below the water column.|
| sewsl | Sewage sludge: Concentration in sludge resulting from wastewater treatment.|
| soil | Soil: Concentrations in samples from (top)soil contributing to surface water pollution via soil erosion and via washout to groundwater. |
| spm | Suspended particulate matter: Substance content measured in solid samples taken from suspended material transported by surface waters. |
| stormw | Stormwater runoff: Concentration in runoff generated mainly by precipitation events. Either as runoff from specific urban surfaces, as runoff in the storm sewer in separate sewer systems or as combined sewer overflow (CSO) in the combined sewer system.|
| surfw | Surface water: Concentrations measured in the liquid phase of surface waters: rivers, lakes, reservoirs, coastal waters ... |
| wastew | Wastewater: Wastewater from municipal or industrial sources, treated or untreated. |
All data are stored in the schema _hidden_; for user interaction they are presented in the different schemata via database views.
The same data are presented in different views for different purposes:
- For data import, only the columns which can be filled during data import are shown; columns filled automatically during import (e.g. the numerical id, the time of import and the importing user) are hidden here. Only data created by the users themselves are visible, and only as long as they have not gone through the quality assessment workflow. These views are appended with the suffix `_import`.
- A different view on the data is available to quality assessors for the quality assessment. Here some additional columns are included (like the user who created the data set) and only the two columns regarding quality assessment (quality_check_passed, comments_quality_assessment) can be updated. These views are appended with the suffix `_qa`.
- The data which were not accepted by the quality assessor are presented in a further view to the tethys_user for correction or deletion. These views are appended with the suffix `_qa_rejected`.
- Views for data evaluation and extraction: ... filtered for quality-checked data.
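An import view along these lines might look as follows. This is a sketch with assumed table and column names; the real views may differ.

```sql
-- Sketch (assumed names): only user-fillable columns are exposed, and only
-- the importing user's own rows that have not yet been quality-assessed.
CREATE VIEW surfw.measurements_import AS
SELECT sample_id, determinant_id, measured_value, unit
FROM hidden.measurements
WHERE compartment = 'surfw'
  AND deleted_at IS NULL
  AND created_by = session_user
  AND quality_check_passed IS NULL;
```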
## User rights management
User rights for working with the database depend on PostgreSQL roles. The following roles were defined:
### user (e.g. skittlaus)
Every user gets a login name and password. With this account the user can connect to the database (see section ...). Technically speaking this is the session-user, which is recorded when creating, changing or deleting data. This role has no further privileges. To see, create, modify and delete data, the user needs to take another role by running the command
`SET ROLE tethys_... ;`.
SQL implementation: Login; ALTER DEFAULT PRIVILEGES required before creation.
### tethys_reader
This role allows seeing and downloading the quality-checked database content. It has no privileges to create, modify or delete data. With this role the user can inspect the database without being able to do any harm to the data.
SQL implementation: Read, Select, Usage, Temporary, Execute, Connect, SET TRUE, WITH INHERIT TRUE, WITH GRANT OPTION FALSE.
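The grants behind this role could be set up roughly like the following sketch. The database name and the schema list are assumptions; the actual setup script may differ.

```sql
-- A read-only role without login, later granted to the login users.
CREATE ROLE tethys_reader NOLOGIN;
GRANT CONNECT, TEMPORARY ON DATABASE tethys TO tethys_reader;
GRANT USAGE ON SCHEMA surfw TO tethys_reader;           -- repeat per user schema
GRANT SELECT ON ALL TABLES IN SCHEMA surfw TO tethys_reader;
-- Make the role available to a login user (example user from above):
GRANT tethys_reader TO skittlaus WITH INHERIT TRUE, SET TRUE;
```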
### tethys_user
This role allows creating new data and modifying/deleting the data created by the users themselves. Once the data are quality-checked by a quality assessor, the user can no longer modify or delete their own data.
SQL implementation: Insert, Update, Delete, WITH INHERIT TRUE.
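A row-security policy in the hidden schema could implement the "only own, not yet quality-checked data" rule like this sketch (table and column names assumed):

```sql
ALTER TABLE hidden.measurements ENABLE ROW LEVEL SECURITY;

-- tethys_user may update or delete only rows they created themselves,
-- and only as long as the quality check has not been passed.
CREATE POLICY user_own_rows_upd ON hidden.measurements
  FOR UPDATE TO tethys_user
  USING (created_by = session_user AND quality_check_passed IS NOT TRUE);

CREATE POLICY user_own_rows_del ON hidden.measurements
  FOR DELETE TO tethys_user
  USING (created_by = session_user AND quality_check_passed IS NOT TRUE);
```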
### tethys_qa
The quality assessor can see data created by other users. He can comment on these data and mark them as quality-checked or send them back for revision.
<details><summary>Click to expand</summary>
Can the right for modifications be limited to specific columns? This would be perfect.

Answer: YES. PostgreSQL supports column-level privileges: a column list in `GRANT UPDATE` restricts which columns a role may change (`your_table` below is a placeholder):

```sql
-- tethys_qa may read everything, but update only the two QA columns.
GRANT SELECT ON your_table TO tethys_qa;
GRANT UPDATE (quality_check_passed, comments_quality_assessment)
  ON your_table TO tethys_qa;
```

Row level security can additionally restrict which rows the role may see or update:

```sql
ALTER TABLE your_table ENABLE ROW LEVEL SECURITY;
ALTER TABLE your_table FORCE ROW LEVEL SECURITY;

-- Allow tethys_qa to update any row; the column list above still applies.
CREATE POLICY qa_update_policy ON your_table
  FOR UPDATE TO tethys_qa
  USING (true)
  WITH CHECK (true);
```

Note that BYPASSRLS is a role attribute which disables row level security entirely; it cannot be granted per column. The per-column restriction comes from the column list in `GRANT UPDATE`.

SQL implementation: References, column-restricted Update, WITH INHERIT TRUE.
</details>
### tethys_ca
The content admin has the right to see, modify and delete all data in the database. In addition to the data visible to other roles, he can see and modify the tables in the schema "hidden", where all data are stored. Thus he can change controlled vocabulary and restore accidentally deleted data.
<details><summary>Click to expand</summary>
SQL implementation: BYPASSRLS unrestricted, WITH INHERIT TRUE.
</details>
### tethys_admin
The admin role can create new users, grant roles to users and change the structure of the database.
<details><summary>Click to expand</summary>
SQL implementation: Superuser status → Create, Truncate, Trigger, Alter system, Create role status, CreateDB status, Replication, WITH INHERIT FALSE, WITH GRANT OPTION TRUE, Alter default privileges
</details>
## Data quality assurance
The following procedures are implemented to ascertain data quality in the database:
- Traceability: Every record is automatically amended with the information who created it (column created_by) and at what time (column created_at), and who last modified it (column updated_by) and at what time (column updated_at). To document how pre-existing data were mapped onto the controlled vocabulary of the database, for several attributes there is an additional text column where the original value should be stored. This column has the suffix _reported. Data in the data tables are not physically deleted when users delete them, but marked as deleted and kept hidden from the ordinary user ("soft deletion"). They can be restored by the administrators on request.
- Usage of controlled vocabularies for qualitative attributes: Whenever a qualitative attribute is recorded, controlled vocabularies (a list of allowed values for that column) are used. These ensure that the same thing is always called the same (and without typos) and that no further data cleaning is necessary before data evaluation. Controlled vocabularies are stored in separate tables. These tables can not be changed by the ordinary user, but only by admins. Deletions in controlled vocabularies are not recorded in the database itself (no soft-delete mechanism); here the admin is responsible for keeping a backup.
- Check constraints for numeric values: Wherever possible, numeric values are checked for plausibility on insertion:
- areas, discharges, inhabitants ... must be within a meaningful range, usually > 0 and smaller than a maximum value.
<details><summary>Example</summary>
The catchment area of a river must be larger than 0 km² and smaller than 7000000 km² (size of the Amazon river basin).
`CONSTRAINT check_catchment_area_range CHECK ( catchment_area_km2 > 0 AND catchment_area_km2 < 7000000 )`
</details>
- further examples
- Quality assessment workflow: As described below.
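The traceability columns described above are typically filled by a trigger. A sketch with assumed names; the actual implementation may differ:

```sql
CREATE OR REPLACE FUNCTION hidden.set_audit_columns()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    NEW.created_by := session_user;  -- who created the record
    NEW.created_at := now();         -- and when
  END IF;
  NEW.updated_by := session_user;    -- who last modified it
  NEW.updated_at := now();
  RETURN NEW;
END;
$$;

CREATE TRIGGER trg_audit
BEFORE INSERT OR UPDATE ON hidden.measurements
FOR EACH ROW EXECUTE FUNCTION hidden.set_audit_columns();
```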
### Quality assessment workflow
After a user (role tethys_user) has uploaded data, they are shown in the import view until they have gone through the quality assessment procedure. As long as they are shown there, they can be updated or deleted by the user. 30 minutes after the last modification of a data row, it becomes available in the quality assessment workflow: it is shown to the quality assessors in the quality-assessment view.
The quality assessor can now check the data and decide whether the data and metadata are of sufficient quality or whether further updates are needed. If the data are fine, the quality assessor marks them as quality-checked and they become available to all users of the database for evaluation. If the data need revision, the quality assessor fills his remarks into the column comments_quality_assessment and sets the column quality_check_passed to FALSE.
The data needing revision are presented to the user who created them in a dedicated view (suffix `_qa_rejected`) where they can be updated or deleted. The quality assessor still sees them in the quality assessment workflow as long as they are not changed. When changes occur, the data are not visible to the quality assessor until 30 minutes after the last modification, when they appear again.
The 30-minute embargo, during which data are not shown to the quality assessor after creation/modification, should avoid the quality assessor already processing data which are still under creation/modification by the user.
**The quality assessor himself can not change the data. He only provides feedback to the user who created the data.**
<details><summary>Technical Details</summary>
The quality assessment workflow is based on three columns in the tables:
- quality_assessor: The person doing the quality assessment. Will be filled automatically with the session-user when a user in the role tethys_qa updates the other two columns.
- comments_quality_assessment: Issues found by the quality assessor can be noted here as feedback to the user.
- quality_check_passed (boolean): This column indicates the progress of the quality assessment:
- NULL = quality assessment not done yet.
- FALSE = quality assessment done, but issues found which need further clarifications/corrections.
- TRUE = quality assessment passed. The data can be used for further evaluation. The user who created the data loses the right to change or delete the data.
</details>
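Putting the three columns and the 30-minute embargo together, the quality-assessment view could be sketched like this (table and column names assumed):

```sql
-- Rows become visible to quality assessors only 30 minutes after the
-- last modification, and only while the check has not been passed yet.
CREATE VIEW surfw.measurements_qa AS
SELECT m.*
FROM hidden.measurements AS m
WHERE m.compartment = 'surfw'
  AND m.quality_check_passed IS NOT TRUE
  AND m.updated_at < now() - interval '30 minutes';
```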
## Referencing other tables
Data in different tables are connected by id columns, so-called foreign keys. E.g. in the measurements table, each measurement of a concentration references a sample in the samples table via the sample_identifier key. Data can only be stored in the database if all referenced data are already available. This means there is a given order of data import resulting from the data structure.
1. The controlled vocabularies need to be imported by the admins.
1. Tables only referencing controlled vocabularies can be filled:
- data.suppliers
- data.licenses
- det.determinants
- det.analytical_methods
- det.laboratories
- surfw.water_bodies
- groundw.water_bodies
1. Next these tables can be filled:
- data.sources
- det.environmental_quality_standards
- det.classifications
- det.relations
- surfw.water_bodies
- groundw.water_bodies
1. Next these tables can be filled:
- wastew.wwtp
- soil.sampling_sites
1. Next these tables can be filled:
- xxx.sampling_sites
1. Next these tables can be filled:
- xxx.samples
- surfw.online_measurements
1. Finally the tables
- xxx.measurements
<details><summary>Technical details on table partitioning and use of composite primary keys</summary>
The tables with measurements and samples use table partitioning in the background to store data from different environmental compartments in different sub-tables and thus reduce the time for data retrieval. For this reason the compartment column is included in the primary key together with the generated id column.
Furthermore, the composite primary key is used to ensure that measurements, samples and sampling sites always carry matching compartment information, to avoid e.g. a groundwater sample being assigned to a wastewater sampling site.
To allow the compartment to be included twice as a foreign key (once referencing the compartments table directly, and once referencing e.g. the sampling site's compartment from the samples table), generated columns are used to fill the additional columns, and check constraints ensure that the compartment information matches.
</details>
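The partitioning and the composite key described above might be declared like the following sketch (table and column names are assumptions):

```sql
CREATE TABLE hidden.measurements (
  measurement_id bigint GENERATED ALWAYS AS IDENTITY,
  compartment    text   NOT NULL REFERENCES hidden.compartments,
  sample_id      bigint NOT NULL,
  measured_value numeric,
  -- the compartment is part of the key: a partitioned table's primary key
  -- must include the partition key
  PRIMARY KEY (measurement_id, compartment)
) PARTITION BY LIST (compartment);

-- one sub-table per environmental compartment
CREATE TABLE hidden.measurements_surfw PARTITION OF hidden.measurements
  FOR VALUES IN ('surfw');
```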