
Tethys Database schema
Tethys database documentation
This documentation should help you to set up and use the concentration database developed in the Tethys project.
Other help pages are Frequent Error Messages and Frequently Asked Questions.
Concept
To allow an intuitive user interface for people working in water management on the one hand and several advanced technical functionalities on the other, the database is built with two layers, implemented in different database schemata:
- The technical layer, which holds the actual data and is only visible to administrators. This layer consists of tables that are all placed in the schema "hidden". Data from different environmental compartments are combined in a few tables based on common data structures. Data deleted by the user are still available here and can be restored in case of unintentional deletion.
- The user-accessible layer, which presents the data in a meaningful way to users. Here a different schema for each environmental compartment presents the data for that compartment. All data presentations are technically only views on the data in the schema "hidden", presenting those columns and rows relevant for the respective environmental compartment. Deleted data are not shown here any more.
Included data
The aim of the database is to be able to host all concentration monitoring data from different environmental compartments which are needed to generate input data for emission modelling with the regionalized pathway approach (e.g. MoRE model).
To make the database also useful for managing data which need to be reported under the EU WISE system, some additional attributes were added to allow also storing concentrations in biota and in sediments. Information about the WISE data format was derived from WISE-SoE_WaterQualityICM.pdf. Special views for WISE-compatible data export might be developed. Only disaggregated data (single values) can be stored in the database.
The environmental compartments foreseen at the moment (extension is possible) and depicted in different schemata are:
- Surface water: Data from rivers, lakes and reservoirs including concentrations in the liquid phase, suspended particulate matter and biota (for WISE compatibility).
- Groundwater: Concentrations measured in springs and wells.
- Wastewater (ww): Concentrations in municipal and industrial wastewater treated and untreated and in different kinds of sewage sludge.
- Stormwater runoff (stw): Concentrations in stormwater runoff from urban surfaces, sampled from storm sewers in the separate sewer system and from combined sewer overflows (CSO) in the combined sewer system. Untreated wastewater sampled during storm events from combined sewers might be stored here or in the wastewater section. Untreated wastewater from dry weather periods should be stored in the wastewater section.
- Atmospheric deposition (ad): Deposition from the atmosphere onto surfaces. Given either as deposition rates (mass per area and time) or as concentrations in a collection container which was collecting deposition for a certain period. To calculate deposition rates from these concentrations, the exposed collection area, the exposure time and the total sample volume need to be known (a short SQL sketch follows after this list).
- Soil: Substance content in soil samples (usually topsoil), given in mass of a substance per dry matter mass of soil.
- Sediment (sed): Substance content in bottom sediment samples.
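As an illustration of the deposition rate calculation mentioned above, the following minimal SQL sketch converts a container concentration into a deposition rate. The column names are invented for illustration and the table name is assumed; the real columns are documented in the _column_definition view of the atdep schema:

-- Deposition rate = concentration x sample volume / (collection area x exposure time)
SELECT concentration_mg_per_l * sample_volume_l            -- mg/L x L = mg deposited
       / (collection_area_m2 * exposure_time_d)            -- per m² and day
       AS deposition_rate_mg_per_m2_d
FROM atdep.measurements;                                   -- table name assumed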
Structure
Schemata
The database is structured into different schemata, which hold the tables for the different environmental compartments, plus some schemata for general information:
| schema name | description |
|---|---|
| atdep | Atmospheric deposition: Measurements of substance transfer from the atmosphere to surfaces. |
| data | Data sources: Documentation of the data sources, owners, suppliers and licenses. |
| determ | Determinants: Definition of determinants (substances & other parameters like pH), their classification and their environmental quality standards (EQS). |
| groundw | Groundwater: Concentrations in water below the earth's surface. |
| hidden | Schema containing all the data tables themselves. Only accessible for administrators. |
| public | The default schema created by PostgreSQL where all users have certain rights. If objects are created without assigning them a dedicated schema, they will be stored here. In this schema the quality-checked data are presented for download. |
| sed | Sediments: Concentrations in solid layers below the water column. |
| soil | Soil: Concentrations in samples from (top)soil contributing to surface water pollution via soil erosion and via washout to groundwater. |
| stormw | Stormwater runoff: Concentrations in runoff generated mainly by precipitation events. Either as runoff from specific urban surfaces, as runoff in the storm sewer in separate sewer systems or as combined sewer overflow (CSO) in the combined sewer system. |
| surfw | Surface water: Concentrations measured in the liquid phase, biota or suspended particulate matter in surface waters: rivers, lakes, reservoirs, coastal waters ... |
| wastew | Wastewater: Wastewater from municipal or industrial sources, treated or untreated. Also including concentrations in sludge resulting from wastewater treatment. |
All data are stored in the schema "hidden"; for user interaction they are presented in the different schemata via database views.
Views in schemata besides public
The same data are presented in different views for different purposes:
- For data import purposes only the columns which can be filled during data import are shown; additional columns filled automatically during data import are hidden here (e.g. the numerical id, the time of import and the user importing the data). Only data created by the users themselves are visible here, and only as long as they have not gone through the quality assessment workflow. These views are appended with the suffix _import.
- A different view on the data is available to quality assessors for the quality assessment. Here some additional data are included (like the user who created the data set) and only the two columns regarding quality assessment (quality_check_passed, comments_quality_assessment) can be updated. These views are appended with the suffix _qa.
- The data which were not accepted by the quality assessor are presented in a further view to the tethys_user for correction or deletion. These views are appended with the suffix _qa_rejected.
- To check all imported data independently of their quality assessment status, views without a suffix are available for every table. These views are read-only; for changing or deleting data, please use the aforementioned views.
At the top of every schema are three support views whose names start with an underscore (_).
- The _column_definition view contains comprehensive information about a column's properties regarding its description, data type, constraints or keys and should therefore serve as a tool for importing activities.
- The _controlled_vocabularies view contains predefined column entries, originally stored in hidden and maintained only by tethys_ca. They serve as foreign keys for the _import views and are marked as such in the _column_definition view. Creation rights are limited because these entries are few in number but referenced very often, so restricting changes to tethys_ca preserves data integrity at the source.
- The _table_definition view contains information about all views and tables of its schema and therefore gives an overview of the database structure.
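For example, the columns that must be filled during import can be looked up in the support views. A minimal sketch; the exact column set of _column_definition is an assumption, the mandatory marker column is described in the constraints section below:

-- Which columns must be filled when importing surface water data?
SELECT *
FROM surfw._column_definition
WHERE mandatory;   -- assuming the mandatory marker is a boolean flag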
Views in public
- Views for data evaluation and extraction: These views, named after the environmental compartment from which their data originate, contain quality-checked data combined into wide tables displaying all collected information. They only omit columns which are recorded for data traceability, e.g. columns with the originally reported text strings which were mapped onto the controlled vocabularies.
- Views to support the quality assessment workflow: The two views _qa_awaiting and _rejected_data provide an overview of data that have not yet been quality assessed or corrected. They aggregate information about recently imported or rejected data: schema, view, number of rows, and the importer or the quality assessor, respectively.
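A quick check of the workflow status could then look like this (view names as given above):

-- Data imported but not yet quality assessed:
SELECT * FROM public._qa_awaiting;
-- Data rejected by a quality assessor and awaiting correction:
SELECT * FROM public._rejected_data;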
User rights management
User rights for working with the database depend on PostgreSQL roles. The following roles were defined:
user (e.g. skittlaus)
Every user gets a login name and password. With this account the user can connect to the database (see section ...). Technically speaking this is the session user, which is recorded when creating, changing or deleting data. This role has no further privileges. To see, create, modify and delete data, the user needs to take an additional role by running the command SET ROLE tethys_... ; as sketched below.
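A minimal sketch of a session (role names per this page; SET ROLE and RESET ROLE are standard PostgreSQL commands):

-- Connect with the personal login role, then switch to a functional role:
SET ROLE tethys_reader;   -- read-only access to quality-checked data
-- ... work with the data ...
RESET ROLE;               -- switch back to the personal login role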
tethys_reader
This role allows seeing and downloading the quality-checked database content. It has no privileges to create, modify or delete data. With this role the user can explore the database without being able to do any harm to the data.
tethys_user
This role allows creating new data and modifying/deleting the data created by the user themselves. Once the data are quality checked by a quality assessor, the user cannot modify or delete their own data any more. The tethys_user inherits all rights from the tethys_reader.
tethys_qa
The quality assessor can see data created by other users. He can comment on these data and mark them as quality checked or send them back for revision. The tethys_qa role inherits all rights from the tethys_reader role.
Can the right for modifications be limited to specific columns? This would be perfect. Answer: YES, via column-level privileges. The policy function sketched in the original notes is not needed: PostgreSQL itself rejects updates touching columns for which the role has no UPDATE privilege (placeholder name your_table kept from the notes):

-- Allow tethys_qa to update only the two quality assessment columns:
GRANT SELECT ON your_table TO tethys_qa;
GRANT UPDATE (quality_check_passed, comments_quality_assessment)
    ON your_table TO tethys_qa;
-- Row level security additionally restricts which rows a role may change:
ALTER TABLE your_table ENABLE ROW LEVEL SECURITY;
ALTER TABLE your_table FORCE ROW LEVEL SECURITY;
CREATE POLICY qa_update_policy ON your_table
    FOR UPDATE TO tethys_qa USING (true) WITH CHECK (true);

SQL implementation: BYPASSRLS restricted, WITH INHERIT TRUE.
tethys_ca
The content admin has the right to see, modify and delete all data in the database. In addition to the data visible to other roles, he can see and modify the tables in the schema "hidden", where all data are stored. Thus he can change controlled vocabularies and restore accidentally deleted data.
The tethys_ca inherits all rights from the tethys_user role.
SQL implementation: BYPASSRLS unrestricted, WITH INHERIT TRUE.
tethys_admin
The admin role can create new users, grant roles to users and change the structure of the database.
The tethys_admin inherits all rights from the tethys_ca role.
SQL implementation: superuser status → CREATE, TRUNCATE, TRIGGER, ALTER SYSTEM, CREATEROLE, CREATEDB, REPLICATION, WITH INHERIT FALSE, WITH GRANT OPTION TRUE, ALTER DEFAULT PRIVILEGES.
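The role hierarchy described above could be set up roughly as follows. This is a minimal sketch only: options such as NOLOGIN and the password are assumptions, while the membership structure follows the inheritance statements on this page:

CREATE ROLE tethys_reader NOLOGIN;
CREATE ROLE tethys_user  NOLOGIN IN ROLE tethys_reader;  -- inherits reader rights
CREATE ROLE tethys_qa    NOLOGIN IN ROLE tethys_reader;  -- inherits reader rights
CREATE ROLE tethys_ca    NOLOGIN IN ROLE tethys_user;    -- inherits user rights
CREATE ROLE tethys_admin CREATEROLE CREATEDB IN ROLE tethys_ca;  -- inherits ca rights
-- Personal accounts log in and are granted a functional role:
CREATE ROLE skittlaus LOGIN PASSWORD '...';
GRANT tethys_user TO skittlaus;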
Data quality assurance
The following procedures are implemented to ascertain data quality in the database:
- Traceability: Every record is automatically amended with the information who created it (column created_by) and at what time (column created_at), and who last modified it (column updated_by) and at what time (column updated_at). To document how pre-existing data were mapped onto the controlled vocabulary of the database, for several attributes there is an additional text column where the original value should be stored. This column has the suffix _reported. Data are not physically removed from the data tables when users delete them, but marked as deleted and kept hidden from the ordinary user ("soft deletion"). They can be restored by the administrators on request.
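A sketch of how such traceability columns can be declared (table name and types assumed; note the SESSION_USER keyword, compare the setup steps below):

ALTER TABLE hidden.samples
    ADD COLUMN created_by text        NOT NULL DEFAULT SESSION_USER,
    ADD COLUMN created_at timestamptz NOT NULL DEFAULT now();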
- Constraints: This database makes use of five kinds of user-relevant constraints.
- Primary key constraints: Ensure uniqueness as well as a not-null condition of a column entry or a combination of column entries. As users are not supposed to import the id column (it is automatically generated), its only relevance to users is that the id column needs to be selected as custom key when deleting or updating data (compare issue #10).
- Not-null constraints: Ensure that a not-null condition of a column entry or a combination of column entries is satisfied. In the _column_definition view those columns are marked in the mandatory column.
- Check constraints: Ensure the satisfaction of certain conditions, based on logical queries or on the existence of other column entries in the row.
Example 1
Check constraints for numeric values: Wherever possible, numeric values (e.g. areas, discharges, inhabitants) are checked on insertion for their plausibility, i.e. to be within a meaningful range, usually > 0 and smaller than a maximum value. The catchment area of a river must be larger than 0 km² and smaller than 7,000,000 km² (size of the Amazon river basin).
SQL implementation: CONSTRAINT check_catchment_area_range CHECK (catchment_area_km2 > 0 AND catchment_area_km2 < 7000000)
Example 2
Values are checked based on values in other columns: If a measurement is marked as below an analytical limit, the value of the limit must be given in the corresponding column. If a measurement is marked as below the analytical limit of detection (LOD), the LOD must be given in the corresponding column.
SQL implementation: CONSTRAINT check_lod_available CHECK ((lod > 0 AND value_below_lod IS TRUE) OR (value_below_lod IS NOT TRUE))
- Unique key constraints: Ensure that uniqueness of a column entry or a combination of column entries is satisfied. Some unique keys are supposed to store aggregated information about their row. If this is the case, it can be seen in the description of the column. In the _column_definition view those columns are marked in the _col_constraints column with (u).
Example 1
A combination of columns forms the unique key.
SQL implementation: CONSTRAINT unique_measurement UNIQUE (compartment, determinant, identifier_analytical_method, identifier_lab, identifier_sample)
Example 2
A unique key, in this case an identifier, encodes several pieces of information about its row. identifier_analytical_method: Identifier of an analytical laboratory method defined in the table determ.analytical_methods. A laboratory method in the best case is defined by a standard (ISO or other) and consists of sample preparation and sample analysis techniques with appropriate measures for quality assurance and a validation procedure.
- Foreign key constraints: Ensure the case-sensitive referencing of a column entry or a combination of column entries in the so-called "child table", originally migrated from the "parent table", hence providing referential integrity. In the _column_definition view those columns are marked in the _col_constraints column with (f).
Example
A foreign key references one or, in this example, two columns of another table case-sensitively.
SQL implementation: ALTER TABLE hidden.samples ADD FOREIGN KEY (identifier_sampling_site, sampling_site_compartment) REFERENCES hidden.sampling_sites (identifier_sampling_site, compartment) ON UPDATE RESTRICT ON DELETE RESTRICT;
- Controlled vocabularies: These are a special version of foreign keys, used here for qualitative attributes: Whenever a qualitative attribute is recorded, controlled vocabularies (a list of allowed values for that column) are used. These assure that the same thing is always called the same (and without typos) and that no further data cleaning is necessary before data evaluation. Controlled vocabularies are stored in separate tables. These tables cannot be changed by the ordinary user, but only by admins. Deletions in controlled vocabularies are not recorded in the database itself (no soft-delete mechanism); here the admin is responsible for keeping a backup. In the _column_definition view those columns are marked in the _controlled_vocabulary column. Possible entries for those columns can be seen in the _controlled_vocabularies view.
Example
The accreditation status of measurements is managed by a boolean column with controlled vocabulary, used according to the following schema:
- yes, fully: true
- no, but the values are of high reliability: false
- no: Exclude such data from data base
- unknown: NULL
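A minimal sketch of how a controlled vocabulary works technically (table and column names invented for illustration; the matrix values are taken from the surface water description above):

-- The allowed values live in a table in the schema hidden, maintained by tethys_ca:
CREATE TABLE hidden.cv_matrix (matrix text PRIMARY KEY);
INSERT INTO hidden.cv_matrix
VALUES ('liquid phase'), ('suspended particulate matter'), ('biota');
-- Data tables reference it, so typos are rejected already on insert:
ALTER TABLE hidden.samples
    ADD COLUMN matrix text REFERENCES hidden.cv_matrix (matrix);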
Especially during importing it is crucial to keep track of the constraints present in a table. To simplify this task, the following graphic should serve as a support sheet.
- Quality assessment workflow: As described below.
Quality assessment workflow
After a user (role tethys_user) has uploaded data, they are shown in the import view until they have gone through the quality assessment procedure. As long as they are shown here, they can be updated or deleted by the user. 30 minutes after the last modification of a data row, it becomes available in the quality assessment workflow: it is shown to the quality assessors in the quality assessment view.
The quality assessor can now check the data and decide whether the data and metadata are of sufficient quality or whether further updates are needed. If the data are fine, the quality assessor marks them as quality checked and they become available to all users of the database for evaluation. If the data need revision, the quality assessor writes his remarks in the column comments_quality_assessment and sets the column quality_check_passed to FALSE.
The data needing revision are presented to the user who created them in a dedicated view (suffix _qa_rejected), where they can be updated or deleted. The quality assessor still sees them in the quality assessment workflow as long as they are not changed. When changes occur, they are not visible to the quality assessor until 30 minutes after the last modification, when they appear again.
The 30-minute embargo time during which data are not shown to the quality assessor after creation/modification should avoid the quality assessor processing data which are still under creation/modification by the user.
The quality assessor himself cannot change the data. He only provides feedback to the user who created the data.
Technical Details
The quality assessment workflow is based on three columns in the tables:
- quality_assessor: The person doing the quality assessment. Will be filled automatically with the session user when a user in the role tethys_qa updates the other two columns.
- comments_quality_assessment: Issues found by the quality assessor can be noted here as feedback to the user.
- quality_check_passed (boolean): This column indicates the progress of the quality assessment:
- NULL = quality assessment not done yet.
- FALSE = quality assessment done, but issues found which need further clarifications/corrections.
- TRUE = quality assessment passed. The data can be used for further evaluation. The user who created the data loses the right to change or delete the data.
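As a sketch, rejecting a record in one of the _qa views could look like this (the view name follows the _qa convention described above; the id value is assumed):

SET ROLE tethys_qa;
UPDATE surfw.measurements_qa
   SET quality_check_passed = FALSE,
       comments_quality_assessment = 'Unit looks implausible, please check.'
 WHERE id = 42;
-- quality_assessor is filled automatically with the session user.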
Order of data import (Referencing other tables)
Data in different tables are connected by id columns, so-called foreign keys. E.g. in the measurements table each measurement of a concentration references a sample in the samples table by the sample_identifier key. Data can only be stored in the database if all referenced data are already available. This means there is a given order of data import resulting from the data structure:
- The controlled vocabularies need to be imported by the admins.
- Tables only referencing controlled vocabularies can be filled:
- data.suppliers
- data.licenses
- data.laboratories
- determ.determinants
- determ.analytical_methods
- Next these tables can be filled:
- data.sources
- determ.environmental_quality_standards
- determ.classifications
- determ.relations
- Next these tables can be filled:
- surfw.water_bodies
- groundw.water_bodies
- wastew.wwtp
- soil.sampling_sites
- Next these tables can be filled:
- xxx.sampling_sites
- Next these tables can be filled:
- xxx.samples
- surfw.online_measurements
- Finally these tables can be filled:
- xxx.measurements
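A minimal sketch of an import respecting this order (view names follow the _import convention described above; the column lists are shortened and partly assumed, the real mandatory columns are listed in the _column_definition view):

SET ROLE tethys_user;
-- 1. The sampling site must exist before a sample can reference it:
INSERT INTO surfw.sampling_sites_import (identifier_sampling_site)
VALUES ('RHINE-KM-100');
-- 2. The sample references the site:
INSERT INTO surfw.samples_import (identifier_sample, identifier_sampling_site)
VALUES ('S-2024-001', 'RHINE-KM-100');
-- 3. The measurement references the sample:
INSERT INTO surfw.measurements_import (identifier_sample, determinant, value)
VALUES ('S-2024-001', 'Cu', 4.2);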
NOTE: After updating or deleting rows, you might be asked to select a key (compare issue #10). Please select Custom Key and then choose the id column. This needs to be done only once for every table on every computer. Additionally, please note that the table soil_sample_belonging_to_site has no id column; the equivalent choice there are the two identifier columns of sample and site.
Technical aspects not important for basic users
Table partitioning and use of composite primary keys
The tables with measurements and samples use "table partitioning" in the background to store data from different environmental compartments in different sub-tables and thus reduce the time for data retrieval. For this reason the compartment column is included in the primary key together with the generated id column.
Furthermore, the composite primary key is used to assure that measurements, samples and sampling sites always have matching compartment information, to avoid e.g. a groundwater sample being assigned to a wastewater sampling site.
To allow the compartment to be included twice as a foreign key, referencing once the compartments table directly and once e.g. the sampling site's compartment from the samples table, so-called "generated columns" are used to fill the additional columns, and check constraints assure that the compartment information matches.
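A sketch of this construction for the samples table, assuming hidden.compartments and hidden.sampling_sites already exist with matching keys; the column list is shortened and partly taken from the foreign key example above:

CREATE TABLE hidden.samples (
    id bigserial,
    compartment text NOT NULL REFERENCES hidden.compartments (compartment),
    identifier_sample text NOT NULL,
    identifier_sampling_site text NOT NULL,
    -- generated copy of compartment, reused in the composite foreign key:
    sampling_site_compartment text GENERATED ALWAYS AS (compartment) STORED,
    PRIMARY KEY (id, compartment),
    FOREIGN KEY (identifier_sampling_site, sampling_site_compartment)
        REFERENCES hidden.sampling_sites (identifier_sampling_site, compartment)
        ON UPDATE RESTRICT ON DELETE RESTRICT
) PARTITION BY LIST (compartment);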
Steps to set up the database from DDL scripts
- Export the DDL script from ERMaster.
- Perform "search and replace" of 'SESSION_USER' by SESSION_USER in the DDL script to make this default value a variable instead of a text string.
- Add table partitioning if you expect the tables to become big. Find the table creation statements in the DDL script and append to them as follows:
CREATE TABLE hidden.meas
(...) PARTITION BY LIST(compartment);
CREATE TABLE hidden.meas_groundw PARTITION OF hidden.meas FOR VALUES IN ('groundwater');
CREATE TABLE hidden.meas_surfw PARTITION OF hidden.meas FOR VALUES IN ('surfacewater');
CREATE TABLE hidden.meas_wastew PARTITION OF hidden.meas FOR VALUES IN ('wastewater');
CREATE TABLE hidden.meas_stormw PARTITION OF hidden.meas FOR VALUES IN ('stormwater runoff');
CREATE TABLE hidden.meas_soil PARTITION OF hidden.meas FOR VALUES IN ('soil');
CREATE TABLE hidden.meas_def PARTITION OF hidden.meas DEFAULT;
CREATE TABLE hidden.samples
(...) PARTITION BY LIST(compartment);
CREATE TABLE hidden.samples_groundw PARTITION OF hidden.samples FOR VALUES IN ('groundwater');
CREATE TABLE hidden.samples_surfw PARTITION OF hidden.samples FOR VALUES IN ('surfacewater');
CREATE TABLE hidden.samples_wastew PARTITION OF hidden.samples FOR VALUES IN ('wastewater');
CREATE TABLE hidden.samples_stormw PARTITION OF hidden.samples FOR VALUES IN ('stormwater runoff');
CREATE TABLE hidden.samples_soil PARTITION OF hidden.samples FOR VALUES IN ('soil');
CREATE TABLE hidden.samples_def PARTITION OF hidden.samples DEFAULT;
- Run the scripts in \SQL_scripts\Make_scripts and pay attention to the comments at the beginning of the scripts, e.g. for putting in the name of your database.
Special case of compartment-dependent obligatory columns
For columns that are mandatory only for certain compartments, a NOT NULL constraint is not applicable, because it would apply to the whole table and not only to the compartment-specific views. Hence, the required constraints are realised as compartment-specific check constraints with names ending in the string "_given", which makes them distinguishable in an SQL query and therefore also embeddable in the _column_definition view.
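As a sketch (the column identifier_wwtp is an assumption for illustration): a column that is obligatory only for the wastewater compartment could be enforced like this:

ALTER TABLE hidden.samples
    ADD CONSTRAINT check_identifier_wwtp_given
    CHECK (compartment <> 'wastewater' OR identifier_wwtp IS NOT NULL);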