Overview of an enrichment datastore

An Enrichment Datastore is a user-managed storage location where the Qualytics platform records and accesses metadata through a set of system-defined tables. It is purpose-built to capture metadata generated by the platform's profiling and scanning operations.

Key Points

  1. Metadata Storage: The Enrichment Datastore acts as a dedicated mechanism for writing and retaining metadata that the platform generates. This includes information about anomalies, quality checks, field profiling, and additional details that enrich the source data.

  2. Feature Enablement: By using the Enrichment Datastore, the platform unlocks certain features such as the previewing of source records. For instance, when an anomaly is detected, the platform typically previews a limited set of affected records. For a comprehensive view and persistent access, the Enrichment Datastore captures and maintains a complete snapshot of the source records associated with the anomalies.

  3. User-Managed Location: While the Qualytics platform handles the generation and processing of metadata, the actual storage is user-managed. This means that the user maintains control over the Enrichment Datastore, deciding where and how this data is stored, adhering to their governance and compliance requirements.

  4. Insight and Reporting: Beyond storing metadata, the Enrichment Datastore allows users to derive actionable insights and develop custom reports for a variety of use cases, from compliance tracking to data quality improvement initiatives.

Table Types

The Enrichment Datastore contains several types of tables, each serving a specific purpose in the data enrichment and remediation process. These tables are categorized into:

  • Enrichment Tables
  • Remediation Tables
  • Metadata Tables

Enrichment Tables

When anomalies are detected, the platform writes metadata into three primary enrichment tables:

  • <enrichment_prefix>_failed_checks
  • <enrichment_prefix>_source_records
  • <enrichment_prefix>_scan_operations

_FAILED_CHECKS Table

Acts as an associative entity that consolidates information on failed checks, linking each anomaly to the quality checks it failed.

Columns

| Name | Data Type | Description |
| --- | --- | --- |
| QUALITY_CHECK_ID | NUMBER | Unique identifier for the quality check. |
| ANOMALY_UUID | STRING | UUID for the anomaly detected. |
| QUALITY_CHECK_MESSAGE | STRING | Message describing the quality check outcome. |
| SUGGESTED_REMEDIATION_FIELD | STRING | Field targeted by the suggested remediation. |
| SUGGESTED_REMEDIATION_VALUE | STRING | Suggested value for remediation. |
| SUGGESTED_REMEDIATION_SCORE | FLOAT | Score indicating confidence in the suggested remediation. |
| QUALITY_CHECK_RULE_TYPE | STRING | Type of rule applied by the quality check. |
| QUALITY_CHECK_TAGS | STRING | Tags associated with the quality check. |
| QUALITY_CHECK_PARAMETERS | STRING | Parameters used for the quality check. |
| QUALITY_CHECK_DESCRIPTION | STRING | Description of the quality check. |
| OPERATION_ID | NUMBER | Identifier for the operation that detected the anomaly. |
| DETECTED_TIME | TIMESTAMP | Timestamp when the anomaly was detected. |
| SOURCE_CONTAINER | STRING | Name of the source data container. |
| SOURCE_PARTITION | STRING | Partition of the source data. |
| SOURCE_DATASTORE | STRING | Datastore where the source data resides. |

Info

This table is not characterized by unique ANOMALY_UUID or QUALITY_CHECK_ID values alone. Instead, the combination of ANOMALY_UUID and QUALITY_CHECK_ID serves as a composite key, uniquely identifying each record in the table.
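
For example, the composite key makes per-anomaly aggregation straightforward. A minimal sketch, substituting your enrichment-prefixed table name for "_FAILED_CHECKS":

SELECT
    ANOMALY_UUID,
    COUNT(*) AS failed_check_count,
    MIN(DETECTED_TIME) AS first_detected
FROM "_FAILED_CHECKS"
GROUP BY ANOMALY_UUID
ORDER BY failed_check_count DESC;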

_SOURCE_RECORDS Table

Stores source records in JSON format, primarily to enable the source record preview feature in the Qualytics App.

Columns

| Name | Data Type | Description |
| --- | --- | --- |
| SOURCE_CONTAINER | STRING | Name of the source data container. |
| SOURCE_PARTITION | STRING | Partition of the source data. |
| ANOMALY_UUID | STRING | UUID for the associated anomaly. |
| CONTEXT | STRING | Contextual information for the anomaly. |
| RECORD | STRING | JSON representation of the source record. |
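
Because RECORD holds the full source row as JSON, individual attributes can be extracted directly in SQL. A sketch in Snowflake syntax, with a hypothetical ORDERKEY attribute inside the stored record (substitute your enrichment-prefixed table name):

SELECT
    ANOMALY_UUID,
    SOURCE_CONTAINER,
    PARSE_JSON(RECORD):ORDERKEY::string AS order_key
FROM "_SOURCE_RECORDS"
WHERE SOURCE_CONTAINER = 'Orders';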

_SCAN_OPERATIONS Table

Captures and stores the results of every scan operation conducted on the Qualytics Platform.

Columns

| Name | Data Type | Description |
| --- | --- | --- |
| OPERATION_ID | NUMBER | Unique identifier for the scan operation. |
| DATASTORE_ID | NUMBER | Identifier for the source datastore associated with the operation. |
| CONTAINER_ID | NUMBER | Identifier for the container associated with the operation. |
| CONTAINER_SCAN_ID | NUMBER | Identifier for the container scan associated with the operation. |
| PARTITION_NAME | STRING | Name of the source partition on which the scan operation is performed. |
| INCREMENTAL | BOOLEAN | Boolean flag indicating whether the scan operation is incremental. |
| RECORDS_PROCESSED | NUMBER | Total number of records processed during the scan operation. |
| ENRICHMENT_SOURCE_RECORD_LIMIT | NUMBER | Maximum number of records written to the enrichment for each anomaly detected. |
| MAX_RECORDS_ANALYZED | NUMBER | Maximum number of records analyzed in the scan operation. |
| ANOMALY_COUNT | NUMBER | Total number of anomalies identified in the scan operation. |
| START_TIME | TIMESTAMP | Timestamp marking the start of the scan operation. |
| END_TIME | TIMESTAMP | Timestamp marking the end of the scan operation. |
| RESULT | STRING | Textual representation of the scan operation's status. |
| MESSAGE | STRING | Detailed message regarding the process of the scan operation. |
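
For instance, scan duration and anomaly rate can be derived directly from this table. A sketch in PostgreSQL syntax (substitute your enrichment-prefixed table name):

SELECT
    OPERATION_ID,
    RECORDS_PROCESSED,
    ANOMALY_COUNT,
    END_TIME - START_TIME AS duration,
    ANOMALY_COUNT::float / NULLIF(RECORDS_PROCESSED, 0) AS anomaly_rate
FROM "_SCAN_OPERATIONS"
ORDER BY START_TIME DESC;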

Remediation Tables

When anomalies are detected in a container, the platform can create remediation tables in the Enrichment Datastore. These tables are detailed snapshots of the affected container, capturing the state of the data at the time of anomaly detection, with additional columns for metadata and remediation purposes. Whether these tables are created depends on the remediation strategy chosen for the scan operation.

Currently, there are three types of remediation strategies:

  • None: No remediation tables will be created, regardless of anomaly detection.
  • Append: Replicate source containers using an append-first strategy.
  • Overwrite: Replicate source containers using an overwrite strategy.

Note

The naming convention for the remediation tables follows the pattern <enrichment_prefix>_remediation_<container_id>, where <enrichment_prefix> is user-defined during the Enrichment Datastore configuration and <container_id> is the identifier of the original source container.

Illustrative Table

_{ENRICHMENT_CONTAINER_PREFIX}_REMEDIATION_{CONTAINER_ID}

This remediation table is an illustrative snapshot of the "Orders" container for reference purposes.

| Name | Data Type | Description |
| --- | --- | --- |
| _QUALYTICS_SOURCE_PARTITION | STRING | The partition from the source data container. |
| ANOMALY_UUID | STRING | Unique identifier of the anomaly. |
| _QUALYTICS_APPEARS_IN | STRING | Indicates whether the record came from the target or reference container in relation to the check definition. |
| ORDERKEY | NUMBER | Unique identifier of the order. |
| CUSTKEY | NUMBER | The customer key related to the order. |
| ORDERSTATUS | CHAR | The status of the order (e.g., 'F' for 'finished'). |
| TOTALPRICE | FLOAT | The total price of the order. |
| ORDERDATE | DATE | The date when the order was placed. |
| ORDERPRIORITY | STRING | Priority of the order (e.g., 'urgent'). |
| CLERK | STRING | The clerk who took the order. |
| SHIPPRIORITY | INTEGER | The priority given to the order for shipping. |
| COMMENT | STRING | Comments related to the order. |

Note

In addition to capturing the original container fields, the platform includes three metadata columns designed to assist in the analysis and remediation process (see the example query after this list).

  • _QUALYTICS_SOURCE_PARTITION
  • ANOMALY_UUID
  • _QUALYTICS_APPEARS_IN
    • This optional column appears based on specific check rule types, such as Is Replica Of, to provide additional context.
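
The sketch below joins the illustrative remediation table above to the _failed_checks table via ANOMALY_UUID, surfacing the violated rule and any suggested fix alongside each affected record. The table names are illustrative; substitute your enrichment prefix and container id:

SELECT
    r.ORDERKEY,
    f.QUALITY_CHECK_RULE_TYPE,
    f.SUGGESTED_REMEDIATION_FIELD,
    f.SUGGESTED_REMEDIATION_VALUE,
    f.SUGGESTED_REMEDIATION_SCORE
FROM "_REMEDIATION_123" AS r
JOIN "_FAILED_CHECKS" AS f
    ON f.ANOMALY_UUID = r.ANOMALY_UUID;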

Metadata Tables

The Qualytics platform enables users to manually export metadata into the Enrichment Datastore, providing a structured approach to data analysis and management. These metadata tables are structured to reflect the evolving characteristics of data entities, primarily focusing on aspects that are subject to change.

Currently, the following assets are available for exporting:

  • _<enrichment_prefix>_export_anomalies
  • _<enrichment_prefix>_export_checks
  • _<enrichment_prefix>_export_check_templates
  • _<enrichment_prefix>_export_field_profiles

Note

These metadata tables are managed with a create-or-replace strategy: the export process creates a new table if one does not exist, and otherwise replaces it entirely, overwriting any previously exported data.

For more detailed information on exporting metadata, please refer to the export documentation.

_EXPORT_ANOMALIES Table

Contains metadata from anomalies in a distinct normalized format. This table is specifically designed to capture the mutable states of anomalies, emphasizing their status changes.

Columns

| Name | Data Type | Description |
| --- | --- | --- |
| ID | NUMBER | Unique identifier for the anomaly. |
| CREATED | TIMESTAMP | Timestamp of anomaly creation. |
| UUID | UUID | Universal Unique Identifier of the anomaly. |
| TYPE | STRING | Type of the anomaly (e.g., 'shape'). |
| STATUS | STRING | Current status of the anomaly (e.g., 'Active'). |
| GLOBAL_TAGS | STRING | Tags associated globally with the anomaly. |
| CONTAINER_ID | NUMBER | Identifier for the associated container. |
| SOURCE_CONTAINER | STRING | Name of the source container. |
| DATASTORE_ID | NUMBER | Identifier for the associated datastore. |
| SOURCE_DATASTORE | STRING | Name of the source datastore. |
| GENERATED_AT | TIMESTAMP | Timestamp when the export was generated. |
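
For example, anomalies can be summarized by datastore and status with a query like this sketch (substitute your enrichment-prefixed table name):

SELECT
    SOURCE_DATASTORE,
    STATUS,
    COUNT(*) AS anomaly_count
FROM "_EXPORT_ANOMALIES"
GROUP BY SOURCE_DATASTORE, STATUS
ORDER BY anomaly_count DESC;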

_EXPORT_CHECKS Table

Contains metadata from quality checks.

Columns

| Name | Data Type | Description |
| --- | --- | --- |
| ADDITIONAL_METADATA | STRING | JSON-formatted string containing additional metadata for the check. |
| COVERAGE | FLOAT | Represents the expected tolerance of the rule. |
| CREATED | STRING | Timestamp when the check was created. |
| DELETED_AT | STRING | Timestamp when the check was deleted. |
| DESCRIPTION | STRING | Description of the check. |
| FIELDS | STRING | Fields involved in the check, separated by commas. |
| FILTER | STRING | Criteria used to filter data when asserting the check. |
| GENERATED_AT | STRING | Indicates when the export was generated. |
| GLOBAL_TAGS | STRING | Global tags of the check, separated by commas. |
| HAS_PASSED | BOOLEAN | Boolean indicator of whether the check passed its last assertion. |
| ID | NUMBER | Unique identifier for the check. |
| INFERRED | BOOLEAN | Indicates whether the check was inferred by the platform. |
| IS_NEW | BOOLEAN | Flags if the check is new. |
| LAST_ASSERTED | STRING | Timestamp of the last assertion performed on the check. |
| LAST_EDITOR | STRING | Last editor of the check. |
| LAST_UPDATED | STRING | Timestamp when the check was last updated. |
| NUM_CONTAINER_SCANS | NUMBER | Number of container scans associated with the check. |
| PROPERTIES | STRING | Specific properties for the check, in JSON format. |
| RULE_TYPE | STRING | Type of rule applied in the check. |
| WEIGHT | FLOAT | Weight of the check. |
| DATASTORE_ID | NUMBER | Identifier of the datastore used in the check. |
| CONTAINER_ID | NUMBER | Identifier of the container used in the check. |
| TEMPLATE_ID | NUMBER | Identifier of the template associated with the check. |
| IS_TEMPLATE | BOOLEAN | Indicates whether the check is a template. |
| SOURCE_CONTAINER | STRING | Name of the container used in the check. |
| SOURCE_DATASTORE | STRING | Name of the datastore used in the check. |
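
For example, currently failing, non-deleted checks can be listed with a sketch like the following (substitute your enrichment-prefixed table name):

SELECT
    ID,
    RULE_TYPE,
    DESCRIPTION,
    LAST_ASSERTED
FROM "_EXPORT_CHECKS"
WHERE HAS_PASSED = FALSE
  AND DELETED_AT IS NULL
ORDER BY LAST_ASSERTED DESC;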

_EXPORT_CHECK_TEMPLATES Table

Contains metadata from check templates.

Columns

| Name | Data Type | Description |
| --- | --- | --- |
| ADDITIONAL_METADATA | STRING | JSON-formatted string containing additional metadata for the check. |
| COVERAGE | FLOAT | Represents the expected tolerance of the rule. |
| CREATED | STRING | Timestamp when the check was created. |
| DELETED_AT | STRING | Timestamp when the check was deleted. |
| DESCRIPTION | STRING | Description of the check. |
| FIELDS | STRING | Fields involved in the check, separated by commas. |
| FILTER | STRING | Criteria used to filter data when asserting the check. |
| GENERATED_AT | STRING | Indicates when the export was generated. |
| GLOBAL_TAGS | STRING | Global tags of the check, separated by commas. |
| ID | NUMBER | Unique identifier for the check. |
| IS_NEW | BOOLEAN | Flags if the check is new. |
| IS_TEMPLATE | BOOLEAN | Indicates whether the check is a template. |
| LAST_EDITOR | STRING | Last editor of the check. |
| LAST_UPDATED | STRING | Timestamp when the check was last updated. |
| PROPERTIES | STRING | Specific properties for the check, in JSON format. |
| RULE_TYPE | STRING | Type of rule applied in the check. |
| TEMPLATE_CHECKS_COUNT | NUMBER | The count of checks associated with the template. |
| TEMPLATE_LOCKED | BOOLEAN | Indicates whether the check template is locked. |
| WEIGHT | FLOAT | Weight of the check. |
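
For example, templates can be ranked by how widely they are applied (a sketch; substitute your enrichment-prefixed table name):

SELECT
    ID,
    RULE_TYPE,
    TEMPLATE_CHECKS_COUNT,
    TEMPLATE_LOCKED
FROM "_EXPORT_CHECK_TEMPLATES"
ORDER BY TEMPLATE_CHECKS_COUNT DESC;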

_EXPORT_FIELD_PROFILES Table

Contains metadata from field profiles.

Columns

| Name | Data Type | Description |
| --- | --- | --- |
| APPROXIMATE_DISTINCT_VALUES | FLOAT | Estimated number of distinct values in the field. |
| COMPLETENESS | FLOAT | Ratio of non-null entries to total entries in the field. |
| CONTAINER_ID | NUMBER | Identifier for the container holding the field. |
| SOURCE_CONTAINER | STRING | Name of the container holding the field. |
| CONTAINER_STORE_TYPE | STRING | Storage type of the container. |
| CREATED | STRING | Date when the field profile was created. |
| DATASTORE_ID | NUMBER | Identifier for the datastore containing the field. |
| SOURCE_DATASTORE | STRING | Name of the datastore containing the field. |
| DATASTORE_TYPE | STRING | Type of datastore. |
| ENTROPY | FLOAT | Measure of randomness in the information being processed. |
| FIELD_GLOBAL_TAGS | STRING | Global tags associated with the field. |
| FIELD_ID | NUMBER | Unique identifier for the field. |
| FIELD_NAME | STRING | Name of the field being profiled. |
| FIELD_PROFILE_ID | NUMBER | Identifier for the field profile record. |
| FIELD_QUALITY_SCORE | FLOAT | Score representing the quality of the field. |
| FIELD_TYPE | STRING | Data type of the field. |
| FIELD_WEIGHT | NUMBER | Weight assigned to the field for quality scoring. |
| GENERATED_AT | STRING | Date when the field profile was generated. |
| HISTOGRAM_BUCKETS | STRING | Distribution of data within the field represented as buckets. |
| IS_NOT_NORMAL | BOOLEAN | Indicator of whether the field data distribution is not normal. |
| KLL | STRING | Sketch summary of the field data distribution. |
| KURTOSIS | FLOAT | Measure of the tailedness of the probability distribution. |
| MAX | FLOAT | Maximum value found in the field. |
| MAX_LENGTH | FLOAT | Maximum length of string entries in the field. |
| MEAN | FLOAT | Average value of the field's data. |
| MEDIAN | FLOAT | Middle value in the field's data distribution. |
| MIN | FLOAT | Minimum value found in the field. |
| MIN_LENGTH | FLOAT | Minimum length of string entries in the field. |
| NAME | STRING | Descriptive name of the field. |
| Q1 | FLOAT | First quartile in the field's data distribution. |
| Q3 | FLOAT | Third quartile in the field's data distribution. |
| SKEWNESS | FLOAT | Measure of the asymmetry of the probability distribution. |
| STD_DEV | FLOAT | Standard deviation of the field's data. |
| SUM | FLOAT | Sum of all numerical values in the field. |
| TYPE_DECLARED | BOOLEAN | Indicator of whether the field type is explicitly declared. |
| UNIQUE_DISTINCT_RATIO | FLOAT | Ratio of unique distinct values to the total distinct values. |
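
For example, fields with low completeness can be flagged with a threshold query; the 0.95 cutoff below is arbitrary (substitute your enrichment-prefixed table name):

SELECT
    SOURCE_CONTAINER,
    FIELD_NAME,
    COMPLETENESS,
    FIELD_QUALITY_SCORE
FROM "_EXPORT_FIELD_PROFILES"
WHERE COMPLETENESS < 0.95
ORDER BY FIELD_QUALITY_SCORE;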

Diagram

The diagram below provides a visual representation of the associations between various tables in the Enrichment Datastore. It illustrates how tables can be joined to track and analyze data across different processes.

(Diagram: table associations within the Enrichment Datastore.)

Handling JSON and string splitting

In the _EXPORT_CHECKS table, ADDITIONAL_METADATA and PROPERTIES hold JSON strings, while FIELDS and GLOBAL_TAGS hold comma-separated values. The queries below flatten these columns into individual columns; the three variants use Snowflake, PostgreSQL, and MySQL syntax respectively.

Snowflake

SELECT
    PARSE_JSON(ADDITIONAL_METADATA):metadata_1::string AS Metadata1_Key1,
    PARSE_JSON(ADDITIONAL_METADATA):metadata_2::string AS Metadata2_Key1,
    PARSE_JSON(ADDITIONAL_METADATA):metadata_3::string AS Metadata3_Key1,
    -- Add more lines as needed up to MetadataN
    CONTAINER_ID,
    COVERAGE,
    CREATED,
    DATASTORE_ID,
    DELETED_AT,
    DESCRIPTION,
    SPLIT_PART(FIELDS, ',', 1) AS Field1,
    SPLIT_PART(FIELDS, ',', 2) AS Field2,
    -- Add more lines as needed up to FieldN
    FILTER,
    GENERATED_AT,
    SPLIT_PART(GLOBAL_TAGS, ',', 1) AS Tag1,
    SPLIT_PART(GLOBAL_TAGS, ',', 2) AS Tag2,
    -- Add more lines as needed up to TagN
    HAS_PASSED,
    ID,
    INFERRED,
    IS_NEW,
    IS_TEMPLATE,
    LAST_ASSERTED,
    LAST_EDITOR,
    LAST_UPDATED,
    NUM_CONTAINER_SCANS,
    PARSE_JSON(PROPERTIES):allow_other_fields::string AS Property_AllowOtherFields,
    PARSE_JSON(PROPERTIES):assertion::string AS Property_Assertion,
    PARSE_JSON(PROPERTIES):comparison::string AS Property_Comparison,
    PARSE_JSON(PROPERTIES):datetime_::string AS Property_Datetime,
    -- Add more lines as needed up to PropertyN
    RULE_TYPE,
    SOURCE_CONTAINER,
    SOURCE_DATASTORE,
    TEMPLATE_ID,
    WEIGHT
FROM "_EXPORT_CHECKS";
SELECT
    (ADDITIONAL_METADATA::json ->> 'metadata_1') AS Metadata1_Key1,
    (ADDITIONAL_METADATA::json ->> 'metadata_2') AS Metadata2_Key1,
    (ADDITIONAL_METADATA::json ->> 'metadata_3') AS Metadata3_Key1,
    -- Add more lines as needed up to MetadataN
    CONTAINER_ID,
    COVERAGE,
    CREATED,
    DATASTORE_ID,
    DELETED_AT,
    DESCRIPTION,
    (string_to_array(FIELDS, ','))[1] AS Field1,
    (string_to_array(FIELDS, ','))[2] AS Field2,
    -- Add more lines as needed up to FieldN
    FILTER,
    GENERATED_AT,
    (string_to_array(GLOBAL_TAGS, ','))[1] AS Tag1,
    (string_to_array(GLOBAL_TAGS, ','))[2] AS Tag2,
    -- Add more lines as needed up to TagN
    HAS_PASSED,
    ID,
    INFERRED,
    IS_NEW,
    IS_TEMPLATE,
    LAST_ASSERTED,
    LAST_EDITOR,
    LAST_UPDATED,
    NUM_CONTAINER_SCANS,
    (PROPERTIES::json ->> 'allow_other_fields') AS Property_AllowOtherFields,
    (PROPERTIES::json ->> 'assertion') AS Property_Assertion,
    (PROPERTIES::json ->> 'comparison') AS Property_Comparison,
    (PROPERTIES::json ->> 'datetime_') AS Property_Datetime,
    -- Add more lines as needed up to PropertyN
    RULE_TYPE,
    SOURCE_CONTAINER,
    SOURCE_DATASTORE,
    TEMPLATE_ID,
    WEIGHT
FROM "_EXPORT_CHECKS";
SELECT
    (ADDITIONAL_METADATA->>'$.metadata_1') AS Metadata1_Key1,
    (ADDITIONAL_METADATA->>'$.metadata_2') AS Metadata2_Key1,
    (ADDITIONAL_METADATA->>'$.metadata_3') AS Metadata3_Key1,
    -- Add more lines as needed up to MetadataN
    CONTAINER_ID,
    COVERAGE,
    CREATED,
    DATASTORE_ID,
    DELETED_AT,
    DESCRIPTION,
    SUBSTRING_INDEX(FIELDS, ',', 1) AS Field1,
    -- Add more lines as needed up to FieldN
    SUBSTRING_INDEX(GLOBAL_TAGS, ',', 1) AS Tag1,
    -- Add more lines as needed up to TagN
    HAS_PASSED,
    ID,
    INFERRED,
    IS_NEW,
    IS_TEMPLATE,
    LAST_ASSERTED,
    LAST_EDITOR,
    LAST_UPDATED,
    NUM_CONTAINER_SCANS,
    (PROPERTIES->>'$.allow_other_fields') AS Property_AllowOtherFields,
    (PROPERTIES->>'$.assertion') AS Property_Assertion,
    (PROPERTIES->>'$.comparison') AS Property_Comparison,
    (PROPERTIES->>'$.datetime_') AS Property_Datetime,
    -- Add more lines as needed up to PropertyN
    RULE_TYPE,
    SOURCE_CONTAINER,
    SOURCE_DATASTORE,
    TEMPLATE_ID,
    WEIGHT
FROM "_EXPORT_CHECKS";

Usage Notes

  • Both the metadata tables and the remediation tables are designed to be ephemeral, so treat them as temporary datasets. Move their contents to a permanent dataset for long-term storage and reporting (see the example query after these notes).
  • The ANOMALY_UUID in the remediation tables links each record to the detailed data in the _failed_checks enrichment table. This connection not only shows the number of failed checks but also provides insight into each one, such as the nature of the issue, the type of rule violated, and the associated check tags. When available, suggested remediation actions, including suggested field modifications and values, are presented alongside a score indicating the suggested action's potential effectiveness. This information helps users better understand the specifics of each anomaly related to the remediation tables.
  • By default, the Qualytics platform captures and writes a maximum of 10 rows of data per anomaly for both the _source_records enrichment table and the remediation tables. To adjust this limit, use the enrichment_source_record_limit parameter in the Scan Operation settings; it accepts a minimum value of 10 and has no upper bound. Note that if an anomaly is associated with fewer records than the limit, the platform writes only the records in which the anomaly was actually detected.
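
Because the export tables are replaced on every export and the tables above are ephemeral, one way to retain history is to append each snapshot to a permanent table, using GENERATED_AT to distinguish snapshots. A sketch (the checks_history table name is hypothetical, and CREATE TABLE ... AS syntax varies by warehouse):

-- One-time setup: clone the export table's structure without copying rows.
CREATE TABLE checks_history AS
SELECT * FROM "_EXPORT_CHECKS" WHERE 1 = 0;

-- After each export, append the new snapshot; GENERATED_AT identifies it.
INSERT INTO checks_history
SELECT * FROM "_EXPORT_CHECKS";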

Last update: April 27, 2024