Glossary

Anomaly

  • Something that deviates from the standard, normal, or expected. This can be in the form of a single data point, a record, or a batch of data.

Accuracy

  • The data represents the real-world values it is expected to model.

Catalog Operation

  • An Operation used to read fundamental metadata from a Datastore; this metadata is required for the proper functioning of subsequent Operations such as Profile, Hash, and Scan.

Comparison

  • An evaluation to determine whether the structure and content of the source and target Datastores match.

Comparison Runs

  • An action that performs a Comparison.

Completeness

  • Required fields are fully populated.

Conformity

  • Alignment of the content to the required standards, schemas, and formats.

Connectors

  • Components that can be easily connected to and used to integrate with other applications and databases. Common uses include sending and receiving data.

Info

We can connect to any Apache Spark-accessible datastore. If you have a datastore we don’t yet support, talk to us! We currently support: Files (CSV, JSON, XLSX, Parquet) on Object Storage (S3, Azure Blob, GCS); ETL/ELT Providers (Fivetran, Stitch, Airbyte, Matillion, and any of their connectors); Data Warehouses (BigQuery, Snowflake, Redshift); Data Pipelining (Airflow, DBT, Prefect); Databases (MySQL, PostgreSQL, MSSQL, SQLite, etc.); and any other JDBC source.

Consistency

  • The value is the same across all datastores within the organization.

Container (of a Datastore)

  • The uniquely named abstractions within a Datastore that hold data adhering to a known schema. The Containers within an RDBMS are tables, the Containers in a filesystem are well-formatted files, etc.

Data-at-rest

  • Data that is stored in a database, warehouse, file system, data lake, or other datastore.

Data Drift

  • Changes in a data set’s properties or characteristics over time.

Data-in-flight

  • Data that is on the move, being transported from one location to another, such as through a message queue, API, or other pipeline.

Data Lake

  • A centralized repository that allows you to store all your structured and unstructured data at any scale.

Data Quality

  • Ensuring data is free from errors, including duplicates, inaccuracies, inappropriate fields, irrelevant data, missing elements, non-conforming data, and poor data entry.

Data Quality Check

  • Also known as a "Check," an expression regarding the values of a Container that can be evaluated to determine whether the actual values are expected or not.
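
For illustration only (this is a minimal sketch, not Qualytics' actual check syntax; the record fields such as `order_total` are made up), a Check can be thought of as a predicate evaluated against a Container's values:

```python
# Illustrative sketch only; not Qualytics' actual check syntax.
# A "check" is essentially a predicate over a Container's values.
records = [
    {"order_id": 1, "order_total": 25.00},
    {"order_id": 2, "order_total": -3.10},   # unexpected: negative total
]

def check_order_total_non_negative(record):
    """Example check: order_total is expected to be >= 0."""
    return record["order_total"] >= 0

# Records whose actual values are not as expected would surface as anomalies.
anomalies = [r for r in records if not check_order_total_non_negative(r)]
print(anomalies)
```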

Datastore

  • Where data is persisted in a database, file system, or other connected retrieval system. You can read more in What is a Datastore.

Data Warehouse

  • A system that aggregates data from different sources into a single, central, consistent datastore to support data analysis, data mining, artificial intelligence (AI), and machine learning.

Distinctness (of a Field)

  • The fraction of distinct values (those appearing at least once) to total values that appear in a Field.
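
As a rough sketch of that ratio (the field and its values below are hypothetical):

```python
# Sketch of the distinctness ratio: distinct values / total values in a Field.
field_values = ["US", "CA", "US", "MX", "CA", "US"]

distinctness = len(set(field_values)) / len(field_values)
print(distinctness)  # 3 distinct values out of 6 total -> 0.5
```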

Enrichment Datastore

  • Additional properties that are added to a data set to enhance its meaning. Qualytics enrichment includes whether a record is anomalous, what caused it to be an anomaly, what characteristics it was expected to have, and flags that allow other systems to act upon the data.
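
As a purely hypothetical sketch of the idea (the field names below are invented and do not reflect Qualytics' actual enrichment schema), an enriched record carries context that downstream systems can act upon:

```python
# Hypothetical sketch only; field names are made up and do NOT reflect
# Qualytics' actual enrichment schema.
source_record = {"order_id": 2, "order_total": -3.10}

enrichment = {
    "is_anomalous": True,                          # whether the record is anomalous
    "failed_check": "order_total must be >= 0",    # what caused the anomaly
    "expected": {"order_total": ">= 0"},           # expected characteristics
    "quarantine": True,                            # flag other systems can act upon
}

enriched_record = {**source_record, **enrichment}
print(enriched_record)
```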

Favorite

  • Users can mark instances of an abstraction (Field, Container, Datastore, Check, Anomaly, etc.) as a personalized favorite to ensure it ranks higher in default ordering and is prioritized in other personalized views and workflows.

Compute Daemon

  • An application that protects a system from contamination by its inputs, reducing the likelihood of bad data entering from an outside source. The Compute Daemon quarantines problematic data, allowing the user to act upon quarantined items.

Incremental Identifier

  • A Field that can be used to group the records in a Table Container into distinct, ordered Qualytics Partitions in support of incremental operations upon those partitions (see the sketch after this list). If the Incremental Identifier is:

  • a whole number, then all records with the same partition_id value are considered part of the same partition

  • a float or timestamp, then all records between two defined values are considered part of the same partition (the defining values are set by incremental scan/profile business logic)

  • Because Qualytics Partitions are required to support Incremental Operations, an Incremental Identifier is required for a Table Container to support incremental Operations.
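
A minimal sketch of the whole-number case (the column name `partition_id` and the records are hypothetical):

```python
# Sketch of whole-number partitioning: records sharing a partition_id
# value belong to the same Qualytics Partition.
from collections import defaultdict

records = [
    {"id": 101, "partition_id": 1},
    {"id": 102, "partition_id": 1},
    {"id": 103, "partition_id": 2},
]

partitions = defaultdict(list)
for record in records:
    partitions[record["partition_id"]].append(record)

# An incremental operation can then process only partitions added since
# the last run, rather than re-reading the whole Table Container.
for partition_id, rows in sorted(partitions.items()):
    print(partition_id, len(rows))
```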

Incremental Scan Operation

  • A Scan Operation where only new records (inserted since the last Scan Operation) are analyzed. The underlying Container must support determining which records are new for incremental scanning to be a valid option.

Inference Engine

  • After the Compute Daemon gathers all the metadata generated by a profiling operation, it feeds that metadata into our Inference Engine. The Inference Engine then initiates a "true machine learning" (specifically, Inductive Learning) process whereby the available customer data is partitioned into a training set and a testing set. The engine applies numerous machine learning models and techniques to the training data in an effort to discover well-fitting data quality constraints. Those inferred constraints are then filtered by testing them against the held-out testing set, and only those that assert true are converted to inferred data quality Checks.
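
As a loose sketch of that train/test filtering idea (not the actual engine, which applies many models and techniques; the data here is randomly generated), consider inferring a simple range constraint:

```python
# Loose sketch of inductive constraint inference: learn a candidate
# constraint from a training split, keep it only if it holds on the
# held-out testing split.
import random

values = [random.gauss(100, 5) for _ in range(1_000)]
random.shuffle(values)
train, test = values[:800], values[800:]

# Candidate constraint inferred from training data: values stay within a range.
low, high = min(train), max(train)

# Keep the inferred check only if it asserts true on the testing set.
if all(low <= v <= high for v in test):
    print(f"Inferred check kept: value between {low:.2f} and {high:.2f}")
else:
    print("Candidate constraint rejected; it failed on held-out data")
```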

Metadata

  • Data about other data, including descriptions and additional information.

Object Storage

  • A type of data storage used for handling large amounts of unstructured data managed as objects.

Operation

  • The asynchronous (often long-running) tasks that operate on Datastores are collectively referred to as "Operations." Examples include Catalog, Profile, Hash, and Scan.

Partition Identifier

  • A Field that can be used by Spark to group the records in a DataFrame into smaller sets that fit within our Spark workers’ memory. The ideal Partition Identifier is an Incremental Identifier of type datetime, since it can serve as both, but we identify alternatives when one is not available.
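
For instance, a minimal PySpark sketch (assuming pyspark is installed; the column name `updated_at` and the data are made up) showing how a datetime Field can range-partition a DataFrame into smaller, memory-sized slices:

```python
# Sketch: using a datetime Field to split records into smaller, range-based
# partitions that each fit in a Spark worker's memory. Data is made up.
from datetime import datetime, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-identifier-sketch").getOrCreate()

rows = [(i, datetime(2024, 1, 1) + timedelta(minutes=i)) for i in range(10_000)]
df = spark.createDataFrame(rows, ["id", "updated_at"])

# Range-partition on the candidate Partition Identifier so each task
# handles a bounded, ordered slice of records.
partitioned = df.repartitionByRange(8, "updated_at")
print(partitioned.rdd.getNumPartitions())
```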

Pipeline

  • A workflow that processes and moves data between systems.

Precision

  • The data has the resolution that is expected. How tightly can you define your data?

Profile Operation

  • An operation that generates metadata describing the characteristics of your actual data values.

Profiling

  • The process of collecting statistics on the characteristics of a dataset involving examining, analyzing, and reviewing the data.

Proprietary Algorithms

  • A procedure utilizing a combination of processes, tools, or systems of interrelated connections that are the property of a business or individual in order to solve a problem.

Quality Score

  • A measure of data quality calculated at the Field, Container, and Datastore level. Quality Scores are recorded as time series, enabling you to track movement over time. You can read more in Quality Scoring.

Qualytics App

  • Also known as the "App," this is the user interface for our Product, delivered as a web application.

Qualytics Deployment

  • A single instance of our product (the Kubernetes cluster, Postgres database, Hub/App/Compute Daemon pods, etc.).

Qualytics Compute Daemon

  • Also known as the "Compute Daemon," this is the layer of our Product that connects to Datastores and directly operates on users’ data.

Qualytics Implementation

  • A customer’s Deployment plus any associated integrations.

Qualytics Surveillance Hub

  • Also known as the "Hub," this is the layer of our Product that exposes an Application Programming Interface (API).

Qualytics Partition

  • The smallest grouping of records that can be incrementally processed. For DFS datastores, each file is a Qualytics Partition. For JDBC datastores, partitions are defined by each table’s incremental identifier values.

Record (of a Container)

  • A distinct set of values for all Fields defined for a Container (e.g., a row of a table).

Schema

  • The organization of data in a datastore. This could be the columns of a table, the header of a CSV file, the fields in a JSON file, or other structural constraints.

Schema Differences

  • Differences in the organization of information between two datastores that are supposed to hold the same content.

Source

  • The origin of data in a pipeline, migration, or other ELT/ETL process. It’s where data gets extracted.

Tag

  • Users can assign Tags to Datastores, Profiles (Files, Tables, Containers), Checks, and Anomalies. A Tag can include a Description and a Weight; the weight value directly correlates with the level of importance, where a higher weight indicates higher significance.

Target

  • The destination of data in a pipeline, migration, or other ELT/ETL process. It’s where data gets loaded.

Third-party data

  • Data acquired from a source outside of your company which may not be controlled by the same data quality processes. You may not have the same level of confidence in the data and it may not be as trustworthy as internally vetted datasets.

Timeliness

  • Data is available when it is expected. Timeliness can be calculated as the time between when information should be available and when it is actually available.
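
A toy illustration of that calculation (the timestamps are made up):

```python
# Toy illustration: timeliness as the lag between when data should be
# available and when it actually arrived.
from datetime import datetime

expected_at = datetime(2024, 4, 1, 6, 0)   # data due at 06:00
actual_at = datetime(2024, 4, 1, 6, 45)    # data actually landed at 06:45

lag = actual_at - expected_at
print(lag)  # 0:45:00 -> data was 45 minutes late
```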

Volumetrics

  • Data has the same size and shape across similar cycles. Volumetrics comprises statistics about the size of a data set, including calculations or predictions on its rate of change over time.

Weight

  • The weight value directly correlates with the level of importance, where a higher weight indicates higher significance.

Last update: April 7, 2024