
Aggregation Comparison

Definition

Verifies that the specified comparison operator evaluates to true when applied to two aggregation expressions.

In-Depth Overview

The Aggregation Comparison is a rule that allows for the dynamic analysis of aggregations across different datasets. It empowers users to establish data integrity by ensuring that aggregate values meet expected comparisons, whether they are totals, averages, counts, or any other aggregated metric.

By setting a comparison between aggregates from potentially different tables, or even different source datastores, this rule confirms that relationships between data points adhere to business logic or historical data patterns. This is particularly useful for validating interrelated financial reports and summary metrics, or for monitoring the consistency of data ingestion over time.
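For instance, a check of this kind can be pictured as a single SQL comparison between two aggregates. The query below is only an illustrative sketch, assuming the TPC-H-style ORDERS and LINEITEM tables used in the examples later on this page; the rule itself is configured through its properties rather than written as SQL.

-- Illustrative sketch only: every order should have at least one line item,
-- so the row count of LINEITEM should be greater than or equal to that of ORDERS.
select
    (select count(*) from lineitem) >=
    (select count(*) from orders) as comparison_holds;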

Field Scope

Calculated: The rule automatically identifies the fields involved, without requiring explicit field selection.

General Properties

Filter: Allows the targeting of specific data based on conditions
Coverage Customization: Allows adjusting the percentage of records that must meet the rule's conditions

The filter allows you to define a subset of data upon which the rule will operate.

It requires a valid Spark SQL expression that defines the criteria rows must meet to be included. Because the expression is applied directly to the Spark DataFrame, traditional SQL constructs such as the WHERE keyword are not supported.

Examples

Direct Conditions

Simply specify the condition you want to be met.

Correct usage
O_TOTALPRICE > 1000
C_MKTSEGMENT = 'BUILDING'
Incorrect usage
WHERE O_TOTALPRICE > 1000
WHERE C_MKTSEGMENT = 'BUILDING'

Combining Conditions

Combine multiple conditions using logical operators like AND and OR.

Correct usage
O_ORDERPRIORITY = '1-URGENT' AND O_ORDERSTATUS = 'O'
(L_SHIPDATE = '1998-09-02' OR L_RECEIPTDATE = '1998-09-01') AND L_RETURNFLAG = 'R'
Incorrect usage
WHERE O_ORDERPRIORITY = '1-URGENT' AND O_ORDERSTATUS = 'O'
O_TOTALPRICE > 1000, O_ORDERSTATUS = 'O'

Utilizing Functions

Leverage Spark SQL functions to refine and enhance your conditions.

Correct usage
RIGHT(
    O_ORDERPRIORITY,
    LENGTH(O_ORDERPRIORITY) - INSTR(O_ORDERPRIORITY, '-')
) = 'URGENT'
LEVENSHTEIN(C_NAME, 'Supplier#000000001') < 7
Incorrect usage
RIGHT(
    O_ORDERPRIORITY,
    LENGTH(O_ORDERPRIORITY) - CHARINDEX('-', O_ORDERPRIORITY)
) = 'URGENT'
EDITDISTANCE(C_NAME, 'Supplier#000000001') < 7

Using scan-time variables

To refer to the current dataframe being analyzed, use the reserved word _qualytics_${containerId} where ${containerId} should be replaced with the numeric identifier of the container.

Correct usage
O_ORDERSTATUS IN (
    SELECT DISTINCT O_ORDERSTATUS
    FROM _qualytics_42
    WHERE O_TOTALPRICE > 1000
)
Incorrect usage
O_ORDERSTATUS IN (
    SELECT DISTINCT O_ORDERSTATUS
    FROM ORDERS
    WHERE O_TOTALPRICE > 1000
)

While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.

Important Note on _qualytics_${containerId}

The _qualytics_${containerId} keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using _qualytics_${containerId} may not encompass all entries from the target container.

Specific Properties

Facilitates the comparison between a target aggregate metric and a reference aggregate metric across different datasets.

Target Aggregation: Specifies the aggregation expression to evaluate
Comparison: Selects the comparison operator (e.g., greater than, less than)
Datastore: Identifies the source datastore for the reference aggregation
Table/File: Specifies the table or file for the reference aggregation
Reference Aggregation: Defines the reference aggregation expression to compare against
Reference Filter: Applies a filter to the reference aggregation if necessary
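
For illustration, the properties for the ORDERS/LINEITEM scenario used in the example below might be filled in roughly as follows. The datastore name and the reference filter shown here are hypothetical placeholders, not values required by the rule.

Target Aggregation:    SUM(O_TOTALPRICE)
Comparison:            Greater than
Datastore:             tpch_warehouse   (hypothetical datastore containing LINEITEM)
Table/File:            LINEITEM
Reference Aggregation: SUM(L_DISCOUNT)
Reference Filter:      L_DISCOUNT > 0   (optional; any valid Spark SQL expression)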

Info

Refer to the Filter Guide in the General Properties topic for examples of valid Spark SQL expressions.

Anomaly Types

Record: Flag inconsistencies at the row level
Shape: Flag inconsistencies in the overall patterns and distributions of a field

Example

Objective: Ensure that the total price of orders from the ORDERS table is always greater than the total discount given in the LINEITEM table.

Sample Data

O_ORDERKEY   TOTAL_PRICE (ORDERS)   TOTAL_DISCOUNT (LINEITEM)
1            50000                  1000
2            25000                  30000
3            75000                  2000
4            15000                  500

Anomaly Explanation

In the sample data above, the entry with O_ORDERKEY 2 does not satisfy the rule because its TOTAL_PRICE (25000) is not greater than its TOTAL_DISCOUNT (30000), violating the business logic that the total price should always exceed the total discount.

graph TD
A[Start] --> B[Retrieve Aggregated Values]
B --> C{Do the target and reference aggregations meet the comparison criteria?}
C -->|Yes| D[End]
C -->|No| E[Mark as Anomalous]
E --> D
-- An illustrative SQL query demonstrating the rule applied to example datasets
with target as (
    select 
        o_orderkey, 
        sum(o_totalprice) as totalorderprice
    from 
        orders
    group by 
        o_orderkey
),
reference as (
    select 
        l_orderkey, 
        sum(l_discount) as totaldiscount
    from 
        lineitem
    group by 
        l_orderkey
)
select 
    o.o_orderkey, 
    o.totalorderprice, 
    l.totaldiscount
from 
    target o
join 
    reference l on o.o_orderkey = l.l_orderkey
where 
    o.totalorderprice <= l.totaldiscount;

Potential Violation Messages

Shape Anomaly

TOTAL_PRICE is not greater than TOTAL_DISCOUNT for O_ORDERKEY.

