Aggregation Comparison
Definition
Verifies that the specified comparison operator evaluates true when applied to two aggregation expressions.
In-Depth Overview
The Aggregation Comparison
is a rule that allows for the dynamic analysis of aggregations across different datasets. It empowers users to establish data integrity by ensuring that aggregate values meet expected comparisons, whether they are totals, averages, counts, or any other aggregated metric.
By setting a comparison between aggregates from potentially different tables or even source datastores, this rule confirms that relationships between data points adhere to business logic or historical data patterns. This is particularly useful when trying to validate interrelated financial reports, summary metrics, or when monitoring the consistency of data ingestion over time.
Field Scope
Calculated: The rule automatically identifies the fields involved, without requiring explicit field selection.
General Properties
Name | Supported |
---|---|
Filter Allows the targeting of specific data based on conditions |
|
Coverage Customization Allows adjusting the percentage of records that must meet the rule's conditions |
The filter allows you to define a subset of data upon which the rule will operate.
It requires a valid Spark SQL expression that determines the criteria rows in the DataFrame should meet. This means the expression specifies which rows the DataFrame should include based on those criteria. Since it's applied directly to the Spark DataFrame, traditional SQL constructs like WHERE clauses are not supported.
Examples
Direct Conditions
Simply specify the condition you want to be met.
Combining Conditions
Combine multiple conditions using logical operators like AND
and OR
.
Correct usage
Incorrect usage
Utilizing Functions
Leverage Spark SQL functions to refine and enhance your conditions.
Correct usage
Incorrect usage
Using scan-time variables
To refer to the current dataframe being analyzed, use the reserved word _qualytics_${containerId}
where ${containerId}
should be replaced with the numeric identifier of the container.
Correct usage
Incorrect usage
While subqueries can be useful, their application within filters in our context has limitations. For example, directly referencing other containers or the broader target container in such subqueries is not supported. Attempting to do so will result in an error.
Important Note on _qualytics_${containerId}
The _qualytics_${containerId}
keyword refers to the dataframe that's currently under examination. In the context of a full scan, this variable represents the entire target container. However, during incremental scans, it only reflects a subset of the target container, capturing just the incremental data. It's crucial to recognize that in such scenarios, using _qualytics_${containerId}
may not encompass all entries from the target container.
Specific Properties
Facilitates the comparison between a target
aggregate metric and a reference
aggregate metric across different datasets.
Name | Description |
---|---|
Target Aggregation |
Specifies the aggregation expression to evaluate |
Comparison |
Select the comparison operator (e.g., greater than, less than, etc.) |
Datastore |
Identifies the source datastore for the reference aggregation |
Table/File |
Specifies the table or file for the reference aggregation |
Reference Aggregation |
Defines the reference aggregation expression to compare against |
Reference Filter |
Applies a filter to the reference aggregation if necessary |
Info
Refers to the Filter Guide in the General Properties topic for examples of valid Spark SQL expressions.
Anomaly Types
Type | Supported |
---|---|
Record Flag inconsistencies at the row level |
|
Shape Flag inconsistencies in the overall patterns and distributions of a field |
Example
Objective: Ensure that the total price of orders from the ORDERS
table is always greater than the total discount given in the LINEITEM
table.
Sample Data
O_ORDERKEY | TOTAL_PRICE (ORDERS) | TOTAL_DISCOUNT (LINEITEM) |
---|---|---|
1 | 50000 | 1000 |
2 | 25000 | 30000 |
3 | 75000 | 2000 |
4 | 15000 | 500 |
Anomaly Explanation
In the sample data above, the entries with O_ORDERKEY
2 and 4 do not satisfy the rule because the TOTAL_PRICE
for these orders is not greater than the TOTAL_DISCOUNT
, which violates the business logic that the total price should always exceed the total discount.
graph TD
A[Start] --> B[Retrieve Aggregated Values]
B --> C{Does Target and Reference aggregations meet comparison criteria?}
C -->|Yes| D[End]
C -->|No| E[Mark as Anomalous]
E --> D
-- An illustrative SQL query demonstrating the rule applied to example datasets
with target as (
select
o_orderkey,
sum(o_totalprice) as totalorderprice
from
orders
group by
o_orderkey
),
reference as (
select
l_orderkey,
sum(l_discount) as totaldiscount
from
lineitem
group by
l_orderkey
)
select
o.o_orderkey,
o.totalorderprice,
l.totaldiscount
from
target o
join
reference l on o.o_orderkey = l.l_orderkey
where
o.totalorderprice <= l.totaldiscount;
Potential Violation Messages
Shape Anomaly
TOTAL_PRICE
is not greater than TOTAL_DISCOUNT
for O_ORDERKEY
.