Example
There are two functions exposed by the GATE module: summarize and detect_drift. summarize computes partition summaries for a dataframe, and detect_drift detects whether a new partition has drifted.
In this example, we'll demonstrate how to use GATE to detect drift in a small synthetic dataset.
Dataset Creation
Our synthetic dataset will be created in Pandas. The partition key will be date. There will be 10 partitions, each with 10,000 rows and 3 columns. The last partition will have a different distribution for one column than the first 9 partitions.
import numpy as np
import pandas as pd

# create example date range
date_range = pd.date_range(start="2022-01-01", periods=10, freq="D")

# create example data for each column
int_col = np.random.randint(low=0, high=10, size=10000)
float_col = np.random.normal(loc=0, scale=1, size=10000)
string_col = np.random.choice(["A", "B", "C"], size=10000)

# combine data into a DataFrame, one partition per date
df_elems = []
for date in date_range:
    if date != date_range[-1]:
        # the first 9 partitions share the same column values
        date_data = pd.DataFrame(
            {
                "date": [date] * len(int_col),
                "int_col": int_col,
                "float_col": float_col,
                "string_col": string_col,
            }
        )
    else:
        # Change the distribution of the int column
        date_data = pd.DataFrame(
            {
                "date": [date] * len(int_col),
                "int_col": np.random.randint(low=10, high=20, size=10000),
                "float_col": float_col,
                "string_col": string_col,
            }
        )
    df_elems.append(date_data)

df = pd.concat(df_elems).reset_index(drop=True)
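Before summarizing, it can help to confirm that the dataset has the intended shape. The sketch below rebuilds the same kind of data in a more compact form and checks the row count and int_col range of each partition. The seeded generator and the make_partition helper are illustrative choices for this sketch and are not part of GATE.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make this sketch reproducible
dates = pd.date_range(start="2022-01-01", periods=10, freq="D")
n_rows = 10_000

# Shared column values, reused by every partition as in the example above
int_col = rng.integers(low=0, high=10, size=n_rows)
float_col = rng.normal(loc=0, scale=1, size=n_rows)
string_col = rng.choice(["A", "B", "C"], size=n_rows)


def make_partition(date, ints):
    """Hypothetical helper: build the rows for a single partition."""
    return pd.DataFrame(
        {
            "date": date,  # scalar is broadcast to every row
            "int_col": ints,
            "float_col": float_col,
            "string_col": string_col,
        }
    )


parts = [make_partition(d, int_col) for d in dates[:-1]]
# The last partition draws int_col from [10, 20) instead of [0, 10)
parts.append(make_partition(dates[-1], rng.integers(low=10, high=20, size=n_rows)))
df = pd.concat(parts).reset_index(drop=True)

# Row count and int_col range for each partition
check = df.groupby("date")["int_col"].agg(["count", "min", "max"])
print(check)
```

Only the last row of the printed table should show an int_col range of 10 or above; every other partition stays within 0 to 9.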
summarize
The summarize function computes partition summaries for a dataframe. In addition to a Pandas dataframe of raw data, it accepts the partition key and a list of columns in the dataframe to compute statistics for. Alternatively, you can pass a list of previous partition summaries instead of the partition key and column list, and GATE will infer the partition key and columns from those summaries.
The summarize function returns a list of Summary objects. Each Summary object contains the partition summary and other metadata, and has a __str__ method that prints the summary in a human-readable format.
from gate import summarize
summaries = summarize(
    df, partition_key="date", columns=["int_col", "float_col", "string_col"]
)
# len(summaries) == 10 because there are 10 distinct partitions
print(summaries[-1])
"""
        date      column  coverage       mean  num_unique_values  occurrence_ratio        p50        p95
0 2022-01-10   float_col       1.0   0.015739                NaN               NaN   0.019152   1.665352
1 2022-01-10     int_col       1.0  14.520700               10.0            0.1032  15.000000  19.000000
2 2022-01-10  string_col       1.0        NaN                3.0            0.3411        NaN        NaN
"""
Note
You can access the summary data as a Pandas dataframe with the value attribute of the Summary object (i.e., summaries[-1].summary).
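To make the columns of the printed summary more concrete, the following sketch computes comparable statistics for a single partition with plain Pandas. It is not GATE's implementation. The sketch_summary helper is hypothetical, and the definitions used here are assumptions inferred from the output above: coverage is taken to be the fraction of non-null values, and occurrence_ratio the share of the most frequent value.

```python
import numpy as np
import pandas as pd


def sketch_summary(partition: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Illustrative per-column statistics for one partition (not GATE's code)."""
    rows = []
    for col in columns:
        s = partition[col]
        numeric = pd.api.types.is_numeric_dtype(s)
        is_float = pd.api.types.is_float_dtype(s)
        rows.append(
            {
                "column": col,
                # Assumed definition: fraction of non-null values
                "coverage": s.notna().mean(),
                "mean": s.mean() if numeric else np.nan,
                "p50": s.quantile(0.50) if numeric else np.nan,
                "p95": s.quantile(0.95) if numeric else np.nan,
                # Counting distinct values is skipped for float columns
                "num_unique_values": np.nan if is_float else s.nunique(),
                # Assumed definition: share of the most frequent value
                "occurrence_ratio": (
                    np.nan if is_float else s.value_counts(normalize=True).iloc[0]
                ),
            }
        )
    return pd.DataFrame(rows).set_index("column")


rng = np.random.default_rng(0)
partition = pd.DataFrame(
    {
        "int_col": rng.integers(low=10, high=20, size=10_000),
        "float_col": rng.normal(loc=0, scale=1, size=10_000),
        "string_col": rng.choice(["A", "B", "C"], size=10_000),
    }
)
result = sketch_summary(partition, ["int_col", "float_col", "string_col"])
print(result)
```

The pattern of missing values mirrors the GATE output: numeric statistics are NaN for the string column, and the count-based statistics are NaN for the float column.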
detect_drift
The detect_drift function detects whether a new partition has drifted. It accepts a new partition summary and a list of previous partition summaries, and returns a DriftResult object. The DriftResult object has a __str__ method that prints the drift result in a human-readable format.
from gate import detect_drift
drift_result = detect_drift(summaries[-1], summaries[:-1])
print(drift_result)
"""
Drift score: 6.3246 (100.00% percentile)
Top drifted columns:
            statistic   z-score
column
int_col           p95  2.846050
float_col         p95  0.000002
string_col   coverage  0.000000
"""
The z-score is the number of standard deviations by which a statistic of the new partition deviates from the mean. In this case, int_col correctly has a high z-score. We recommend focusing on z-scores with an absolute value above 2.5 (that is, > 2.5 or < -2.5) when looking for drift.
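As a worked illustration of the z-score and the 2.5 cutoff, here is a minimal sketch in NumPy. It assumes the z-score is taken over a statistic's values across all partitions, including the new one, using the sample standard deviation; GATE's actual computation may differ. The z_scores helper is hypothetical, the value 19.0 comes from the summary printed earlier, and 9.0 for the earlier partitions is an assumed value.

```python
import numpy as np

# Hypothetical p95 values of int_col per partition: nine unchanged
# partitions followed by the shifted one.
p95_by_partition = np.array([9.0] * 9 + [19.0])


def z_scores(values: np.ndarray) -> np.ndarray:
    """Z-score of each value against the mean and sample std of all values."""
    std = values.std(ddof=1)
    if std == 0:
        # No variation across partitions, so nothing deviates
        return np.zeros_like(values)
    return (values - values.mean()) / std


z = z_scores(p95_by_partition)
new_partition_z = z[-1]
print(round(new_partition_z, 3))  # 2.846

THRESHOLD = 2.5  # cutoff recommended in the text above
print(abs(new_partition_z) > THRESHOLD)  # True
```

Under these assumptions, nine identical values and one outlier give a z-score of 9 / sqrt(10), or about 2.846, for the outlier, which agrees with the int_col value in the output above.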
If you want to cluster correlated columns, you can pass cluster=True into detect_drift. The resulting DriftResult object has a clustering attribute that contains the clusters.
Note
The list of previous partition summaries must have at least one element. Best results are achieved when there are at least 5 previous partition summaries.
Real Dataset Example
For an end-to-end example on a real weather dataset, see the example notebook in the GitHub repository.