API Reference
summarize(df, columns=[], embedding_column_map={}, partition_key='', previous_summaries=[])
This function computes partition-wide summary statistics for the given columns. df can have multiple partitions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | pd.DataFrame | Dataframe to summarize. | required |
columns | typing.List[str] | List of columns to generate summary statistics for. Must be a subset of df.columns. If empty, previous_summaries must not be empty. | [] |
embedding_column_map | typing.Dict[str, str] | Dictionary mapping each embedding key column to its embedding value column. Keys and values must be in df.columns. If empty, previous_summaries must not be empty. | {} |
partition_key | str | Name of the column to partition the dataframe by. Must be in df.columns. Can be empty if no partitioning is desired, or if the dataframe represents a single partition. If empty, previous_summaries must not be empty. | '' |
previous_summaries | typing.List[Summary] | List of Summary objects representing previous partition summaries. | [] |
Returns:
Type | Description |
---|---|
typing.List[Summary] | List of Summary objects, one per distinct partition found in df. |
Raises:
Type | Description |
---|---|
ValueError | If … |
ValueError | If … |
ValueError | If … |
ValueError | If … |
ValueError | If any column in … |
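A minimal usage sketch; the import path is not shown in this reference, and the dataframe, column names, and values below are illustrative only:

```python
import pandas as pd

# Assumes summarize has been imported from the package; the exact import path
# is not shown in this reference.

# Toy dataframe with a daily partition column and two feature columns.
df = pd.DataFrame(
    {
        "date": ["2023-01-01", "2023-01-01", "2023-01-02", "2023-01-02"],
        "price": [9.99, 12.50, 10.25, 11.75],
        "category": ["a", "b", "a", "a"],
    }
)

# One Summary is returned per distinct value of the partition key.
summaries = summarize(df, columns=["price", "category"], partition_key="date")

# A later partition can be summarized incrementally: with previous_summaries
# provided, columns and partition_key may be left empty.
new_partition = pd.DataFrame(
    {"date": ["2023-01-03", "2023-01-03"], "price": [10.10, 13.40], "category": ["b", "b"]}
)
summaries += summarize(new_partition, previous_summaries=summaries)
```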
compute_embeddings(column, column_type)
Computes embeddings for a Series with the huggingface/transformers library. We use the clip-ViT-B-32 model. This is an optional function; we recommend you compute embeddings yourself.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | pd.Series | Series to compute embeddings for. Must be of string type. Can contain either paths to files or text. | required |
column_type | str | Type of the column. Must be "text" or "image". | required |
Returns:
Type | Description |
---|---|
pd.Series | Series of embeddings to add to your DataFrame. |
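A short sketch of attaching embeddings before summarizing; the import path and column names here are illustrative only:

```python
import pandas as pd

# Assumes compute_embeddings has been imported from the package.
df = pd.DataFrame({"caption": ["a red car", "a blue bike", "a red truck"]})

# Each element of the returned Series is the embedding for the corresponding
# row; store it next to the source column.
df["caption_embedding"] = compute_embeddings(df["caption"], column_type="text")

# The two columns are then wired together when summarizing, e.g.
# embedding_column_map={"caption": "caption_embedding"}.
```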
Summary
summary: pd.DataFrame
property
Dataframe containing the summary statistics.
embeddings_summary: pd.DataFrame
property
Dataframe containing the embeddings summary statistics if there are embeddings, otherwise None.
partition_key: str
property
Partition key column.
partition: str
property
Partition value.
columns: typing.List[str]
property
Columns for which summary statistics were computed.
non_embedding_columns: typing.List[str]
property
Columns for which summary statistics were computed. Ignores embedding columns.
embedding_examples(embedding_key_column)
Returns examples in each embedding cluster for the given embedding key column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embedding_key_column | str | Column name representing the embedding key. | required |
Raises:
Type | Description |
---|---|
ValueError | If there are no embedding examples. |
ValueError | If the embedding key column does not exist. |
Returns:
Type | Description |
---|---|
pd.DataFrame | Examples in each embedding cluster. Contains the columns partition_key, embedding_key_column, embedding_value_column, and cluster. |
embedding_centroids(embedding_key_column)
Returns embedding centroids for the given embedding key column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embedding_key_column | str | Column name representing the embedding key. | required |
Raises:
Type | Description |
---|---|
ValueError | If there are no embedding examples. |
ValueError | If the embedding key column does not exist. |
Returns:
Type | Description |
---|---|
np.ndarray | Matrix of embedding centroids, of size (num_clusters, embedding_dim). |
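A brief sketch of both embedding methods; it assumes a Summary object `summary` built with embedding_column_map={"caption": "caption_embedding"} as in the compute_embeddings example above:

```python
# Rows sampled from each embedding cluster for the "caption" key column.
examples = summary.embedding_examples("caption")
print(examples["cluster"].value_counts())

# Cluster centroids as a matrix of shape (num_clusters, embedding_dim).
centroids = summary.embedding_centroids("caption")
print(centroids.shape)
```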
statistics()
Returns list of statistics computed for each column:
- coverage: Fraction of rows that are not null.
- mean: Mean of the column.
- p50: Median of the column.
- num_unique_values: Number of unique values in the column.
- occurrence_ratio: Ratio of the most common value to all other values.
- p95: 95th percentile of the column.
value()
Combines the summary and embeddings summary into a single dataframe.
Returns:
Type | Description |
---|---|
pd.DataFrame | Summary including embeddings, if they exist. |
__str__()
String representation of the object's value (i.e., summary).
Usage: print(summary)
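For example, assuming `summary` is one of the Summary objects returned by summarize above:

```python
print(summary.statistics())  # names of the statistics computed per column
stats_df = summary.value()   # summary statistics plus embeddings summary, if present
print(summary)               # string form of the object's value
```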
detect_drift(current_summary, previous_summaries, validity=[], cluster=True, k=3)
Computes whether the current partition summary has drifted from previous summaries.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
current_summary | Summary | Partition summary for the current partition. | required |
previous_summaries | typing.List[Summary] | Previous partition summaries. | required |
validity | typing.List[int] | Indicator list identifying which previous partition summaries are valid: 1 if valid, 0 if invalid. If empty, all partition summaries are assumed valid. Must be empty or equal in length to previous_summaries. | [] |
cluster | bool | Whether or not to cluster columns in summaries. Clustering increases runtime but also increases precision in drift detection. Only engaged if summaries have more than 10 columns. Defaults to True. | True |
k | int | Number of nearest neighbor partitions to inspect. Defaults to 3. | 3 |
Returns:
Type | Description |
---|---|
DriftResult | DriftResult object with the drift score and score percentile. |
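An end-to-end sketch; it assumes detect_drift is imported from the package and that `summaries` is the list built in the summarize example above:

```python
current = summaries[-1]     # Summary for the newest partition
previous = summaries[:-1]   # Summaries for all earlier partitions

result = detect_drift(current, previous, k=3)

print(result.score)             # distance to the k nearest neighbor partitions
print(result.score_percentile)  # percentile of that score among all scores
if result.is_drifted:           # True when the score percentile exceeds 95%
    print(result)               # drift score, percentile, and top drifted columns
```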
DriftResult
summary: Summary
property
Summary of the partition.
neighbor_summaries: typing.List[Summary]
property
Summaries of the nearest neighbors of the partition.
score: float
property
Distance from the partition to its k nearest neighbors.
score_percentile: float
property
Percentile of the partition's score in the distribution of all scores.
is_drifted: bool
property
Indicates whether the partition is drifted or not, compared to previous partitions. This is determined by the percentile of the partition's score in the distribution of all scores. The threshold is 95%.
all_scores: pd.Series
property
Scores of all previous partitions.
clustering: typing.Dict[int, typing.List[str]]
property
Clustering of the columns based on their partition summaries and meaning of column names (determined via embeddings). Returns a dictionary with cluster numbers as keys and lists of columns as values.
drifted_examples(embedding_key_column)
Returns examples from the current partition that are most drifted, in embedding space, from their nearest neighbors in previous partitions.
Raises an error if embedding_key_column is not a valid embedding key column, or if there are no embedding columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embedding_key_column | str | Column that represents the embedding key (e.g., text, image). | required |
Returns:
Type | Description |
---|---|
typing.Dict[str, pd.DataFrame] | Dictionary with two keys: "drifted_examples" and "corresponding_examples". The value of each key is a dataframe with columns "partition_key", "embedding_key_column", and "embedding_value_column". |
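For instance, continuing the detect_drift sketch and assuming embeddings were summarized for a "caption" key column:

```python
examples = result.drifted_examples("caption")
drifted = examples["drifted_examples"]         # most drifted rows from the current partition
matches = examples["corresponding_examples"]   # corresponding rows from previous partitions
```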
drill_down(sort_by_cluster_score=False, average_embedding_columns=True)
Computes the columns with the highest-magnitude anomaly scores. Anomaly scores are computed as the z-score of each column's statistics with respect to previous partition summary statistics.
The resulting dataframe has the following schema (column and statistic are indexes):
- column: Name of the column
- statistic: Name of the statistic
- z-score: z-score of the column
- cluster: Cluster number that the column belongs to (if clustering was performed)
- abs(z-score-cluster): absolute value of the average z-score of the column in the cluster (if clustering was performed)
Use the drifted_columns method first, since drifted_columns deduplicates columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sort_by_cluster_score | bool | Whether to sort by cluster z-score. Defaults to False. | False |
average_embedding_columns | bool | Whether to average statistics across embedding dimensions. Defaults to True. | True |
Returns:
Type | Description |
---|---|
pd.DataFrame | Dataframe with the columns that have the highest-magnitude anomaly scores, sorted by the magnitude of each column's z-score. If clustering was performed, the dataframe is sorted by the magnitude of the cluster z-score before the column score. |
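A quick sketch, continuing the detect_drift example:

```python
# Full per-statistic breakdown, indexed by (column, statistic).
breakdown = result.drill_down(sort_by_cluster_score=True)
print(breakdown.head(10))
```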
drifted_columns(limit=10, average_embedding_columns=True)
Returns the top limit columns that have drifted. The resulting dataframe has the following schema (column is an index):
- column: Name of the column
- statistic: Name of the statistic
- z-score: z-score of the column
- cluster: Cluster number of the column (if clustering was performed)
- abs(z-score-cluster): z-score of the column in the cluster (if clustering was performed)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
limit | int | Limit for the number of drifted columns to return. Defaults to 10. | 10 |
average_embedding_columns | bool | Whether to average statistics across embedding dimensions. Defaults to True. | True |
Returns:
Type | Description |
---|---|
pd.DataFrame | Dataframe with the columns that have the highest-magnitude z-scores. If clustering was performed, the dataframe also contains the cluster z-score and cluster number. Columns are deduplicated, so only the statistic with the highest-magnitude z-score is returned for each column. |
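For example, continuing the detect_drift sketch:

```python
# One row per drifted column, keeping only its highest-magnitude statistic.
top = result.drifted_columns(limit=5)
print(top)
```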
__str__()
Prints the drift score, percentile, and the top drifted columns.
type_to_statistics(t)
Returns the statistics that are computed for a given type.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
t | str | Type (one of "int", "float", "string", "embedding"). | required |
Returns:
Type | Description |
---|---|
typing.List[str] | List of statistics that are computed for the type. Partition summaries will have NaNs for statistics that are not computed. |
Raises:
Type | Description |
---|---|
ValueError | If the type is unknown. |
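A final sketch; it assumes type_to_statistics is imported from the package, and the exact statistics returned per type are not enumerated in this reference:

```python
print(type_to_statistics("float"))   # e.g. statistics such as mean, p50, p95
print(type_to_statistics("string"))  # e.g. coverage, num_unique_values, occurrence_ratio

try:
    type_to_statistics("datetime")   # not one of the supported types
except ValueError as err:
    print(err)
```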