API Reference
summarize(df, columns=[], embedding_column_map={}, partition_key='', previous_summaries=[])
This function computes partition-wide summary statistics for the given columns. df can have multiple partitions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | pd.DataFrame | Dataframe to summarize. | required |
columns | typing.List[str] | List of columns to generate summary statistics for. Must be a subset of df.columns. If empty, previous_summaries must not be empty. | [] |
embedding_column_map | typing.Dict[str, str] | Dictionary mapping each embedding key column to its embedding value column. Keys and values must be in df.columns. If empty, previous_summaries must not be empty. | {} |
partition_key | str | Name of the column to partition the dataframe by. Must be in df.columns. Can be empty if no partitioning is desired, or if the dataframe represents a single partition. If empty, previous_summaries must not be empty. | '' |
previous_summaries | typing.List[Summary] | List of Summary objects representing previous partition summaries. | [] |
Returns:
Type | Description |
---|---|
typing.List[Summary] | List of Summary objects, one per distinct partition found in df. |
Raises:
Type | Description |
---|---|
ValueError | If … |
ValueError | If … |
ValueError | If … |
ValueError | If … |
ValueError | If any column in … |
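A minimal usage sketch; the import path is not shown in this reference, and the dataframe, column names, and values below are illustrative only:

```python
import pandas as pd

# Assumes summarize has been imported from the package; the exact import path
# is not shown in this reference.

# Toy dataframe with a daily partition column and two feature columns.
df = pd.DataFrame(
    {
        "date": ["2023-01-01", "2023-01-01", "2023-01-02", "2023-01-02"],
        "price": [9.99, 12.50, 10.25, 11.75],
        "category": ["a", "b", "a", "a"],
    }
)

# One Summary is returned per distinct value of the partition key.
summaries = summarize(df, columns=["price", "category"], partition_key="date")

# A later partition can be summarized incrementally: with previous_summaries
# provided, columns and partition_key may be left empty.
new_partition = pd.DataFrame(
    {"date": ["2023-01-03", "2023-01-03"], "price": [10.10, 13.40], "category": ["b", "b"]}
)
summaries += summarize(new_partition, previous_summaries=summaries)
```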
compute_embeddings(column, column_type)
Computes embeddings for a Series with the huggingface/transformers library. We use the clip-ViT-B-32 model. This is an optional function; we recommend you compute embeddings yourself.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | pd.Series | Series to compute embeddings for. Must be of string type. Can contain either paths to files or text. | required |
column_type | str | Type of the column. Must be "text" or "image". | required |
Returns:
Type | Description |
---|---|
pd.Series | Series of embeddings to add to your DataFrame. |
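A short sketch of attaching embeddings before summarizing; the import path and column names here are illustrative only:

```python
import pandas as pd

# Assumes compute_embeddings has been imported from the package.
df = pd.DataFrame({"caption": ["a red car", "a blue bike", "a red truck"]})

# Each element of the returned Series is the embedding for the corresponding
# row; store it next to the source column.
df["caption_embedding"] = compute_embeddings(df["caption"], column_type="text")

# The two columns are then wired together when summarizing, e.g.
# embedding_column_map={"caption": "caption_embedding"}.
```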
Summary
summary: pd.DataFrame
property
Dataframe containing the summary statistics.
embeddings_summary: pd.DataFrame
property
Dataframe containing the embeddings summary statistics if there are embeddings, otherwise None.
partition_key: str
property
Partition key column.
partition: str
property
Partition value.
columns: typing.List[str]
property
Columns for which summary statistics were computed.
non_embedding_columns: typing.List[str]
property
Columns for which summary statistics were computed. Ignores embedding columns.
embedding_examples(embedding_key_column)
Returns examples in each embedding cluster for the given embedding key column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embedding_key_column | str | Column name representing the embedding key. | required |
Raises:
Type | Description |
---|---|
ValueError | If there are no embedding examples. |
ValueError | If the embedding key column does not exist. |
Returns:
Type | Description |
---|---|
pd.DataFrame | Examples in each embedding cluster. Contains the columns partition_key, embedding_key_column, embedding_value_column, and cluster. |
embedding_centroids(embedding_key_column)
Returns embedding centroids for the given embedding key column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embedding_key_column | str | Column name representing the embedding key. | required |
Raises:
Type | Description |
---|---|
ValueError | If there are no embedding examples. |
ValueError | If the embedding key column does not exist. |
Returns:
Type | Description |
---|---|
np.ndarray | Matrix of embedding centroids, of size (num_clusters, embedding_dim). |
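A brief sketch of both embedding methods; it assumes a Summary object `summary` built with embedding_column_map={"caption": "caption_embedding"} as in the compute_embeddings example above:

```python
# Rows sampled from each embedding cluster for the "caption" key column.
examples = summary.embedding_examples("caption")
print(examples["cluster"].value_counts())

# Cluster centroids as a matrix of shape (num_clusters, embedding_dim).
centroids = summary.embedding_centroids("caption")
print(centroids.shape)
```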
statistics()
Returns list of statistics computed for each column:
- coverage: Fraction of rows that are not null.
- mean: Mean of the column.
- p50: Median of the column.
- num_unique_values: Number of unique values in the column.
- occurrence_ratio: Ratio of the most common value to all other values.
- p95: 95th percentile of the column.
value()
Combines the summary and embeddings summary into a single dataframe.
Returns:
Type | Description |
---|---|
pd.DataFrame | Summary including embeddings, if they exist. |
__str__()
String representation of the object's value (i.e., summary).
Usage: print(summary)
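For example, assuming `summary` is one of the Summary objects returned by summarize above:

```python
print(summary.statistics())  # names of the statistics computed per column
stats_df = summary.value()   # summary statistics plus embeddings summary, if present
print(summary)               # string form of the object's value
```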
detect_drift(current_summary, previous_summaries, validity=[], cluster=True, k=3)
Computes whether the current partition summary has drifted from previous summaries.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
current_summary | Summary | Partition summary for the current partition. | required |
previous_summaries | typing.List[Summary] | Previous partition summaries. | required |
validity | typing.List[int] | Indicator list identifying which previous partition summaries are valid: 1 if valid, 0 if invalid. If empty, all partition summaries are assumed valid. Must be empty or equal in length to previous_summaries. | [] |
cluster | bool | Whether or not to cluster columns in summaries. Clustering increases runtime but also increases precision in drift detection. Only engaged if summaries have more than 10 columns. Defaults to True. | True |
k | int | Number of nearest neighbor partitions to inspect. Defaults to 3. | 3 |
Returns:
Type | Description |
---|---|
DriftResult | DriftResult object with the drift score and score percentile. |
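An end-to-end sketch; it assumes detect_drift is imported from the package and that `summaries` is the list built in the summarize example above:

```python
current = summaries[-1]     # Summary for the newest partition
previous = summaries[:-1]   # Summaries for all earlier partitions

result = detect_drift(current, previous, k=3)

print(result.score)             # distance to the k nearest neighbor partitions
print(result.score_percentile)  # percentile of that score among all scores
if result.is_drifted:           # True when the score percentile exceeds 95%
    print(result)               # drift score, percentile, and top drifted columns
```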
DriftResult
summary: Summary
property
Summary of the partition.
neighbor_summaries: typing.List[Summary]
property
Summaries of the nearest neighbors of the partition.
score: float
property
Distance from the partition to its k nearest neighbors.
score_percentile: float
property
Percentile of the partition's score in the distribution of all scores.
is_drifted: bool
property
Indicates whether the partition is drifted or not, compared to previous partitions. This is determined by the percentile of the partition's score in the distribution of all scores. The threshold is 95%.
all_scores: pd.Series
property
Scores of all previous partitions.
clustering: typing.Dict[int, typing.List[str]]
property
Clustering of the columns based on their partition summaries and meaning of column names (determined via embeddings). Returns a dictionary with cluster numbers as keys and lists of columns as values.
drifted_examples(embedding_key_column)
Returns examples from the current partition that are most drifted, in embedding space, from their nearest neighbors in previous partitions.
Raises an error if embedding_key_column is not a valid embedding key column, or if there are no embedding columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embedding_key_column | str | Column that represents the embedding key (e.g., text, image). | required |
Returns:
Type | Description |
---|---|
typing.Dict[str, pd.DataFrame] | Dictionary with two keys: "drifted_examples" and "corresponding_examples". The value of each key is a dataframe with columns "partition_key", "embedding_key_column", and "embedding_value_column". |
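For instance, continuing the detect_drift sketch and assuming embeddings were summarized for a "caption" key column:

```python
examples = result.drifted_examples("caption")
drifted = examples["drifted_examples"]         # most drifted rows from the current partition
matches = examples["corresponding_examples"]   # corresponding rows from previous partitions
```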
drill_down(sort_by_cluster_score=False, average_embedding_columns=True)
Computes the columns with the highest-magnitude anomaly scores. Anomaly scores are computed as the z-score of each column's statistics with respect to previous partition summary statistics.
The resulting dataframe has the following schema (column and statistic are indexes):
- column: Name of the column
- statistic: Name of the statistic
- z-score: z-score of the column
- cluster: Cluster number that the column belongs to (if clustering was performed)
- abs(z-score-cluster): absolute value of the average z-score of the column in the cluster (if clustering was performed)
Use the drifted_columns method first, since drifted_columns deduplicates columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sort_by_cluster_score | bool | Whether to sort by cluster z-score. Defaults to False. | False |
average_embedding_columns | bool | Whether to average statistics across embedding dimensions. Defaults to True. | True |
Returns:
Type | Description |
---|---|
pd.DataFrame | Dataframe with the columns that have the highest-magnitude anomaly scores, sorted by the magnitude of each column's z-score. If clustering was performed, the dataframe is sorted by the magnitude of the cluster z-score before the column score. |
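A quick sketch, continuing the detect_drift example:

```python
# Full per-statistic breakdown, indexed by (column, statistic).
breakdown = result.drill_down(sort_by_cluster_score=True)
print(breakdown.head(10))
```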
drifted_columns(limit=10, average_embedding_columns=True)
Returns the top limit columns that have drifted. The resulting dataframe has the following schema (column is an index):
- column: Name of the column
- statistic: Name of the statistic
- z-score: z-score of the column
- cluster: Cluster number of the column (if clustering was performed)
- abs(z-score-cluster): z-score of the column in the cluster (if clustering was performed)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
limit | int | Limit for the number of drifted columns to return. Defaults to 10. | 10 |
average_embedding_columns | bool | Whether to average statistics across embedding dimensions. Defaults to True. | True |
Returns:
Type | Description |
---|---|
pd.DataFrame | Dataframe with the columns that have the highest-magnitude z-scores. If clustering was performed, the dataframe also contains the cluster z-score and cluster number. Columns are deduplicated, so only the statistic with the highest-magnitude z-score is returned for each column. |
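For example, continuing the detect_drift sketch:

```python
# One row per drifted column, keeping only its highest-magnitude statistic.
top = result.drifted_columns(limit=5)
print(top)
```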
__str__()
Prints the drift score, percentile, and the top drifted columns.
type_to_statistics(t)
Returns the statistics that are computed for a given type.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
t | str | Type (one of "int", "float", "string", "embedding"). | required |
Returns:
Type | Description |
---|---|
typing.List[str] | List of statistics that are computed for the type. Partition summaries will have NaNs for statistics that are not computed. |
Raises:
Type | Description |
---|---|
ValueError | If the type is unknown. |
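A final sketch; it assumes type_to_statistics is imported from the package, and the exact statistics returned per type are not enumerated in this reference:

```python
print(type_to_statistics("float"))   # e.g. statistics such as mean, p50, p95
print(type_to_statistics("string"))  # e.g. coverage, num_unique_values, occurrence_ratio

try:
    type_to_statistics("datetime")   # not one of the supported types
except ValueError as err:
    print(err)
```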