API Reference

summarize(df, columns=[], embedding_column_map={}, partition_key='', previous_summaries=[])

This function computes partition-wide summary statistics for the given columns. df can have multiple partitions.

Parameters:

Name Type Description Default
df pd.DataFrame

Dataframe to summarize.

required
columns typing.List[str]

List of columns to generate summary statistics for. Must be a subset of df.columns. If empty, previous_summaries must not be empty.

[]
embedding_column_map typing.Dict[str, str]

Dictionary of embedding key to embedding value column. Keys and values must be in df.columns. If empty, previous_summaries must not be empty.

{}
partition_key str

Name of column to partition the dataframe by. Must be in df.columns. Can be empty if no partitioning is desired, or if the dataframe represents a single partition. If empty, previous_summaries must not be empty.

''
previous_summaries typing.List[Summary]

List of Summary objects representing previous partition summaries.

[]

Returns:

Type Description
typing.List[Summary]

typing.List[Summary]: List of Summary objects, one per distinct partition found in df.

Raises:

Type Description
ValueError

If partition_key is "group".

ValueError

If columns is empty and previous_summaries is empty.

ValueError

If partition_key is empty and previous_summaries is empty.

ValueError

If partition_key is not in df.columns.

ValueError

If any column in columns is not in df.columns.
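The error conditions listed above can be read as a sequence of argument checks. The sketch below mirrors them in plain Python. check_summarize_args and raises_value_error are hypothetical helpers, not part of the library, and the order in which the library performs its checks is an assumption.

```python
from typing import List, Sequence


def check_summarize_args(
    df_columns: Sequence[str],
    columns: List[str],
    partition_key: str,
    has_previous_summaries: bool,
) -> None:
    """Hypothetical helper mirroring the documented ValueError conditions."""
    if partition_key == "group":
        raise ValueError('partition_key must not be "group"')
    if not columns and not has_previous_summaries:
        raise ValueError("columns is empty and previous_summaries is empty")
    if not partition_key and not has_previous_summaries:
        raise ValueError("partition_key is empty and previous_summaries is empty")
    if partition_key and partition_key not in df_columns:
        raise ValueError(f"partition_key {partition_key!r} is not in df.columns")
    missing = [c for c in columns if c not in df_columns]
    if missing:
        raise ValueError(f"columns not in df.columns: {missing}")


def raises_value_error(**kwargs) -> bool:
    """Return True if the checks reject the given arguments."""
    try:
        check_summarize_args(**kwargs)
    except ValueError:
        return True
    return False
```

With no previous summaries, both columns and partition_key have to be supplied; once previous summaries exist, both may be left empty.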

compute_embeddings(column, column_type)

Computes embeddings for a Series with the huggingface/transformers library. We use the clip-ViT-B-32 model. This is an optional function; we recommend you compute embeddings yourself.

Parameters:

Name Type Description Default
column pd.Series

Series to compute embeddings for. Must be of string type. Can contain either paths to files or text.

required
column_type str

Type of the column. Must be "text" or "image".

required

Returns:

Type Description
pd.Series

pd.Series: Series of embeddings to add to your DataFrame.

Summary

summary: pd.DataFrame property

Dataframe containing the summary statistics.

embeddings_summary: pd.DataFrame property

Dataframe containing the embeddings summary statistics if there are embeddings, otherwise None.

partition_key: str property

Partition key column.

partition: str property

Partition value.

columns: typing.List[str] property

Columns for which summary statistics were computed.

non_embedding_columns: typing.List[str] property

Columns for which summary statistics were computed. Ignores embedding columns.

embedding_examples(embedding_key_column)

Returns examples in each embedding cluster for the given embedding key column.

Parameters:

Name Type Description Default
embedding_key_column str

Column name representing the embedding key.

required

Raises:

Type Description
ValueError

If there are no embedding examples.

ValueError

If the embedding key column does not exist.

Returns:

Type Description
pd.DataFrame

pd.DataFrame: Examples in each embedding cluster. Contains the columns partition_key, embedding_key_column, embedding_value_column, and cluster.

embedding_centroids(embedding_key_column)

Returns embedding centroids for the given embedding key column.

Parameters:

Name Type Description Default
embedding_key_column str

Column name representing the embedding key.

required

Raises:

Type Description
ValueError

If there are no embedding examples.

ValueError

If the embedding key column does not exist.

Returns:

Type Description
np.ndarray

np.ndarray: Matrix of embedding centroids, size (num_clusters, embedding_dim).
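To make the (num_clusters, embedding_dim) shape concrete, here is a minimal sketch of what a centroid matrix is: one mean vector per cluster. It uses plain lists rather than np.ndarray, takes the cluster labels as given, and does not reflect how the library assigns clusters. The centroids function is a hypothetical name.

```python
from typing import Dict, List


def centroids(vectors: List[List[float]], labels: List[int]) -> List[List[float]]:
    """Return one mean vector per cluster label, ordered by label."""
    if len(vectors) != len(labels):
        raise ValueError("vectors and labels must have the same length")
    groups: Dict[int, List[List[float]]] = {}
    for vec, label in zip(vectors, labels):
        groups.setdefault(label, []).append(vec)
    result = []
    for label in sorted(groups):
        members = groups[label]
        dim = len(members[0])
        # Average each embedding dimension across the cluster's members.
        result.append([sum(v[i] for v in members) / len(members) for i in range(dim)])
    return result


# Three 2-dimensional embeddings in two clusters give a 2 x 2 result.
example = centroids([[0.0, 0.0], [2.0, 2.0], [10.0, 0.0]], labels=[0, 0, 1])
```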

statistics()

Returns list of statistics computed for each column:

  • coverage: Fraction of rows that are not null.
  • mean: Mean of the column.
  • p50: Median of the column.
  • num_unique_values: Number of unique values in the column.
  • occurrence_ratio: Ratio of the most common value to all other values.
  • p95: 95th percentile of the column.
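The statistics above can be illustrated on a single column of values. This is a minimal sketch, not the library's implementation. It assumes nulls are represented as None, that percentiles use linear interpolation, and that occurrence_ratio is the count of the most common value divided by the count of all other non-null values. column_statistics and percentile are hypothetical names.

```python
import math
from collections import Counter
from typing import Dict, List, Optional


def percentile(sorted_values: List[float], q: float) -> float:
    """Linearly interpolated percentile of an already sorted list (q in [0, 1])."""
    if not sorted_values:
        return math.nan
    pos = (len(sorted_values) - 1) * q
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def column_statistics(values: List[Optional[float]]) -> Dict[str, float]:
    """Compute the six documented statistics for one column."""
    present = sorted(v for v in values if v is not None)
    top_count = Counter(present).most_common(1)[0][1] if present else 0
    rest = len(present) - top_count
    return {
        "coverage": len(present) / len(values) if values else math.nan,
        "mean": sum(present) / len(present) if present else math.nan,
        "p50": percentile(present, 0.50),
        "num_unique_values": float(len(set(present))),
        # Assumption: most common value's count over the count of all other values.
        "occurrence_ratio": top_count / rest if rest else math.nan,
        "p95": percentile(present, 0.95),
    }


stats = column_statistics([1.0, 2.0, 2.0, None, 5.0])
```

Under this reading, occurrence_ratio is undefined (NaN here) when every non-null value is identical. The library may handle that case differently.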

value()

Combines the summary and embeddings summary into a single dataframe.

Returns:

Type Description
pd.DataFrame

pd.DataFrame: Summary including embeddings, if exists.

__str__()

String representation of the object's value (i.e., summary).

Usage: print(summary)

detect_drift(current_summary, previous_summaries, validity=[], cluster=True, k=3)

Computes whether the current partition summary has drifted from previous summaries.

Parameters:

Name Type Description Default
current_summary Summary

Partition summary for current partition.

required
previous_summaries typing.List[Summary]

Previous partition summaries.

required
validity typing.List[int]

Indicator list identifying which partition summaries are valid. 1 if valid, 0 if invalid. If empty, we assume all partition summaries are valid. Must be empty or equal to length of previous_summaries.

[]
cluster bool

Whether or not to cluster columns in summaries. Increases runtime but also increases precision in drift detection. Only engaged if summaries have more than 10 columns. Defaults to True.

True
k int

Number of nearest neighbor partitions to inspect. Defaults to 3.

3

Returns:

Type Description
DriftResult

DriftResult: DriftResult object with score and score percentile.
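The validity and k parameters can be pictured with a small nearest-neighbor sketch. It assumes each partition summary has already been reduced to a numeric vector, and that the score is the mean Euclidean distance to the k nearest valid previous partitions. Both are assumptions made for illustration; the library's actual distance and aggregation are not stated here. valid_vectors and knn_score are hypothetical names.

```python
import math
from typing import List, Sequence


def valid_vectors(
    previous: Sequence[Sequence[float]], validity: Sequence[int]
) -> List[Sequence[float]]:
    """Keep previous partitions flagged as valid (1); an empty validity keeps all."""
    if not validity:
        return list(previous)
    if len(validity) != len(previous):
        raise ValueError("validity must be empty or match len(previous_summaries)")
    return [vec for vec, ok in zip(previous, validity) if ok == 1]


def knn_score(
    current: Sequence[float], previous: Sequence[Sequence[float]], k: int = 3
) -> float:
    """Assumed aggregation: mean Euclidean distance to the k nearest vectors."""
    if not previous:
        raise ValueError("need at least one previous partition")
    distances = sorted(math.dist(current, vec) for vec in previous)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


# Four hypothetical previous partitions; the last one is marked invalid.
history = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [100.0, 100.0]]
kept = valid_vectors(history, validity=[1, 1, 1, 0])
score = knn_score([0.0, 0.0], kept, k=2)
```

Marking a known-bad partition as invalid keeps it out of the neighbor search, so it cannot make a similarly bad current partition look normal.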

DriftResult

summary: Summary property

Summary of the partition.

neighbor_summaries: typing.List[Summary] property

Summaries of the nearest neighbors of the partition.

score: float property

Distance from the partition to its k nearest neighbors.

score_percentile: float property

Percentile of the partition's score in the distribution of all scores.

is_drifted: bool property

Indicates whether the partition is drifted or not, compared to previous partitions. This is determined by the percentile of the partition's score in the distribution of all scores. The threshold is 95%.

all_scores: pd.Series property

Scores of all previous partitions.
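The relationship between score, all_scores, score_percentile, and is_drifted can be sketched as follows. This assumes the percentile is expressed as a fraction in [0, 1], counts previous scores less than or equal to the current score, and treats the 95% threshold as inclusive. The function names are hypothetical and the library's exact percentile definition may differ.

```python
from typing import Sequence


def score_percentile(score: float, all_scores: Sequence[float]) -> float:
    """Fraction of previous scores that are less than or equal to `score`."""
    if not all_scores:
        raise ValueError("all_scores must not be empty")
    return sum(1 for s in all_scores if s <= score) / len(all_scores)


def is_drifted(
    score: float, all_scores: Sequence[float], threshold: float = 0.95
) -> bool:
    # Assumption: the 95% threshold is inclusive.
    return score_percentile(score, all_scores) >= threshold


# Twenty hypothetical previous scores: 0.1, 0.2, ..., 2.0
previous_scores = [0.1 * i for i in range(1, 21)]
```

A score larger than every previous score lands at the top of the distribution and is flagged; a score in the middle of the range is not.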

clustering: typing.Dict[int, typing.List[str]] property

Clustering of the columns based on their partition summaries and meaning of column names (determined via embeddings). Returns a dictionary with cluster numbers as keys and lists of columns as values.

drifted_examples(embedding_key_column)

Returns some examples from the partition that are most drifted from nearest neighbors in the embedding space in previous partitions.

Raises an error if embedding_key_column is not a valid embedding key column, or if there are no embedding columns.

Parameters:

Name Type Description Default
embedding_key_column str

Column that represents the embedding key (e.g., text, image).

required

Returns:

Type Description
typing.Dict[str, pd.DataFrame]

typing.Dict[str, pd.DataFrame]: Dictionary with two keys: "drifted_examples" and "corresponding_examples". The value of each key is a dataframe with columns "partition_key", "embedding_key_column", and "embedding_value_column".

drill_down(sort_by_cluster_score=False, average_embedding_columns=True)

Compute the columns with highest magnitude anomaly scores. Anomaly scores are computed as the z-score of the column with respect to previous partition summary statistics.

The resulting dataframe has the following schema (column, statistic are indexes):

  • column: Name of the column
  • statistic: Name of the statistic
  • z-score: z-score of the column
  • cluster: Cluster number that the column belongs to (if clustering was performed)
  • abs(z-score-cluster): absolute value of the average z-score of the column in the cluster (if clustering was performed)

Use the drifted_columns method first, since drifted_columns deduplicates columns.
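The anomaly score described above is a standard z-score: how many standard deviations the current partition's statistic sits from the mean of the same statistic in previous partitions. A minimal sketch, assuming the sample standard deviation is used (the library may use a different estimator); z_score is a hypothetical name.

```python
import statistics
from typing import Sequence


def z_score(current: float, previous: Sequence[float]) -> float:
    """z-score of `current` against the mean and sample std of `previous`."""
    if len(previous) < 2:
        raise ValueError("need at least two previous values")
    mean = statistics.fmean(previous)
    std = statistics.stdev(previous)
    if std == 0:
        # No variation in history, so the z-score is undefined.
        return float("nan")
    return (current - mean) / std


# Hypothetical values of one statistic (e.g. mean) for one column
# across three previous partitions, followed by the current value.
previous_means = [10.0, 12.0, 14.0]
z = z_score(18.0, previous_means)
```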

Parameters:

Name Type Description Default
sort_by_cluster_score bool

Whether to sort by cluster z-score. Defaults to False.

False
average_embedding_columns bool

Whether to average statistics across embedding dimensions. Defaults to True.

True

Returns:

Type Description
pd.DataFrame

pd.DataFrame: Dataframe with columns with highest magnitude anomaly scores. Sorted by the magnitude of the z-score for a column. If clustering was performed, the dataframe will be sorted by the magnitude of the z-score in the cluster before the column score.

drifted_columns(limit=10, average_embedding_columns=True)

Returns the top limit columns that have drifted. The resulting dataframe has the following schema (column is an index):

  • column: Name of the column
  • statistic: Name of the statistic
  • z-score: z-score of the column
  • cluster: Cluster number of the column (if clustering was performed)
  • abs(z-score-cluster): absolute value of the average z-score of the column in the cluster (if clustering was performed)

Parameters:

Name Type Description Default
limit int

Limit for number of drifted columns to return. Defaults to 10.

10
average_embedding_columns bool

Whether to average statistics across embedding dimensions. Defaults to True.

True

Returns:

Type Description
pd.DataFrame

pd.DataFrame: Dataframe with columns with highest magnitude z-scores. If clustering was performed, the dataframe will also contain the z-score in the cluster and the cluster number. Each column is deduplicated, so only the statistic with the highest magnitude z-score is returned.
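The deduplication described above can be sketched on plain tuples instead of a dataframe: for each column keep only the statistic with the largest absolute z-score, then take the top limit columns. Sorting by descending magnitude is an assumption, and top_drifted is a hypothetical helper, not the library's code.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, float]  # (column, statistic, z-score)


def top_drifted(rows: List[Row], limit: int = 10) -> List[Row]:
    """Keep each column's highest-magnitude z-score, then return the top `limit`."""
    best: Dict[str, Row] = {}
    for row in rows:
        column, _, z = row
        if column not in best or abs(z) > abs(best[column][2]):
            best[column] = row
    ranked = sorted(best.values(), key=lambda r: abs(r[2]), reverse=True)
    return ranked[:limit]


# Hypothetical drill-down rows with two statistics per column.
rows = [
    ("price", "mean", 1.2),
    ("price", "p95", -4.0),
    ("qty", "coverage", 2.5),
    ("qty", "mean", 0.3),
]
result = top_drifted(rows, limit=2)
```

Each column appears once in the result, which is why drifted_columns gives a shorter overview than drill_down.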

__str__()

Prints the drift score, percentile, and the top drifted columns.

type_to_statistics(t)

Returns the statistics that are computed for a given type.

Parameters:

Name Type Description Default
t str

Type (one of "int", "float", "string", "embedding").

required

Returns:

Type Description
typing.List[str]

typing.List[str]: List of statistics that are computed for the type. Partition summaries will have NaNs for statistics that are not computed.

Raises:

Type Description
ValueError

If the type is unknown.