iguanas.correlation_reduction.AgglomerativeClusteringReducer

class iguanas.correlation_reduction.AgglomerativeClusteringReducer(threshold: float, strategy: str, similarity_function: Callable, columns_performance=None)

Removes similar columns, as measured by a given similarity function. The similarity matrix is calculated first; Agglomerative Clustering is then run iteratively on that matrix, and columns found to be correlated are dropped. Only one column per cluster is kept.
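As an illustration of the similarity matrix mentioned above, the following is a minimal sketch of a function that computes column-to-column cosine similarity and returns it as a dataframe. The function name `cosine_similarity_matrix` is hypothetical, and labelling both axes of the result with the column names of X is an assumption about the expected layout, not something this page states.

```python
import numpy as np
import pandas as pd


def cosine_similarity_matrix(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: cosine similarity between every pair of columns."""
    values = X.to_numpy(dtype=float)
    # Length (L2 norm) of each column.
    norms = np.linalg.norm(values, axis=0)
    # Avoid division by zero: all-zero columns end up with similarity 0 everywhere.
    norms[norms == 0] = 1.0
    normalised = values / norms
    # Dot products of unit-length columns are their cosine similarities.
    similarity = normalised.T @ normalised
    # Assumption: rows and columns of the result are labelled by X's column names.
    return pd.DataFrame(similarity, index=X.columns, columns=X.columns)


X = pd.DataFrame({"A": [1, 0, 1, 1], "B": [1, 0, 1, 1], "C": [0, 1, 0, 0]})
sim = cosine_similarity_matrix(X)
```

Here columns A and B are identical, so their similarity is 1, while C never fires on the same rows as A, so their similarity is 0.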

Parameters
threshold : float

The median of each cluster’s similarity metric is compared against this threshold. If the median is greater than the threshold, the columns within the cluster are deemed correlated and only the top-performing column is kept.

strategy : str

Can be either ‘top_down’ or ‘bottom_up’. ‘top_down’ begins clustering from the top, calculating two clusters per iteration. ‘bottom_up’ begins clustering from the bottom, setting the number of clusters in each iteration to half of the total number of columns.

similarity_function : Callable

The similarity function to use for calculating the similarity between columns. It must return a dataframe containing the similarity matrix. See the similarity_functions module for out-of-the-box functions.

columns_performance : PandasSeriesType, optional

Series containing the performance metric of each column (e.g. Fbeta score). This is used to determine the top-performing column per cluster. If not provided, a random column from the cluster will be kept. Defaults to None.
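The way threshold and columns_performance interact for a single cluster can be sketched as follows. This is not the library's implementation: the function name `columns_kept_from_cluster` is hypothetical, taking the median over the off-diagonal pairwise similarities is an assumption, and so is returning a cluster unchanged when it is not deemed correlated.

```python
import random
from itertools import combinations
from statistics import median
from typing import List, Optional

import pandas as pd


def columns_kept_from_cluster(
    cluster: List[str],
    similarity: pd.DataFrame,
    threshold: float,
    columns_performance: Optional[pd.Series] = None,
) -> List[str]:
    """Hypothetical sketch of the per-cluster decision described above."""
    # Assumption: the median is taken over each distinct pair of columns.
    pairwise = [similarity.loc[a, b] for a, b in combinations(cluster, 2)]
    # A single-column cluster, or one at or below the threshold, is left as-is.
    if not pairwise or median(pairwise) <= threshold:
        return list(cluster)
    # No performance data: keep a random column from the cluster.
    if columns_performance is None:
        return [random.choice(list(cluster))]
    # Otherwise keep the column with the highest performance metric.
    return [columns_performance.loc[list(cluster)].idxmax()]


names = ["A", "B", "C"]
similarity = pd.DataFrame(
    [[1.0, 0.95, 0.2], [0.95, 1.0, 0.1], [0.2, 0.1, 1.0]],
    index=names,
    columns=names,
)
performance = pd.Series({"A": 0.4, "B": 0.7, "C": 0.9})

kept_correlated = columns_kept_from_cluster(["A", "B"], similarity, 0.9, performance)
kept_uncorrelated = columns_kept_from_cluster(names, similarity, 0.9, performance)
kept_random = columns_kept_from_cluster(["A", "B"], similarity, 0.9)
```

With a threshold of 0.9, the cluster A, B has a median similarity of 0.95 and collapses to B, the better performer. The cluster A, B, C has a median of 0.2 and is left untouched in this sketch.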

Attributes
columns_to_keep : List[str]

The final list of columns with the correlated columns removed.

fit(X: iguanas.utils.typing.pandas.core.frame.DataFrame, print_clustermap=False) -> None

Identifies the similar columns in the dataset X.

Parameters
X : PandasDataFrameType

Dataframe to be reduced.

print_clustermap : bool, optional

If True, the clustermap at each iteration will be printed. Defaults to False.
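The number of clusters requested in each iteration of fit depends on the strategy passed to the constructor. The sketch below restates that rule as code. The function name `n_clusters_for_iteration` is hypothetical, and two details are assumptions: that the halving rounds down, and that the column count is the number of columns considered in that iteration.

```python
def n_clusters_for_iteration(strategy: str, n_columns: int) -> int:
    """Hypothetical sketch: clusters requested in one iteration, per strategy."""
    if strategy == "top_down":
        # 'top_down' calculates two clusters per iteration.
        return 2
    if strategy == "bottom_up":
        # 'bottom_up' uses half the number of columns.
        # Assumption: round down, but never request fewer than one cluster.
        return max(1, n_columns // 2)
    raise ValueError("strategy must be 'top_down' or 'bottom_up'")


top_down_clusters = n_clusters_for_iteration("top_down", 10)
bottom_up_clusters = n_clusters_for_iteration("bottom_up", 10)
```

For ten columns, ‘top_down’ requests 2 clusters and ‘bottom_up’ requests 5 under these assumptions.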

transform(X: iguanas.utils.typing.pandas.core.frame.DataFrame) -> iguanas.utils.typing.pandas.core.frame.DataFrame

Removes similar columns from the dataset X.

Parameters
X : PandasDataFrameType

Dataframe to be reduced.

Returns
PandasDataFrameType

Dataframe with the similar columns removed.
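Conceptually, the result of transform corresponds to selecting the columns listed in columns_to_keep. The snippet below shows that selection with plain pandas. The column names and the contents of `columns_to_keep` are made up for illustration, and treating transform as a simple column selection is an assumption rather than a statement about the library's internals.

```python
import pandas as pd

X = pd.DataFrame(
    {
        "A": [1, 0, 1],
        "B": [1, 0, 1],  # duplicate of A, so assumed to be dropped
        "C": [0, 1, 0],
    }
)

# Hypothetical value of the columns_to_keep attribute after fitting.
columns_to_keep = ["A", "C"]

# Keep only the retained columns, preserving every row.
X_reduced = X[columns_to_keep]
```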

fit_transform(X: iguanas.utils.typing.pandas.core.frame.DataFrame, print_clustermap=False) -> iguanas.utils.typing.pandas.core.frame.DataFrame

Identifies the similar columns in the dataset X, then removes them.

Parameters
X : PandasDataFrameType

Dataframe of binary columns.

print_clustermap : bool, optional

If True, the clustermap at each iteration will be printed. Defaults to False.

Returns
PandasDataFrameType

Dataframe of dissimilar binary columns.
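Putting the pieces together, a call to fit_transform might look like the sketch below. It uses only the constructor arguments, method, and attribute documented on this page. The helper names `column_correlation` and `reduce_columns`, the threshold of 0.9, and the choice of absolute Pearson correlation as the similarity function are all illustrative assumptions. Whether the class accepts this particular similarity function has not been verified here, and constant columns would produce NaN similarities. The Iguanas import is placed inside the function so that the snippet loads without the package installed.

```python
import pandas as pd


def column_correlation(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical similarity function: absolute Pearson correlation."""
    # Returns a dataframe indexed and labelled by X's column names.
    return X.corr().abs()


def reduce_columns(X: pd.DataFrame, columns_performance=None):
    """Hypothetical wrapper around the documented fit_transform call."""
    # Imported lazily so the sketch can be loaded without Iguanas installed.
    from iguanas.correlation_reduction import AgglomerativeClusteringReducer

    reducer = AgglomerativeClusteringReducer(
        threshold=0.9,  # illustrative value, not a recommended default
        strategy="bottom_up",
        similarity_function=column_correlation,
        columns_performance=columns_performance,
    )
    X_reduced = reducer.fit_transform(X)
    # columns_to_keep holds the final list of retained columns.
    return X_reduced, reducer.columns_to_keep


X = pd.DataFrame({"A": [1, 0, 1, 0], "B": [1, 0, 1, 0], "C": [0, 0, 1, 1]})
sim = column_correlation(X)
```

Calling `reduce_columns(X)` requires Iguanas to be installed. Without columns_performance, the column kept from each correlated cluster is chosen at random, so results may differ between runs.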