iguanas.correlation_reduction.AgglomerativeClusteringReducer
- class iguanas.correlation_reduction.AgglomerativeClusteringReducer(threshold: float, strategy: str, similarity_function: Callable, columns_performance=None)
Removes similar columns (as measured by a given similarity function) by calculating the similarity matrix, then iteratively running agglomerative clustering on it and dropping the columns deemed correlated. Only one column is kept from each cluster of correlated columns.
- Parameters
- threshold : float
The median of the cluster’s similarity metric is compared against this threshold. If the median is greater than the threshold, the columns within the cluster are deemed correlated and only the top-performing column is kept.
- strategy : str
Either ‘top_down’ or ‘bottom_up’. ‘top_down’ begins clustering from the top, calculating two clusters per iteration. ‘bottom_up’ begins clustering from the bottom, setting the number of clusters in each iteration to half the total number of columns.
- similarity_function : Callable
The similarity function to use for calculating the similarity between columns. It must return a dataframe containing the similarity matrix. See the similarity_functions module for out-of-the-box functions.
- columns_performance : PandasSeriesType, optional
Series containing the performance metric of each column (e.g. Fbeta score). This is used to determine the top-performing column in each cluster. If not provided, a random column from the cluster is kept. Defaults to None.
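To make the `similarity_function` contract concrete, here is a minimal sketch of a custom function that returns a similarity matrix as a dataframe. The name `cosine_similarity_matrix` is hypothetical, and the assumption that the callable receives the dataset itself as its only argument is not stated on this page; check the similarity_functions module for the signature actually expected.

```python
import numpy as np
import pandas as pd


def cosine_similarity_matrix(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical similarity function: pairwise cosine similarity of columns.

    Assumption: the reducer calls the function with the dataset X and expects
    a square dataframe indexed and labelled by the column names of X.
    """
    values = X.to_numpy(dtype=float)
    # Euclidean norm of each column; zero-norm columns use 1 to avoid dividing by zero.
    norms = np.linalg.norm(values, axis=0)
    norms[norms == 0] = 1.0
    normalised = values / norms
    # Entry (i, j) is the cosine similarity between columns i and j.
    similarity = normalised.T @ normalised
    return pd.DataFrame(similarity, index=X.columns, columns=X.columns)


X = pd.DataFrame({
    "A": [1, 0, 1, 1],
    "B": [1, 0, 1, 0],
    "C": [0, 1, 0, 0],
})
similarity = cosine_similarity_matrix(X)
```

Under these assumptions, such a function could be passed as `similarity_function=cosine_similarity_matrix` when constructing the reducer.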
- Attributes
- columns_to_keep : List[str]
The final list of columns with the correlated columns removed.
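The threshold rule and the role of `columns_performance` can be sketched for a single cluster as follows. This is an illustration only, not the library's implementation: `reduce_cluster` is a hypothetical helper, the median is assumed to be taken over the distinct column pairs in the cluster, and a cluster that does not exceed the threshold is simply returned unchanged. Plain dicts stand in for the similarity dataframe and the performance series.

```python
import random
from itertools import combinations
from statistics import median


def reduce_cluster(cluster, similarity, threshold, columns_performance=None):
    """Hypothetical helper: return the columns of one cluster that are kept."""
    if len(cluster) < 2:
        return list(cluster)  # a single column has nothing to be correlated with
    # Assumption: the median is taken over the distinct column pairs.
    pair_similarities = [similarity[a][b] for a, b in combinations(cluster, 2)]
    if median(pair_similarities) <= threshold:
        return list(cluster)  # not deemed correlated, so nothing is dropped here
    if columns_performance is None:
        return [random.choice(list(cluster))]  # no metric given: keep a random column
    # Keep only the column with the best performance metric.
    return [max(cluster, key=lambda col: columns_performance[col])]


similarity = {
    "A": {"A": 1.0, "B": 0.9, "C": 0.8},
    "B": {"A": 0.9, "B": 1.0, "C": 0.85},
    "C": {"A": 0.8, "B": 0.85, "C": 1.0},
}
performance = {"A": 0.4, "B": 0.7, "C": 0.5}

# Median pairwise similarity is 0.85, so the cluster collapses only when threshold < 0.85.
kept_low_threshold = reduce_cluster(["A", "B", "C"], similarity, 0.75, performance)
kept_high_threshold = reduce_cluster(["A", "B", "C"], similarity, 0.95, performance)
```

With the lower threshold only `"B"`, the best performer, survives; with the higher threshold all three columns are retained.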
- fit(X: iguanas.utils.typing.pandas.core.frame.DataFrame, print_clustermap=False) -> None
Identifies the similar columns in the dataset X.
- Parameters
- X : PandasDataFrameType
Dataframe to be reduced.
- print_clustermap : bool, optional
If True, the clustermap at each iteration will be printed. Defaults to False.
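The `strategy` options described in the class parameters determine how many clusters each clustering iteration requests. A rough sketch of that choice is below; `n_clusters_for_iteration` is a hypothetical helper, and rounding odd column counts up is an assumption, since this page does not say how "half" is rounded.

```python
import math


def n_clusters_for_iteration(strategy: str, n_columns: int) -> int:
    """Hypothetical helper: number of clusters requested for one iteration."""
    if strategy == "top_down":
        # Two clusters per iteration, as described for 'top_down'.
        return 2
    if strategy == "bottom_up":
        # Half the number of columns; rounding up is an assumption.
        return max(1, math.ceil(n_columns / 2))
    raise ValueError("strategy must be 'top_down' or 'bottom_up'")


top_down_clusters = n_clusters_for_iteration("top_down", 10)
bottom_up_clusters = n_clusters_for_iteration("bottom_up", 10)
```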
- transform(X: iguanas.utils.typing.pandas.core.frame.DataFrame) -> iguanas.utils.typing.pandas.core.frame.DataFrame
Removes similar columns from the dataset X.
- Parameters
- X : PandasDataFrameType
Dataframe to be reduced.
- Returns
- PandasDataFrameType
Dataframe with the similar columns removed.
- fit_transform(X: iguanas.utils.typing.pandas.core.frame.DataFrame, print_clustermap=False) -> iguanas.utils.typing.pandas.core.frame.DataFrame
Calculates the similar columns in the dataset X, then removes them.
- Parameters
- X : PandasDataFrameType
Dataframe of binary columns.
- print_clustermap : bool, optional
If True, the clustermap at each iteration will be printed. Defaults to False.
- Returns
- PandasDataFrameType
Dataframe of dissimilar binary columns.
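The relationship between `fit`, `transform`, `fit_transform` and the `columns_to_keep` attribute can be illustrated with a small stand-in class. `KeepColumnsReducer` is hypothetical and performs no clustering; it only mimics the documented contract, on the assumption that `transform` amounts to selecting `columns_to_keep` from the dataframe it is given.

```python
import pandas as pd


class KeepColumnsReducer:
    """Hypothetical stand-in that mimics the fit/transform contract only."""

    def __init__(self, chosen_columns):
        self._chosen_columns = list(chosen_columns)

    def fit(self, X: pd.DataFrame) -> None:
        # A real reducer would cluster here; the stand-in records a fixed choice.
        self.columns_to_keep = [c for c in X.columns if c in self._chosen_columns]

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        # Assumption: transforming is a column selection using the fitted list.
        return X[self.columns_to_keep]

    def fit_transform(self, X: pd.DataFrame) -> pd.DataFrame:
        self.fit(X)
        return self.transform(X)


X_train = pd.DataFrame({"A": [1, 0, 1], "B": [1, 0, 1], "C": [0, 1, 0]})
X_new = pd.DataFrame({"A": [0, 1], "B": [0, 1], "C": [1, 1]})

reducer = KeepColumnsReducer(["A", "C"])
fit_result = reducer.fit(X_train)        # fit returns None and sets columns_to_keep
reduced_new = reducer.transform(X_new)   # the same columns are kept on unseen data
reduced_train = reducer.fit_transform(X_train)
```

The practical point is that the column selection is learned once in `fit` and can then be applied consistently to any dataframe with the same columns.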