AgglomerativeClusteringReducer Example

This notebook contains an example of how the AgglomerativeClusteringReducer class can be used to remove correlated features from a dataset. It can also be used to remove correlated rules from a rule set.

Requirements

To run, you’ll need the following:

  • A dataset or rule set (in the case of a rule set, you need to provide the binary columns of the rules as applied to a dataset)


Import packages

[1]:
from iguanas.correlation_reduction import AgglomerativeClusteringReducer
from iguanas.metrics.pairwise import CosineSimilarity

import pandas as pd
import numpy as np

Read in data

Let’s read in some dummy data.

[2]:
X_train = pd.read_csv(
    'dummy_data/X_train.csv',
    index_col='eid'
)
X_test = pd.read_csv(
    'dummy_data/X_test.csv',
    index_col='eid'
)

Find correlated features

Set up class parameters

Now we can set our class parameters. Here we’re using cosine similarity as the similarity metric (you can choose a different function from the metrics.pairwise module, or create your own; see the pairwise.ipynb example notebook for more information).

Please see the class docstring for more information on each parameter.
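
To make the similarity metric concrete, here is a minimal sketch of pairwise cosine similarity between the columns of a DataFrame, written with plain NumPy and pandas. The helper name column_cosine_similarity and the toy data are assumptions made for illustration; this is not the library's CosineSimilarity implementation.

```python
import numpy as np
import pandas as pd


def column_cosine_similarity(X: pd.DataFrame) -> pd.DataFrame:
    """Pairwise cosine similarity between the columns of X (illustrative helper)."""
    values = X.to_numpy(dtype=float)
    # Length (L2 norm) of each column
    norms = np.linalg.norm(values, axis=0)
    # Guard against all-zero columns so we never divide by zero
    norms[norms == 0] = 1.0
    unit = values / norms
    # Dot products of unit-length columns are the cosine similarities
    similarity = unit.T @ unit
    return pd.DataFrame(similarity, index=X.columns, columns=X.columns)


# Hypothetical toy data: 'a' and 'b' overlap heavily, 'c' is disjoint from both
toy = pd.DataFrame({
    'a': [1, 0, 1, 1],
    'b': [1, 0, 1, 0],
    'c': [0, 1, 0, 0],
})
toy_similarity = column_cosine_similarity(toy)
print(toy_similarity.round(3))
```

Columns whose similarity is at or above the chosen threshold are candidates for being treated as correlated.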

[3]:
cs = CosineSimilarity()
[4]:
params = {
    'threshold': 0.75,
    'strategy': 'bottom_up',
    'similarity_function': cs.fit,
    'columns_performance': None,
}

Instantiate class and run fit method

Once the parameters have been set, we can run the .fit() method to identify the columns that should be kept.

[5]:
agg = AgglomerativeClusteringReducer(**params)
agg.fit(X=X_train)

Outputs

The .fit() method does not return anything. However, it does create the following attribute:

  • columns_to_keep: The final list of columns with the correlated columns removed.

[6]:
agg.columns_to_keep
[6]:
['account_number_num_fraud_transactions_per_account_number_7day',
 'account_number_num_order_items_per_account_number_lifetime',
 'account_number_avg_order_total_per_account_number_30day',
 'account_number_num_distinct_transaction_per_account_number_7day',
 'is_existing_user_0',
 'status_Pending',
 'is_billing_shipping_city_same_0',
 'num_order_items_IsNull',
 'order_total_IsNull']
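
As a rough picture of what a threshold-based, bottom-up reduction does, the sketch below merges columns into clusters using complete linkage on cosine similarity, stops once no pair of clusters reaches the threshold, and keeps one column per cluster. The function name sketch_reduce_columns, the choice of complete linkage, and the rule of keeping the first column in each cluster are all assumptions for illustration; the class itself may cluster and select columns differently (for example, when columns_performance is provided).

```python
import numpy as np
import pandas as pd


def sketch_reduce_columns(X: pd.DataFrame, threshold: float) -> list:
    """Illustrative bottom-up clustering of columns; not the library's code."""
    values = X.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero columns
    unit = values / norms
    sim = unit.T @ unit  # pairwise cosine similarity between columns

    # Start with every column in its own cluster (stored as column positions)
    clusters = [[i] for i in range(X.shape[1])]
    while len(clusters) > 1:
        best_link, best_pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: a pair of clusters is only as similar
                # as its least similar pair of columns
                link = sim[np.ix_(clusters[i], clusters[j])].min()
                if link > best_link:
                    best_link, best_pair = link, (i, j)
        if best_link < threshold:
            break  # no remaining pair of clusters is similar enough to merge
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # Assumption: keep the first column of each cluster as its representative
    return [X.columns[cluster[0]] for cluster in clusters]


# Hypothetical toy data
toy = pd.DataFrame({
    'a': [1, 0, 1, 1],
    'b': [1, 0, 1, 0],
    'c': [0, 1, 0, 0],
    'd': [0, 1, 0, 1],
})
kept_strict = sketch_reduce_columns(toy, threshold=0.75)
kept_loose = sketch_reduce_columns(toy, threshold=0.70)
print(kept_strict, kept_loose)
```

In this toy example, lowering the threshold lets more columns merge into the same cluster, so fewer columns are kept.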

Transform the dataset (or another dataset)

Use the .transform() method to reduce the original dataset (or a separate dataset) by removing the correlated columns.

[7]:
X_train_reduced = agg.transform(X_train)
[8]:
X_train.shape, X_train_reduced.shape
[8]:
((8894, 32), (8894, 9))
[9]:
X_test_reduced = agg.transform(X_test)
[10]:
X_test.shape, X_test_reduced.shape
[10]:
((4382, 32), (4382, 9))

Outputs

The .transform() method returns the original dataset with the correlated columns removed.
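
Conceptually, this step amounts to selecting the kept columns from whichever dataset is passed in. The sketch below shows that idea with plain pandas; the column names and data are hypothetical, and the assumption that the reduction is a simple column selection is ours rather than something stated in the class documentation.

```python
import pandas as pd

# Hypothetical list of columns identified during fitting
columns_to_keep = ['a', 'c']

# Hypothetical dataset to be reduced
X_new = pd.DataFrame({
    'a': [1, 0],
    'b': [1, 1],
    'c': [0, 1],
})

# Keep only the selected columns; rows and index are left unchanged
X_new_reduced = X_new[columns_to_keep]
print(X_new.shape, X_new_reduced.shape)
```

Because only columns are dropped, the reduced dataset has the same number of rows and the same index as the input.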

[11]:
X_train_reduced.head()
[11]:
account_number_num_fraud_transactions_per_account_number_7day account_number_num_order_items_per_account_number_lifetime account_number_avg_order_total_per_account_number_30day account_number_num_distinct_transaction_per_account_number_7day is_existing_user_0 status_Pending is_billing_shipping_city_same_0 num_order_items_IsNull order_total_IsNull
eid
867-8837095-9305559 0 0 0.0 1 0 0 0 0 1
974-5306287-3527394 0 0 0.0 1 0 0 0 0 1
584-0112844-9158928 0 0 0.0 1 0 0 0 0 1
956-4190732-7014837 0 0 0.0 1 0 0 0 0 1
349-7005645-8862067 0 0 0.0 1 0 0 0 0 1
[12]:
X_test_reduced.head()
[12]:
account_number_num_fraud_transactions_per_account_number_7day account_number_num_order_items_per_account_number_lifetime account_number_avg_order_total_per_account_number_30day account_number_num_distinct_transaction_per_account_number_7day is_existing_user_0 status_Pending is_billing_shipping_city_same_0 num_order_items_IsNull order_total_IsNull
eid
975-8351797-7122581 0 2 29.00 1 1 0 0 0 0
785-6259585-7858053 0 0 0.00 1 0 0 0 0 1
057-4039373-1790681 0 2 192.95 1 0 0 0 0 0
095-5263240-3834186 0 0 0.00 1 0 0 0 0 1
980-3802574-0009480 0 2 9.00 1 0 0 0 0 0

Find correlated features and transform the dataset (in one step)

You can also use the .fit_transform() method to identify the columns that should be kept and remove the correlated columns from the same dataset in a single step.

[13]:
agg = AgglomerativeClusteringReducer(**params)
X_train_reduced = agg.fit_transform(X=X_train)
[14]:
X_train_reduced.head()
[14]:
account_number_num_fraud_transactions_per_account_number_7day account_number_num_order_items_per_account_number_lifetime account_number_avg_order_total_per_account_number_30day account_number_num_distinct_transaction_per_account_number_7day is_existing_user_0 status_Pending is_billing_shipping_city_same_0 num_order_items_IsNull order_total_IsNull
eid
867-8837095-9305559 0 0 0.0 1 0 0 0 0 1
974-5306287-3527394 0 0 0.0 1 0 0 0 0 1
584-0112844-9158928 0 0 0.0 1 0 0 0 0 1
956-4190732-7014837 0 0 0.0 1 0 0 0 0 1
349-7005645-8862067 0 0 0.0 1 0 0 0 0 1
[15]:
X_train.shape, X_train_reduced.shape
[15]:
((8894, 32), (8894, 9))