Pairwise Metrics Example

This notebook shows how to apply pairwise metrics to a dataset, and how to create your own metric, which can then be used in the correlation_reduction module.

Requirements

To run, you’ll need the following:

  • A dataset or rule set (in the case of a rule set, you need to provide the binary columns of the rules as applied to a dataset)


Import packages

[6]:
from iguanas.correlation_reduction import AgglomerativeClusteringReducer
from iguanas.metrics.pairwise import CosineSimilarity, JaccardSimilarity

import pandas as pd

Read in data

Let’s read in some dummy data.

[7]:
X_train = pd.read_csv(
    'dummy_data/X_train.csv',
    index_col='eid'
)
X_test = pd.read_csv(
    'dummy_data/X_test.csv',
    index_col='eid'
)

Apply pairwise functions

The example applies two pairwise functions:

  • Cosine Similarity

  • Jaccard Similarity

Instantiate class and run fit method

We can run the .fit() method to calculate the pairwise metrics for the dataset.

Cosine Similarity

[9]:
cs = CosineSimilarity()
cs_sim_X = cs.fit(X_train)

Jaccard Similarity

[10]:
js = JaccardSimilarity()
js_sim_X = js.fit(X_train)
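
For intuition, the two metrics can be sketched with plain NumPy on a toy dataframe. This is only an illustration of the underlying formulas, not the iguanas implementation: the helper names cosine_similarity_columns and jaccard_similarity_columns are hypothetical, and the Jaccard version assumes that any non-zero entry counts as 1.

```python
import numpy as np
import pandas as pd


def cosine_similarity_columns(X: pd.DataFrame) -> pd.DataFrame:
    """Pairwise cosine similarity between the columns of X (illustrative only)."""
    values = X.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=0)
    # Avoid dividing by zero for all-zero columns
    norms[norms == 0] = 1.0
    normalised = values / norms
    sim = normalised.T @ normalised
    return pd.DataFrame(sim, index=X.columns, columns=X.columns)


def jaccard_similarity_columns(X: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Jaccard similarity between the columns of X (illustrative only)."""
    # Assumption: treat every non-zero entry as 1
    values = X.to_numpy().astype(bool).astype(int)
    intersection = values.T @ values
    counts = values.sum(axis=0)
    union = counts[:, None] + counts[None, :] - intersection
    # Two all-zero columns have an empty union; similarity is set to 0 there
    sim = np.divide(
        intersection,
        union,
        out=np.zeros(union.shape, dtype=float),
        where=union != 0,
    )
    return pd.DataFrame(sim, index=X.columns, columns=X.columns)


toy = pd.DataFrame({'a': [1, 1, 0, 0], 'b': [1, 0, 0, 0], 'c': [0, 0, 1, 1]})
cos_toy = cosine_similarity_columns(toy)
jac_toy = jaccard_similarity_columns(toy)
```

Both helpers return a square dataframe labelled by the column names of the input, which is the same shape of output that the .fit() methods above produce.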

Outputs

The .fit() method returns the similarity matrix (stored as a dataframe) for the given pairwise function.

[11]:
cs_sim_X.head()
[11]:
account_number_num_fraud_transactions_per_account_number_7day account_number_num_fraud_transactions_per_account_number_1day account_number_num_fraud_transactions_per_account_number_30day account_number_num_fraud_transactions_per_account_number_90day account_number_num_order_items_per_account_number_lifetime num_order_items account_number_num_fraud_transactions_per_account_number_lifetime account_number_avg_order_total_per_account_number_30day account_number_avg_order_total_per_account_number_1day account_number_sum_order_total_per_account_number_1day ... account_number_num_distinct_transaction_per_account_number_30day account_number_num_distinct_transaction_per_account_number_1day is_existing_user_1 is_existing_user_0 status_New status_Pending is_billing_shipping_city_same_1 is_billing_shipping_city_same_0 num_order_items_IsNull order_total_IsNull
account_number_num_fraud_transactions_per_account_number_7day 1.000000 0.972840 0.979841 0.965618 0.251396 0.136544 0.965618 0.311291 0.308410 0.519526 ... 0.243355 0.245599 0.070108 0.195356 0.138927 0.000000 0.096764 0.161886 0.0 0.054647
account_number_num_fraud_transactions_per_account_number_1day 0.972840 1.000000 0.950742 0.938272 0.245633 0.136763 0.938272 0.315479 0.312326 0.506973 ... 0.236701 0.239188 0.064943 0.210653 0.139945 0.000000 0.096504 0.166329 0.0 0.054998
account_number_num_fraud_transactions_per_account_number_30day 0.979841 0.950742 1.000000 0.988812 0.248744 0.134339 0.988812 0.298962 0.295767 0.495504 ... 0.250815 0.242856 0.075526 0.183377 0.139403 0.000000 0.100213 0.151960 0.0 0.058136
account_number_num_fraud_transactions_per_account_number_90day 0.965618 0.938272 0.988812 1.000000 0.247159 0.131221 1.000000 0.294624 0.287911 0.482217 ... 0.254390 0.240396 0.077777 0.178410 0.139604 0.000000 0.101647 0.147843 0.0 0.061552
account_number_num_order_items_per_account_number_lifetime 0.251396 0.245633 0.248744 0.247159 1.000000 0.949821 0.247159 0.623486 0.619616 0.628797 ... 0.701810 0.680413 0.603815 0.295078 0.670448 0.014426 0.634955 0.217453 0.0 0.025678

5 rows × 32 columns

[12]:
js_sim_X.head()
[12]:
account_number_num_fraud_transactions_per_account_number_7day account_number_num_fraud_transactions_per_account_number_1day account_number_num_fraud_transactions_per_account_number_30day account_number_num_fraud_transactions_per_account_number_90day account_number_num_order_items_per_account_number_lifetime num_order_items account_number_num_fraud_transactions_per_account_number_lifetime account_number_avg_order_total_per_account_number_30day account_number_avg_order_total_per_account_number_1day account_number_sum_order_total_per_account_number_1day ... account_number_num_distinct_transaction_per_account_number_30day account_number_num_distinct_transaction_per_account_number_1day is_existing_user_1 is_existing_user_0 status_New status_Pending is_billing_shipping_city_same_1 is_billing_shipping_city_same_0 num_order_items_IsNull order_total_IsNull
account_number_num_fraud_transactions_per_account_number_7day 1.000000 1.000000 0.980000 0.968379 0.040685 0.039399 0.968379 0.040676 0.040580 0.040580 ... 0.027547 0.027547 0.009631 0.125740 0.027556 0.000000 0.018147 0.108820 0.0 0.015167
account_number_num_fraud_transactions_per_account_number_1day 1.000000 1.000000 0.980000 0.968379 0.040685 0.039399 0.968379 0.040676 0.040580 0.040580 ... 0.027547 0.027547 0.009631 0.125740 0.027556 0.000000 0.018147 0.108820 0.0 0.015167
account_number_num_fraud_transactions_per_account_number_30day 0.980000 0.980000 1.000000 0.988142 0.040871 0.039590 0.988142 0.040862 0.040770 0.040770 ... 0.028109 0.028109 0.010274 0.125276 0.028118 0.000000 0.018752 0.108200 0.0 0.016007
account_number_num_fraud_transactions_per_account_number_90day 0.968379 0.968379 0.988142 1.000000 0.041075 0.039563 1.000000 0.041065 0.040742 0.040742 ... 0.028446 0.028446 0.010659 0.125000 0.028456 0.000000 0.019114 0.107832 0.0 0.016639
account_number_num_order_items_per_account_number_lifetime 0.040685 0.040685 0.040871 0.041075 1.000000 0.973625 0.041075 0.999775 0.979040 0.979040 ... 0.498763 0.498763 0.440196 0.151582 0.498594 0.000451 0.470033 0.085191 0.0 0.013156

5 rows × 32 columns

The .fit method can be passed to the AgglomerativeClusteringReducer class as the similarity_function parameter. The reducer then uses it to calculate the similarity matrix for the dataset, which in turn is used to remove similar columns.

You can also create your own class for the pairwise metric you would like to apply. The .fit method of this class can be passed to the AgglomerativeClusteringReducer class in the same way.


Creating your own similarity function

Say we want to create a class which calculates the absolute Spearman’s correlation (which lies between 0 and 1) for a dataset. This class can then be used in the AgglomerativeClusteringReducer class to remove similar columns based on this metric.

The class needs a .fit() method which takes one argument: the dataset on which the similarity metric should be applied. This method should return the calculated similarity matrix as a dataframe.

[13]:
from scipy.stats import spearmanr

class SpearmansCorrelation:

    def fit(self, X: pd.DataFrame) -> pd.DataFrame:
        # Calculate pairwise Spearman's Correlation on the dataset passed in
        spearman_sim_matrix = abs(spearmanr(X)[0])
        # Return dataframe of the Spearman's Correlation matrix
        return pd.DataFrame(spearman_sim_matrix, index=X.columns, columns=X.columns)

We can then apply the .fit() method to the dataset to check it works:

[14]:
sc = SpearmansCorrelation()
sc_sim_X = sc.fit(X_train)
[15]:
sc_sim_X.head()
[15]:
account_number_num_fraud_transactions_per_account_number_7day account_number_num_fraud_transactions_per_account_number_1day account_number_num_fraud_transactions_per_account_number_30day account_number_num_fraud_transactions_per_account_number_90day account_number_num_order_items_per_account_number_lifetime num_order_items account_number_num_fraud_transactions_per_account_number_lifetime account_number_avg_order_total_per_account_number_30day account_number_avg_order_total_per_account_number_1day account_number_sum_order_total_per_account_number_1day ... account_number_num_distinct_transaction_per_account_number_30day account_number_num_distinct_transaction_per_account_number_1day is_existing_user_1 is_existing_user_0 status_New status_Pending is_billing_shipping_city_same_1 is_billing_shipping_city_same_0 num_order_items_IsNull order_total_IsNull
account_number_num_fraud_transactions_per_account_number_7day 1.000000 0.999981 0.989662 0.983680 0.092939 0.063027 0.983680 0.160873 0.155962 0.159773 ... 0.125170 0.199662 0.262574 0.262574 0.003091 0.003091 0.188288 0.188288 0.001785 0.074294
account_number_num_fraud_transactions_per_account_number_1day 0.999981 1.000000 0.989644 0.983662 0.092792 0.063026 0.983662 0.160827 0.155904 0.159683 ... 0.124436 0.198276 0.263065 0.263065 0.003091 0.003091 0.188322 0.188322 0.001785 0.074302
account_number_num_fraud_transactions_per_account_number_30day 0.989662 0.989644 1.000000 0.993950 0.090782 0.061059 0.993950 0.157429 0.152548 0.156320 ... 0.136799 0.197178 0.258281 0.258281 0.003124 0.003124 0.185198 0.185198 0.001803 0.071591
account_number_num_fraud_transactions_per_account_number_90day 0.983680 0.983662 0.993950 1.000000 0.089508 0.058752 1.000000 0.155894 0.149761 0.153511 ... 0.143383 0.195724 0.255885 0.255885 0.003143 0.003143 0.183472 0.183472 0.001814 0.069163
account_number_num_order_items_per_account_number_lifetime 0.092939 0.092792 0.090782 0.089508 1.000000 0.963531 0.089508 0.906594 0.891879 0.894135 ... 0.176164 0.086075 0.079853 0.079853 0.003643 0.003643 0.036935 0.036935 0.010185 0.937286

5 rows × 32 columns
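
Beyond inspecting the first few rows, it can help to check the output programmatically. The sketch below uses a hypothetical helper, check_similarity_matrix, and a hypothetical stand-in metric, AbsPearsonCorrelation, so that it runs without the dummy data. The conditions it checks (square, labelled by the dataset's columns, symmetric) are assumptions about what a well-formed similarity matrix looks like, not documented requirements of the library.

```python
import numpy as np
import pandas as pd


def check_similarity_matrix(sim: pd.DataFrame, X: pd.DataFrame) -> None:
    """Raise AssertionError if sim is not a labelled, square, symmetric matrix for X."""
    assert isinstance(sim, pd.DataFrame), 'expected a dataframe'
    n_cols = X.shape[1]
    assert sim.shape == (n_cols, n_cols), 'expected one row and one column per feature'
    assert list(sim.index) == list(X.columns), 'index should match the columns of X'
    assert list(sim.columns) == list(X.columns), 'columns should match the columns of X'
    assert np.allclose(sim.to_numpy(), sim.to_numpy().T), 'matrix should be symmetric'


class AbsPearsonCorrelation:
    """Hypothetical stand-in metric with the same .fit() structure as above."""

    def fit(self, X: pd.DataFrame) -> pd.DataFrame:
        # Absolute Pearson correlation between every pair of columns
        return X.corr().abs()


toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 9], 'c': [4, 3, 2, 1]})
sim_toy = AbsPearsonCorrelation().fit(toy)
check_similarity_matrix(sim_toy, toy)  # passes silently
```

The same helper could be called on sc_sim_X and X_train to confirm that the custom class returns a matrix of the expected form.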

Finally, after instantiating the class, we can pass its .fit method to the similarity_function parameter of the AgglomerativeClusteringReducer class to use Spearman’s Correlation as the similarity metric when removing similar columns from a dataset. Note that we pass the method itself (sc.fit) rather than calling it (sc.fit(X_train)).
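
The difference between passing a method and calling it can be seen in a small, self-contained sketch. The function run_with_similarity and the class AbsPearsonCorrelation are hypothetical and exist only for illustration; they are not part of iguanas.

```python
import pandas as pd


class AbsPearsonCorrelation:
    """Hypothetical metric class with a .fit() method."""

    def fit(self, X: pd.DataFrame) -> pd.DataFrame:
        return X.corr().abs()


def run_with_similarity(similarity_function, X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical consumer: expects a callable and applies it to X itself."""
    if not callable(similarity_function):
        raise TypeError('similarity_function must be callable, e.g. metric.fit')
    return similarity_function(X)


toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 9], 'c': [4, 3, 2, 1]})
metric = AbsPearsonCorrelation()

# Passing the bound method: the consumer decides when to call it
sim = run_with_similarity(metric.fit, toy)

# Calling the method first passes a dataframe, which is not callable
try:
    run_with_similarity(metric.fit(toy), toy)
    error_message = None
except TypeError as err:
    error_message = str(err)
```
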

[16]:
params = {
    'threshold': 0.75,
    'strategy': 'bottom_up',
    'similarity_function': sc.fit,
    'columns_performance': None,
}
[17]:
fr = AgglomerativeClusteringReducer(**params)
[18]:
X_train_reduced = fr.fit_transform(X_train)
[19]:
X_train_reduced.head()
[19]:
account_number_num_fraud_transactions_per_account_number_7day account_number_num_order_items_per_account_number_lifetime account_number_num_distinct_transaction_per_account_number_7day order_total account_number_num_distinct_transaction_per_account_number_90day account_number_num_distinct_transaction_per_account_number_1day is_existing_user_1 status_New is_billing_shipping_city_same_1 num_order_items_IsNull
eid
867-8837095-9305559 0 0 1 160.202697 1 1 1 1 1 0
974-5306287-3527394 0 0 1 160.202697 1 1 1 1 1 0
584-0112844-9158928 0 0 1 160.202697 1 1 1 1 1 0
956-4190732-7014837 0 0 1 160.202697 1 1 1 1 1 0
349-7005645-8862067 0 0 1 160.202697 1 1 1 1 1 0
[20]:
X_train.shape, X_train_reduced.shape
[20]:
((8894, 32), (8894, 10))
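
To get a feel for how a similarity matrix and a threshold can translate into dropped columns, here is a deliberately simplified sketch. It uses a greedy filter, not agglomerative clustering, so it is not how AgglomerativeClusteringReducer works internally; the function greedy_drop_similar is hypothetical.

```python
import pandas as pd


def greedy_drop_similar(sim: pd.DataFrame, threshold: float) -> list:
    """Keep a column only if its similarity to every kept column is below threshold."""
    kept = []
    for col in sim.columns:
        if all(sim.loc[col, other] < threshold for other in kept):
            kept.append(col)
    return kept


toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],  # almost perfectly correlated with 'a'
    'c': [5, 1, 4, 2, 3],   # only weakly correlated with 'a'
})
sim_toy = toy.corr().abs()
kept_columns = greedy_drop_similar(sim_toy, threshold=0.75)
toy_reduced = toy[kept_columns]
```

Here 'b' is dropped because its similarity to 'a' exceeds the threshold, while 'c' is kept.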