Pairwise Metrics Example
This notebook shows how to apply pairwise metrics to a dataset and how to create your own metric, which can then be used in the correlation_reduction module.
Requirements
To run, you’ll need the following:
A dataset or rule set (in the case of a rule set, you need to provide the binary columns of the rules as applied to a dataset)
Import packages
[6]:
from iguanas.correlation_reduction import AgglomerativeClusteringReducer
from iguanas.metrics.pairwise import CosineSimilarity, JaccardSimilarity
import pandas as pd
Read in data
Let’s read in some dummy data.
[7]:
X_train = pd.read_csv(
'dummy_data/X_train.csv',
index_col='eid'
)
X_test = pd.read_csv(
'dummy_data/X_test.csv',
index_col='eid'
)
Apply pairwise functions
The example applies two pairwise functions:
Cosine Similarity
Jaccard Similarity
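To make the two metrics concrete, here is a minimal sketch of what pairwise cosine and Jaccard similarity between columns could look like using only numpy and pandas. The function names `cosine_similarity_columns` and `jaccard_similarity_columns` are hypothetical, and treating any non-zero value as True for the Jaccard case is an assumption; this is not the iguanas implementation.

```python
import numpy as np
import pandas as pd


def cosine_similarity_columns(X: pd.DataFrame) -> pd.DataFrame:
    """Cosine similarity between every pair of columns of X (hypothetical helper)."""
    values = X.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=0)
    # Avoid division by zero: all-zero columns keep a similarity of 0 everywhere
    norms[norms == 0] = 1.0
    normalised = values / norms
    sim = normalised.T @ normalised
    return pd.DataFrame(sim, index=X.columns, columns=X.columns)


def jaccard_similarity_columns(X: pd.DataFrame) -> pd.DataFrame:
    """Jaccard similarity between every pair of columns (hypothetical helper).

    Assumption: a value counts as True when it is non-zero.
    """
    flags = (X.to_numpy() != 0).astype(float)
    intersection = flags.T @ flags
    counts = flags.sum(axis=0)
    union = counts[:, None] + counts[None, :] - intersection
    # Where the union is empty (both columns all-zero), define the similarity as 0
    sim = np.divide(intersection, union, out=np.zeros_like(intersection), where=union > 0)
    return pd.DataFrame(sim, index=X.columns, columns=X.columns)


# Toy binary rule columns, invented for illustration
toy = pd.DataFrame({
    'rule_a': [1, 1, 0, 0],
    'rule_b': [1, 0, 0, 0],
    'rule_c': [0, 0, 1, 1],
})
cos_toy = cosine_similarity_columns(toy)
jac_toy = jaccard_similarity_columns(toy)
```

In the toy data, rule_a and rule_b share one of the two rows flagged by either rule, while rule_a and rule_c never fire on the same row.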
Instantiate class and run fit method
We can run the .fit() method to calculate the pairwise metrics for the dataset.
Cosine Similarity
[9]:
cs = CosineSimilarity()
cs_sim_X = cs.fit(X_train)
Jaccard Similarity
[10]:
js = JaccardSimilarity()
js_sim_X = js.fit(X_train)
Outputs
The .fit() method returns the similarity matrix (stored as a dataframe) for the given pairwise function.
[11]:
cs_sim_X.head()
[11]:
| | account_number_num_fraud_transactions_per_account_number_7day | account_number_num_fraud_transactions_per_account_number_1day | account_number_num_fraud_transactions_per_account_number_30day | account_number_num_fraud_transactions_per_account_number_90day | account_number_num_order_items_per_account_number_lifetime | num_order_items | account_number_num_fraud_transactions_per_account_number_lifetime | account_number_avg_order_total_per_account_number_30day | account_number_avg_order_total_per_account_number_1day | account_number_sum_order_total_per_account_number_1day | ... | account_number_num_distinct_transaction_per_account_number_30day | account_number_num_distinct_transaction_per_account_number_1day | is_existing_user_1 | is_existing_user_0 | status_New | status_Pending | is_billing_shipping_city_same_1 | is_billing_shipping_city_same_0 | num_order_items_IsNull | order_total_IsNull |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
account_number_num_fraud_transactions_per_account_number_7day | 1.000000 | 0.972840 | 0.979841 | 0.965618 | 0.251396 | 0.136544 | 0.965618 | 0.311291 | 0.308410 | 0.519526 | ... | 0.243355 | 0.245599 | 0.070108 | 0.195356 | 0.138927 | 0.000000 | 0.096764 | 0.161886 | 0.0 | 0.054647 |
account_number_num_fraud_transactions_per_account_number_1day | 0.972840 | 1.000000 | 0.950742 | 0.938272 | 0.245633 | 0.136763 | 0.938272 | 0.315479 | 0.312326 | 0.506973 | ... | 0.236701 | 0.239188 | 0.064943 | 0.210653 | 0.139945 | 0.000000 | 0.096504 | 0.166329 | 0.0 | 0.054998 |
account_number_num_fraud_transactions_per_account_number_30day | 0.979841 | 0.950742 | 1.000000 | 0.988812 | 0.248744 | 0.134339 | 0.988812 | 0.298962 | 0.295767 | 0.495504 | ... | 0.250815 | 0.242856 | 0.075526 | 0.183377 | 0.139403 | 0.000000 | 0.100213 | 0.151960 | 0.0 | 0.058136 |
account_number_num_fraud_transactions_per_account_number_90day | 0.965618 | 0.938272 | 0.988812 | 1.000000 | 0.247159 | 0.131221 | 1.000000 | 0.294624 | 0.287911 | 0.482217 | ... | 0.254390 | 0.240396 | 0.077777 | 0.178410 | 0.139604 | 0.000000 | 0.101647 | 0.147843 | 0.0 | 0.061552 |
account_number_num_order_items_per_account_number_lifetime | 0.251396 | 0.245633 | 0.248744 | 0.247159 | 1.000000 | 0.949821 | 0.247159 | 0.623486 | 0.619616 | 0.628797 | ... | 0.701810 | 0.680413 | 0.603815 | 0.295078 | 0.670448 | 0.014426 | 0.634955 | 0.217453 | 0.0 | 0.025678 |
5 rows × 32 columns
[12]:
js_sim_X.head()
[12]:
| | account_number_num_fraud_transactions_per_account_number_7day | account_number_num_fraud_transactions_per_account_number_1day | account_number_num_fraud_transactions_per_account_number_30day | account_number_num_fraud_transactions_per_account_number_90day | account_number_num_order_items_per_account_number_lifetime | num_order_items | account_number_num_fraud_transactions_per_account_number_lifetime | account_number_avg_order_total_per_account_number_30day | account_number_avg_order_total_per_account_number_1day | account_number_sum_order_total_per_account_number_1day | ... | account_number_num_distinct_transaction_per_account_number_30day | account_number_num_distinct_transaction_per_account_number_1day | is_existing_user_1 | is_existing_user_0 | status_New | status_Pending | is_billing_shipping_city_same_1 | is_billing_shipping_city_same_0 | num_order_items_IsNull | order_total_IsNull |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
account_number_num_fraud_transactions_per_account_number_7day | 1.000000 | 1.000000 | 0.980000 | 0.968379 | 0.040685 | 0.039399 | 0.968379 | 0.040676 | 0.040580 | 0.040580 | ... | 0.027547 | 0.027547 | 0.009631 | 0.125740 | 0.027556 | 0.000000 | 0.018147 | 0.108820 | 0.0 | 0.015167 |
account_number_num_fraud_transactions_per_account_number_1day | 1.000000 | 1.000000 | 0.980000 | 0.968379 | 0.040685 | 0.039399 | 0.968379 | 0.040676 | 0.040580 | 0.040580 | ... | 0.027547 | 0.027547 | 0.009631 | 0.125740 | 0.027556 | 0.000000 | 0.018147 | 0.108820 | 0.0 | 0.015167 |
account_number_num_fraud_transactions_per_account_number_30day | 0.980000 | 0.980000 | 1.000000 | 0.988142 | 0.040871 | 0.039590 | 0.988142 | 0.040862 | 0.040770 | 0.040770 | ... | 0.028109 | 0.028109 | 0.010274 | 0.125276 | 0.028118 | 0.000000 | 0.018752 | 0.108200 | 0.0 | 0.016007 |
account_number_num_fraud_transactions_per_account_number_90day | 0.968379 | 0.968379 | 0.988142 | 1.000000 | 0.041075 | 0.039563 | 1.000000 | 0.041065 | 0.040742 | 0.040742 | ... | 0.028446 | 0.028446 | 0.010659 | 0.125000 | 0.028456 | 0.000000 | 0.019114 | 0.107832 | 0.0 | 0.016639 |
account_number_num_order_items_per_account_number_lifetime | 0.040685 | 0.040685 | 0.040871 | 0.041075 | 1.000000 | 0.973625 | 0.041075 | 0.999775 | 0.979040 | 0.979040 | ... | 0.498763 | 0.498763 | 0.440196 | 0.151582 | 0.498594 | 0.000451 | 0.470033 | 0.085191 | 0.0 | 0.013156 |
5 rows × 32 columns
The .fit method can be passed to the AgglomerativeClusteringReducer class as the similarity_function parameter. The reducer then uses this method to calculate the similarity matrix for the dataset, which in turn is used to remove similar columns.
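To illustrate how a similarity matrix can drive column removal, here is a deliberately simplified sketch. It swaps the agglomerative clustering for a greedy threshold filter: a column is kept only if its similarity to every already-kept column is below the threshold. The names `drop_similar_columns` and `AbsPearson` are hypothetical and the toy data is invented; the actual reducer in iguanas works differently.

```python
import pandas as pd


class AbsPearson:
    """Hypothetical metric class following the .fit(X) -> DataFrame convention."""

    def fit(self, X: pd.DataFrame) -> pd.DataFrame:
        # Absolute Pearson correlation between every pair of columns
        return X.corr().abs()


def drop_similar_columns(X: pd.DataFrame, similarity_function, threshold: float) -> pd.DataFrame:
    """Greedy stand-in for clustering-based reduction (not the iguanas algorithm)."""
    # The function is called here, by the consumer, not by the caller
    sim = similarity_function(X)
    kept = []
    for col in X.columns:
        # Keep the column only if it is not too similar to any column kept so far
        if all(sim.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return X[kept]


toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'a_copy': [2, 4, 6, 8, 10],  # perfectly correlated with 'a'
    'b': [5, 1, 4, 2, 3],
})
reduced = drop_similar_columns(toy, AbsPearson().fit, threshold=0.75)
```

Here `a_copy` is dropped because its similarity to `a` exceeds the threshold, while `b` is kept.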
You can also create your own class for the pairwise metric you would like to apply. The .fit method of this class can then be passed to the AgglomerativeClusteringReducer class as the similarity_function parameter, where it will be used to calculate the pairwise similarity of the dataset.
Creating your own similarity function
Say we want to create a class which calculates the absolute value of Spearman’s correlation (which lies between 0 and 1) for a dataset. This class can then be used in the AgglomerativeClusteringReducer class to remove similar columns based on this metric.
The class needs a .fit() method which takes one argument: the dataset on which the similarity metric should be applied. This method should return the calculated similarity matrix as a dataframe, with the dataset's column names as both the index and the columns.
[13]:
from scipy.stats import spearmanr


class SpearmansCorrelation:
    def fit(self, X: pd.DataFrame) -> pd.DataFrame:
        # Calculate the absolute pairwise Spearman's correlation between the columns of X
        spearman_sim_matrix = abs(spearmanr(X)[0])
        # Return a dataframe of the correlation matrix, labelled with the columns of X
        return pd.DataFrame(spearman_sim_matrix, index=X.columns, columns=X.columns)
We can then apply the .fit() method to the dataset to check it works:
[14]:
sc = SpearmansCorrelation()
sc_sim_X = sc.fit(X_train)
[15]:
sc_sim_X.head()
[15]:
| | account_number_num_fraud_transactions_per_account_number_7day | account_number_num_fraud_transactions_per_account_number_1day | account_number_num_fraud_transactions_per_account_number_30day | account_number_num_fraud_transactions_per_account_number_90day | account_number_num_order_items_per_account_number_lifetime | num_order_items | account_number_num_fraud_transactions_per_account_number_lifetime | account_number_avg_order_total_per_account_number_30day | account_number_avg_order_total_per_account_number_1day | account_number_sum_order_total_per_account_number_1day | ... | account_number_num_distinct_transaction_per_account_number_30day | account_number_num_distinct_transaction_per_account_number_1day | is_existing_user_1 | is_existing_user_0 | status_New | status_Pending | is_billing_shipping_city_same_1 | is_billing_shipping_city_same_0 | num_order_items_IsNull | order_total_IsNull |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
account_number_num_fraud_transactions_per_account_number_7day | 1.000000 | 0.999981 | 0.989662 | 0.983680 | 0.092939 | 0.063027 | 0.983680 | 0.160873 | 0.155962 | 0.159773 | ... | 0.125170 | 0.199662 | 0.262574 | 0.262574 | 0.003091 | 0.003091 | 0.188288 | 0.188288 | 0.001785 | 0.074294 |
account_number_num_fraud_transactions_per_account_number_1day | 0.999981 | 1.000000 | 0.989644 | 0.983662 | 0.092792 | 0.063026 | 0.983662 | 0.160827 | 0.155904 | 0.159683 | ... | 0.124436 | 0.198276 | 0.263065 | 0.263065 | 0.003091 | 0.003091 | 0.188322 | 0.188322 | 0.001785 | 0.074302 |
account_number_num_fraud_transactions_per_account_number_30day | 0.989662 | 0.989644 | 1.000000 | 0.993950 | 0.090782 | 0.061059 | 0.993950 | 0.157429 | 0.152548 | 0.156320 | ... | 0.136799 | 0.197178 | 0.258281 | 0.258281 | 0.003124 | 0.003124 | 0.185198 | 0.185198 | 0.001803 | 0.071591 |
account_number_num_fraud_transactions_per_account_number_90day | 0.983680 | 0.983662 | 0.993950 | 1.000000 | 0.089508 | 0.058752 | 1.000000 | 0.155894 | 0.149761 | 0.153511 | ... | 0.143383 | 0.195724 | 0.255885 | 0.255885 | 0.003143 | 0.003143 | 0.183472 | 0.183472 | 0.001814 | 0.069163 |
account_number_num_order_items_per_account_number_lifetime | 0.092939 | 0.092792 | 0.090782 | 0.089508 | 1.000000 | 0.963531 | 0.089508 | 0.906594 | 0.891879 | 0.894135 | ... | 0.176164 | 0.086075 | 0.079853 | 0.079853 | 0.003643 | 0.003643 | 0.036935 | 0.036935 | 0.010185 | 0.937286 |
5 rows × 32 columns
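Beyond inspecting the head of the matrix, a custom metric can be sanity-checked programmatically. The sketch below uses a hypothetical helper, `check_similarity_matrix`, to test properties a similarity matrix would plausibly need: labels matching the dataset's columns, no NaN values, symmetry, and values in [0, 1]. Which of these the reducer strictly requires is an assumption. The toy matrix is built as the Pearson correlation of ranks, which is equivalent to Spearman's correlation.

```python
import numpy as np
import pandas as pd


def check_similarity_matrix(sim: pd.DataFrame, X: pd.DataFrame, tol: float = 1e-8) -> list:
    """Return a list of problems found in a similarity matrix (hypothetical helper)."""
    problems = []
    if list(sim.index) != list(X.columns) or list(sim.columns) != list(X.columns):
        problems.append('index and columns must both match the columns of X')
    values = sim.to_numpy(dtype=float)
    if np.isnan(values).any():
        # Correlation is undefined for constant columns, which yields NaN
        problems.append('matrix contains NaN (for example from a constant column)')
    if values.shape[0] == values.shape[1] and not np.allclose(
        values, values.T, atol=tol, equal_nan=True
    ):
        problems.append('matrix is not symmetric')
    if np.nanmin(values) < -tol or np.nanmax(values) > 1 + tol:
        problems.append('values fall outside [0, 1]')
    return problems


# Toy data, invented for illustration
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 4, 9, 16], 'c': [4, 1, 3, 2]})
# Spearman's correlation is the Pearson correlation of the ranks
toy_sim = toy.rank().corr().abs()
problems = check_similarity_matrix(toy_sim, toy)
```

An empty list means no problem was detected. Columns `a` and `b` are monotonically related, so their absolute Spearman's correlation is 1.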
Finally, after instantiating the class, we can pass the .fit method to the similarity_function parameter of the AgglomerativeClusteringReducer class to use Spearman’s correlation as the similarity metric when removing similar columns from a dataset. Note that we pass the method itself (sc.fit) rather than calling it (sc.fit(X_train)).
[16]:
params = {
'threshold': 0.75,
'strategy': 'bottom_up',
'similarity_function': sc.fit,
'columns_performance': None,
}
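The difference between passing a method and calling it can be seen in a small, self-contained example. The class `AbsPearson` and the function `run_later` are hypothetical stand-ins: `run_later` plays the role of any consumer that invokes the similarity function itself once it has the data.

```python
import pandas as pd


class AbsPearson:
    """Hypothetical metric class following the .fit(X) -> DataFrame convention."""

    def fit(self, X: pd.DataFrame) -> pd.DataFrame:
        return X.corr().abs()


def run_later(similarity_function, X: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for a consumer that calls the function when it needs the matrix."""
    return similarity_function(X)


# Toy data, invented for illustration
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 1, 4, 3]})
metric = AbsPearson()

passed = metric.fit       # a bound method: callable, nothing computed yet
called = metric.fit(toy)  # a dataframe: already computed for this one dataset

# The consumer can apply the passed method to whatever data it receives
result = run_later(passed, toy)
```

Passing `called` instead would hand the consumer a dataframe where it expects a callable, so it could not recompute the matrix for the data it is given.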
[17]:
fr = AgglomerativeClusteringReducer(**params)
[18]:
X_train_reduced = fr.fit_transform(X_train)
[19]:
X_train_reduced.head()
[19]:
eid | account_number_num_fraud_transactions_per_account_number_7day | account_number_num_order_items_per_account_number_lifetime | account_number_num_distinct_transaction_per_account_number_7day | order_total | account_number_num_distinct_transaction_per_account_number_90day | account_number_num_distinct_transaction_per_account_number_1day | is_existing_user_1 | status_New | is_billing_shipping_city_same_1 | num_order_items_IsNull |
---|---|---|---|---|---|---|---|---|---|---|
867-8837095-9305559 | 0 | 0 | 1 | 160.202697 | 1 | 1 | 1 | 1 | 1 | 0 |
974-5306287-3527394 | 0 | 0 | 1 | 160.202697 | 1 | 1 | 1 | 1 | 1 | 0 |
584-0112844-9158928 | 0 | 0 | 1 | 160.202697 | 1 | 1 | 1 | 1 | 1 | 0 |
956-4190732-7014837 | 0 | 0 | 1 | 160.202697 | 1 | 1 | 1 | 1 | 1 | 0 |
349-7005645-8862067 | 0 | 0 | 1 | 160.202697 | 1 | 1 | 1 | 1 | 1 | 0 |
[20]:
X_train.shape, X_train_reduced.shape
[20]:
((8894, 32), (8894, 10))
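Once a reduced training set is available, it is often useful to list the dropped columns and apply the same selection to a held-out set. The sketch below does this with plain pandas on invented toy frames; `train_reduced` simply stands in for the output of a reduction step, and the reducer may offer its own way of transforming new data.

```python
import pandas as pd

# Toy frames, invented for illustration
train = pd.DataFrame({'a': [1, 2, 3], 'a_copy': [1, 2, 3], 'b': [3, 1, 2]})
test = pd.DataFrame({'a': [4, 5], 'a_copy': [4, 5], 'b': [0, 1]})

# Stand-in for the output of a reduction step fitted on the training set
train_reduced = train[['a', 'b']]

# Columns removed by the reduction, in their original order
dropped = [col for col in train.columns if col not in train_reduced.columns]

# Select the same surviving columns from the held-out set
test_reduced = test[train_reduced.columns]
```

Selecting by the training set's surviving columns keeps both sets aligned, regardless of which columns would have been judged similar on the test data alone.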