CorrelatedFilter Example

This notebook shows how to use the CorrelatedFilter class to keep only those rules which are uncorrelated with one another.

Requirements

To run, you’ll need the following:

  • A rule set (specifically the binary columns of the rules as applied to a dataset).
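As a rough illustration of what "binary columns of the rules as applied to a dataset" means, the sketch below applies two made-up rule conditions to three made-up records. The record fields, rule names and thresholds are all hypothetical and are not part of the dummy data used later; plain dictionaries stand in for a dataframe.

```python
# Hypothetical records keyed by an id, for illustration only.
records = {
    0: {"amount": 250, "num_logins": 1},
    1: {"amount": 40, "num_logins": 7},
    2: {"amount": 900, "num_logins": 12},
}

# Hypothetical rule conditions (not taken from the dummy data).
rules = {
    "HighAmount": lambda r: r["amount"] > 200,
    "ManyLogins": lambda r: r["num_logins"] > 5,
}

# One binary column per rule: 1 where the rule triggers, 0 otherwise.
x_rules = {
    name: [int(cond(rec)) for rec in records.values()]
    for name, cond in rules.items()
}
print(x_rules)
```

Each key plays the role of a column in X_rules, and each list holds that rule's flag for every record.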


Import packages

[20]:
from iguanas.rule_selection import CorrelatedFilter
from iguanas.correlation_reduction import AgglomerativeClusteringReducer
from iguanas.metrics.pairwise import JaccardSimilarity

import pandas as pd

Read in data

Let’s read in some dummy rules (stored as binary columns) and their corresponding performance metric dataframes:

[21]:
X_rules_train = pd.read_csv(
    'dummy_data/X_rules_train.csv',
    index_col='eid'
)
X_rules_test = pd.read_csv(
    'dummy_data/X_rules_test.csv',
    index_col='eid'
)
rule_descriptions_train = pd.read_csv(
    'dummy_data/rule_descriptions_train.csv',
    index_col='Rule'
)
rule_descriptions_test = pd.read_csv(
    'dummy_data/rule_descriptions_test.csv',
    index_col='Rule'
)

Calculate uncorrelated rules

Firstly, we need to instantiate the class which will perform the correlation reduction. See the correlation_reduction module for more information. We’ll be using the AgglomerativeClusteringReducer class from that module.

To instantiate the AgglomerativeClusteringReducer class, we need to first choose our similarity function. See the metrics.pairwise module for more information. In this example, we’ll use the Jaccard similarity:

[22]:
js = JaccardSimilarity()
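For intuition, the Jaccard similarity of two binary rule columns is the number of records flagged by both rules divided by the number flagged by at least one of them. The function below is a minimal standalone sketch of that definition, not the JaccardSimilarity implementation; returning 0.0 when neither rule flags anything is an assumption made here for convenience.

```python
def jaccard_similarity(col_a, col_b):
    """Jaccard similarity of two equal-length binary sequences."""
    both = sum(1 for a, b in zip(col_a, col_b) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(col_a, col_b) if a == 1 or b == 1)
    # Assumed convention: two rules that never trigger have similarity 0.
    return both / either if either else 0.0


# Hypothetical rule columns.
rule_a = [1, 1, 0, 0, 1]
rule_b = [1, 0, 0, 1, 1]

# Both rules trigger on 2 records; at least one triggers on 4.
sim = jaccard_similarity(rule_a, rule_b)
print(sim)
```

A value near 1 means the two rules flag almost the same records; a value near 0 means they flag mostly different ones.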

Now we can instantiate the AgglomerativeClusteringReducer class with the necessary parameters. See the class docstring for more information regarding these:

[23]:
params = {
    'threshold': 0.1,
    'strategy': 'bottom_up',
    'similarity_function': js.fit,
    'columns_performance': rule_descriptions_train['OptMetric']
}
[24]:
agg_clust = AgglomerativeClusteringReducer(**params)
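To give a feel for how a similarity threshold and a per-rule performance score can work together, here is a deliberately simplified sketch. It swaps in a greedy filter rather than agglomerative clustering, so it should not be read as the AgglomerativeClusteringReducer algorithm; the function names, rule columns and scores are all hypothetical.

```python
def jaccard(col_a, col_b):
    """Jaccard similarity of two equal-length binary sequences."""
    both = sum(1 for a, b in zip(col_a, col_b) if a and b)
    either = sum(1 for a, b in zip(col_a, col_b) if a or b)
    return both / either if either else 0.0


def greedy_uncorrelated(columns, performance, threshold):
    """Keep a rule only if it is dissimilar to every rule already kept.

    Rules are visited from best to worst performance, so within a group
    of similar rules the best-performing one is the one that survives.
    """
    ranked = sorted(columns, key=lambda name: performance[name], reverse=True)
    kept = []
    for name in ranked:
        if all(jaccard(columns[name], columns[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical binary rule columns and performance scores.
columns = {
    "RuleA": [1, 1, 1, 0, 0, 0],
    "RuleB": [1, 1, 0, 0, 0, 0],  # overlaps heavily with RuleA
    "RuleC": [0, 0, 0, 0, 1, 1],  # disjoint from RuleA and RuleB
}
performance = {"RuleA": 0.9, "RuleB": 0.6, "RuleC": 0.7}

kept = greedy_uncorrelated(columns, performance, threshold=0.5)
print(kept)
```

In this toy data RuleB is dropped because its similarity to the better-performing RuleA (2/3) exceeds the threshold, while RuleC is kept because it shares no flagged records with RuleA.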

Finally, we can instantiate the CorrelatedFilter class and run the .fit() method. Note that we only need to provide rule_descriptions if we want to also filter rules from that dataframe (this is a standard dataframe object created when generating/optimising rules):

[25]:
fcr = CorrelatedFilter(
    correlation_reduction_class=agg_clust,
    rule_descriptions=rule_descriptions_train
)
[26]:
fcr.fit(X_rules=X_rules_train)

Outputs

The .fit() method does not return anything. However, it does create the following attribute:

  • rules_to_keep: The list of uncorrelated rules.

[27]:
fcr.rules_to_keep
[27]:
['Rule1', 'Rule2', 'Rule3', 'Rule4']

Drop rules which are correlated

Use the .transform() method to drop the rules which are correlated from a given dataset:

[28]:
X_rules_test_uncorr = fcr.transform(X_rules=X_rules_test)

Outputs

The .transform() method returns a dataframe with the correlated rules dropped. It also filters the provided rule_descriptions dataframe and saves it as a class attribute with the same name:

[29]:
X_rules_test_uncorr.head()
[29]:
     Rule1  Rule2  Rule3  Rule4
eid
0        0      0      0      1
1        0      0      0      1
2        0      0      0      1
3        0      0      0      1
4        0      0      0      1
[30]:
fcr.rule_descriptions
[30]:
       Precision  Recall  PercDataFlagged  OptMetric
Rule
Rule1   1.000000     0.3            0.006        NaN
Rule2   1.000000     0.3            0.006        NaN
Rule3   1.000000     0.3            0.006        NaN
Rule4   0.018036     0.9            0.998        NaN
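Conceptually, the transform step amounts to selecting the kept rules from both the rule columns and the rule descriptions. The sketch below mimics that with plain dictionaries standing in for dataframes; the helper name drop_correlated and all values are hypothetical, and this is not the CorrelatedFilter implementation.

```python
def drop_correlated(x_rules, rule_descriptions, rules_to_keep):
    """Return only the kept rules from the columns and the descriptions."""
    x_kept = {name: x_rules[name] for name in rules_to_keep}
    desc_kept = {
        name: rule_descriptions[name]
        for name in rules_to_keep
        if name in rule_descriptions
    }
    return x_kept, desc_kept


# Hypothetical inputs: three rule columns, of which two are to be kept.
x_rules = {"Rule1": [0, 1, 0], "Rule2": [0, 1, 0], "Rule3": [1, 0, 0]}
rule_descriptions = {
    "Rule1": {"Precision": 1.0},
    "Rule2": {"Precision": 0.8},
    "Rule3": {"Precision": 0.5},
}
rules_to_keep = ["Rule1", "Rule3"]

x_uncorr, descriptions_uncorr = drop_correlated(
    x_rules, rule_descriptions, rules_to_keep
)
print(list(x_uncorr), list(descriptions_uncorr))
```

The same list of kept rules drives both selections, which is why the filtered columns and the filtered descriptions always refer to the same rules.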

Calculate correlated rules and drop them from a dataset (in one step)

You can also use the .fit_transform() method to calculate the correlated rules and drop them from a dataset in a single step:

[31]:
agg_clust = AgglomerativeClusteringReducer(**params)
[32]:
fcr = CorrelatedFilter(
    correlation_reduction_class=agg_clust,
    rule_descriptions=rule_descriptions_train
)
[33]:
X_rules_train_uncorr = fcr.fit_transform(X_rules=X_rules_train)

Outputs

The .fit_transform() method returns a dataframe with the correlated rules dropped, while filtering the provided rule_descriptions dataframe and saving it as a class attribute with the same name. It also creates the following attribute:

  • rules_to_keep: The list of rules which are uncorrelated.

[34]:
fcr.rules_to_keep
[34]:
['Rule1', 'Rule2', 'Rule3', 'Rule4']
[35]:
X_rules_train_uncorr.head()
[35]:
     Rule1  Rule2  Rule3  Rule4
eid
0        0      0      0      1
1        0      0      0      1
2        0      0      0      1
3        0      0      0      1
4        0      0      0      1
[36]:
fcr.rule_descriptions
[36]:
       Precision  Recall  PercDataFlagged  OptMetric
Rule
Rule1   1.000000     0.3            0.006        NaN
Rule2   1.000000     0.3            0.006        NaN
Rule3   1.000000     0.3            0.006        NaN
Rule4   0.018036     0.9            0.998        NaN
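The relationship between the three methods can be summarised with a toy class in which fit_transform simply calls fit and then transform on the same data. ToyFilter is a hypothetical stand-in written for this illustration: its selection criterion (drop exact duplicate columns) is far cruder than a similarity-based reducer, and it is not how CorrelatedFilter is implemented.

```python
class ToyFilter:
    """Hypothetical stand-in showing the fit / transform / fit_transform pattern."""

    def fit(self, x_rules):
        # Toy criterion: keep the first rule seen for each distinct column.
        seen = set()
        self.rules_to_keep = []
        for name, col in x_rules.items():
            key = tuple(col)
            if key not in seen:
                seen.add(key)
                self.rules_to_keep.append(name)

    def transform(self, x_rules):
        # Keep only the columns selected during fit.
        return {name: x_rules[name] for name in self.rules_to_keep}

    def fit_transform(self, x_rules):
        self.fit(x_rules)
        return self.transform(x_rules)


# Hypothetical rule columns: Rule2 duplicates Rule1.
x_train = {"Rule1": [0, 1, 0], "Rule2": [0, 1, 0], "Rule3": [1, 0, 0]}

# One step.
one_step = ToyFilter().fit_transform(x_train)

# Two steps on the same data give the same result.
two_step_filter = ToyFilter()
two_step_filter.fit(x_train)
two_step = two_step_filter.transform(x_train)
print(one_step == two_step)
```

Keeping fit and transform separate is what allows rules selected on a training set to be applied unchanged to a test set, as in the earlier cells.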