CorrelatedFilter Example¶
This notebook contains an example of how the CorrelatedFilter class can be used to keep only those rules which are uncorrelated.
Requirements¶
To run, you’ll need the following:
A rule set (specifically the binary columns of the rules as applied to a dataset).
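For illustration, a rule set in this form could be produced by evaluating each rule's condition against a dataset and storing the result as a 0/1 column. The sketch below assumes a hypothetical dataset with columns `amount` and `num_orders` and two made-up rules; none of these names come from Iguanas or from the dummy data used later.

```python
import pandas as pd

# Hypothetical raw dataset, indexed by an event ID ('eid').
X = pd.DataFrame(
    {'amount': [10, 250, 40, 900], 'num_orders': [1, 7, 2, 9]},
    index=pd.Index([0, 1, 2, 3], name='eid')
)

# Hypothetical rules, each expressed as a boolean condition on X.
rule_conditions = {
    'RuleA': X['amount'] > 100,
    'RuleB': X['num_orders'] > 5,
}

# One binary column per rule: 1 where the rule triggers, 0 otherwise.
X_rules = pd.DataFrame(rule_conditions).astype(int)
```

Each row is a record and each column is a rule, which is the layout the rest of this notebook expects for `X_rules`.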
Import packages¶
[20]:
from iguanas.rule_selection import CorrelatedFilter
from iguanas.correlation_reduction import AgglomerativeClusteringReducer
from iguanas.metrics.pairwise import JaccardSimilarity
import pandas as pd
Read in data¶
Let’s read in some dummy rules (stored as binary columns) and their corresponding performance metric dataframes:
[21]:
X_rules_train = pd.read_csv(
    'dummy_data/X_rules_train.csv',
    index_col='eid'
)
X_rules_test = pd.read_csv(
    'dummy_data/X_rules_test.csv',
    index_col='eid'
)
rule_descriptions_train = pd.read_csv(
    'dummy_data/rule_descriptions_train.csv',
    index_col='Rule'
)
rule_descriptions_test = pd.read_csv(
    'dummy_data/rule_descriptions_test.csv',
    index_col='Rule'
)
Calculate uncorrelated rules¶
First, we need to instantiate the class that will perform the correlation reduction. See the correlation_reduction module for more information. We’ll be using the AgglomerativeClusteringReducer class from that module.
To instantiate the AgglomerativeClusteringReducer class, we need to first choose our similarity function. See the metrics.pairwise module for more information. In this example, we’ll use the Jaccard similarity:
[22]:
js = JaccardSimilarity()
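As a reminder of what the Jaccard similarity measures for two binary rule columns, it is the number of records where both rules trigger divided by the number of records where at least one triggers. The function below is a minimal pure-Python sketch of that definition, not the Iguanas implementation; the function name and the choice to return 0 when neither rule ever triggers are assumptions made for this example.

```python
def jaccard_similarity(col_a, col_b):
    """Jaccard similarity of two equal-length binary (0/1) sequences."""
    both = sum(1 for a, b in zip(col_a, col_b) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(col_a, col_b) if a == 1 or b == 1)
    # Assumed convention: two rules that never trigger have similarity 0.
    return both / either if either else 0.0


rule_a = [1, 1, 0, 0, 1]
rule_b = [1, 0, 0, 1, 1]

# Both trigger on 2 records; at least one triggers on 4 records.
similarity = jaccard_similarity(rule_a, rule_b)
```

A value near 1 means the two rules flag almost the same records, so keeping both adds little.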
Now we can instantiate the AgglomerativeClusteringReducer class with the necessary parameters. See the class docstring for more information regarding these:
[23]:
params = {
    'threshold': 0.1,
    'strategy': 'bottom_up',
    'similarity_function': js.fit,
    'columns_performance': rule_descriptions_train['OptMetric']
}
[24]:
agg_clust = AgglomerativeClusteringReducer(**params)
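To build intuition for how a similarity threshold and a per-rule performance score can work together, here is a simplified stand-in. It is not the AgglomerativeClusteringReducer algorithm: it is a greedy filter that assumes `threshold` is the similarity above which two rules count as correlated and that higher performance values are better. All names in it are hypothetical.

```python
def jaccard(col_a, col_b):
    """Jaccard similarity of two equal-length binary (0/1) sequences."""
    both = sum(1 for a, b in zip(col_a, col_b) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(col_a, col_b) if a == 1 or b == 1)
    return both / either if either else 0.0


def select_uncorrelated(columns, performance, threshold):
    """Greedy sketch: visit rules from best to worst performance and keep a
    rule only if it is not too similar to any rule already kept."""
    ranked = sorted(columns, key=lambda name: performance[name], reverse=True)
    kept = []
    for name in ranked:
        if all(jaccard(columns[name], columns[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical binary rule columns and performance scores.
columns = {
    'RuleA': [1, 1, 0, 0, 0, 0],
    'RuleB': [1, 1, 1, 0, 0, 0],
    'RuleC': [0, 0, 0, 0, 1, 1],
}
performance = {'RuleA': 0.6, 'RuleB': 0.8, 'RuleC': 0.5}

kept = select_uncorrelated(columns, performance, threshold=0.5)
```

Here RuleA overlaps heavily with the better-performing RuleB (similarity 2/3), so it is dropped, while RuleC shares no records with RuleB and is kept.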
Finally, we can instantiate the CorrelatedFilter class and run the .fit() method. Note that we only need to provide rule_descriptions if we want to also filter rules from that dataframe (this is a standard dataframe object created when generating/optimising rules):
[25]:
fcr = CorrelatedFilter(
    correlation_reduction_class=agg_clust,
    rule_descriptions=rule_descriptions_train
)
[26]:
fcr.fit(X_rules=X_rules_train)
Outputs¶
The .fit() method does not return anything. However, it does create the following attribute:
rules_to_keep: The list of uncorrelated rules.
[27]:
fcr.rules_to_keep
[27]:
['Rule1', 'Rule2', 'Rule3', 'Rule4']
Drop rules which are correlated¶
Use the .transform() method to drop the rules which are correlated from a given dataset:
[28]:
X_rules_test_uncorr = fcr.transform(X_rules=X_rules_test)
Outputs¶
The .transform() method returns a dataframe with the correlated rules dropped. It also filters the provided rule_descriptions dataframe and saves the result as an attribute with the same name:
[29]:
X_rules_test_uncorr.head()
[29]:
eid | Rule1 | Rule2 | Rule3 | Rule4 |
---|---|---|---|---|
0 | 0 | 0 | 0 | 1 |
1 | 0 | 0 | 0 | 1 |
2 | 0 | 0 | 0 | 1 |
3 | 0 | 0 | 0 | 1 |
4 | 0 | 0 | 0 | 1 |
[30]:
fcr.rule_descriptions
[30]:
Rule | Precision | Recall | PercDataFlagged | OptMetric |
---|---|---|---|---|
Rule1 | 1.000000 | 0.3 | 0.006 | NaN |
Rule2 | 1.000000 | 0.3 | 0.006 | NaN |
Rule3 | 1.000000 | 0.3 | 0.006 | NaN |
Rule4 | 0.018036 | 0.9 | 0.998 | NaN |
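Conceptually, the transform step described above amounts to selecting the kept rule names from both dataframes. The sketch below shows that idea in plain pandas with hypothetical data and a hypothetical `rules_to_keep` list; it assumes nothing about how CorrelatedFilter implements the step internally.

```python
import pandas as pd

# Hypothetical result of a fitting step.
rules_to_keep = ['RuleA', 'RuleC']

# Hypothetical binary rule columns for a new dataset.
X_rules_new = pd.DataFrame(
    {'RuleA': [0, 1], 'RuleB': [0, 1], 'RuleC': [1, 0]},
    index=pd.Index([10, 11], name='eid')
)

# Hypothetical rule metadata, indexed by rule name.
rule_descriptions = pd.DataFrame(
    {'Precision': [0.9, 0.8, 0.7]},
    index=pd.Index(['RuleA', 'RuleB', 'RuleC'], name='Rule')
)

# Keep only the selected rule columns in the dataset...
X_rules_new_uncorr = X_rules_new[rules_to_keep]
# ...and only the matching rows in the metadata.
rule_descriptions_kept = rule_descriptions.loc[rules_to_keep]
```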
Calculate correlated rules and drop them from a dataset (in one step)¶
You can also use the .fit_transform() method to calculate the correlated rules and drop them from the same dataset in a single call:
[31]:
agg_clust = AgglomerativeClusteringReducer(**params)
[32]:
fcr = CorrelatedFilter(
    correlation_reduction_class=agg_clust,
    rule_descriptions=rule_descriptions_train
)
[33]:
X_rules_train_uncorr = fcr.fit_transform(X_rules=X_rules_train)
Outputs¶
The .fit_transform() method returns a dataframe with the correlated rules dropped, while filtering the provided rule_descriptions dataframe and saving it as a class attribute with the same name. It also creates the following attribute:
rules_to_keep: The list of rules which are uncorrelated.
[34]:
fcr.rules_to_keep
[34]:
['Rule1', 'Rule2', 'Rule3', 'Rule4']
[35]:
X_rules_train_uncorr.head()
[35]:
eid | Rule1 | Rule2 | Rule3 | Rule4 |
---|---|---|---|---|
0 | 0 | 0 | 0 | 1 |
1 | 0 | 0 | 0 | 1 |
2 | 0 | 0 | 0 | 1 |
3 | 0 | 0 | 0 | 1 |
4 | 0 | 0 | 0 | 1 |
[36]:
fcr.rule_descriptions
[36]:
Rule | Precision | Recall | PercDataFlagged | OptMetric |
---|---|---|---|---|
Rule1 | 1.000000 | 0.3 | 0.006 | NaN |
Rule2 | 1.000000 | 0.3 | 0.006 | NaN |
Rule3 | 1.000000 | 0.3 | 0.006 | NaN |
Rule4 | 0.018036 | 0.9 | 0.998 | NaN |
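The relationship between the one-step and two-step workflows can be illustrated with a toy class. This is a hypothetical stand-in written for this example, not Iguanas code: `ToyFilter` and `drop_duplicate_columns` are invented names, and the selection logic simply drops exact duplicate columns rather than performing any correlation reduction.

```python
class ToyFilter:
    """Hypothetical fit/transform style filter (not Iguanas code)."""

    def __init__(self, select):
        # select: function mapping {rule name: column} to names to keep.
        self.select = select

    def fit(self, X_rules):
        # Learn which rules to keep; returns nothing.
        self.rules_to_keep = self.select(X_rules)

    def transform(self, X_rules):
        # Apply the learned selection to any dataset with the same rules.
        return {name: X_rules[name] for name in self.rules_to_keep}

    def fit_transform(self, X_rules):
        # Fit, then transform the same dataset.
        self.fit(X_rules)
        return self.transform(X_rules)


def drop_duplicate_columns(X_rules):
    """Keep the first rule seen for each distinct column of values."""
    seen, kept = set(), []
    for name, column in X_rules.items():
        key = tuple(column)
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return kept


X_rules = {'RuleA': [1, 0, 1], 'RuleB': [1, 0, 1], 'RuleC': [0, 1, 1]}

# One step.
one_step = ToyFilter(drop_duplicate_columns).fit_transform(X_rules)

# Two steps on the same data give the same result.
two_step_filter = ToyFilter(drop_duplicate_columns)
two_step_filter.fit(X_rules)
two_step = two_step_filter.transform(X_rules)
```

The two-step form is the one to use when the selection is learned on one dataset (for example, training data) and then applied to another.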