Grid Search CV Example¶
This notebook contains an example of how the Grid Search CV module can be used to search for the best parameters when generating or optimising a set of rules.
The Grid Search CV process is as follows:
* Generate k-fold stratified cross validation datasets.
* Generate unique sets of rule generation/optimisation parameters from the given search space.
* For each of the training and validation datasets:
  * Train/optimise a set of rules using each unique set of parameters (using the training dataset).
  * Use the GreedyFilter class (in the rule_selection module) to calculate the maximum overall performance of each rule set on the validation dataset. See the class’ example notebook for more information.
* Return the parameters which generated the highest mean overall performance across the validation datasets.
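The process above can be sketched in plain Python. This is only an illustration of the general technique, not the Iguanas implementation: the helper names (`stratified_folds`, `unique_param_sets`, `grid_search_cv`) and the toy scoring function are hypothetical, and the real class trains rules and applies the GreedyFilter where the sketch calls `fit_and_score`.

```python
from itertools import product
from statistics import fmean


def stratified_folds(y, k):
    """Assign row indices to k folds, round-robin within each class label."""
    folds = [[] for _ in range(k)]
    seen_per_class = {}
    for idx, label in enumerate(y):
        position = seen_per_class.get(label, 0)
        folds[position % k].append(idx)
        seen_per_class[label] = position + 1
    return folds


def unique_param_sets(param_grid):
    """Expand a dict of lists into one dict per combination of values."""
    keys = list(param_grid)
    return [dict(zip(keys, combo)) for combo in product(*param_grid.values())]


def grid_search_cv(y, param_grid, k, fit_and_score):
    """Return the parameter set with the highest mean validation score."""
    folds = stratified_folds(y, k)
    results = []
    for params in unique_param_sets(param_grid):
        scores = []
        for val_idx in folds:
            held_out = set(val_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            # Train on train_idx, score on val_idx (supplied by the caller).
            scores.append(fit_and_score(params, train_idx, val_idx))
        results.append((fmean(scores), params))
    best_mean, best_params = max(results, key=lambda result: result[0])
    return best_params, best_mean


# Hypothetical scorer: pretends that more conditions always score higher.
def toy_fit_and_score(params, train_idx, val_idx):
    return 0.1 * params["n_total_conditions"]


y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
best_params, best_mean = grid_search_cv(
    y, {"n_total_conditions": [1, 4]}, k=3, fit_and_score=toy_fit_and_score
)
print(best_params, best_mean)
```

Each fold in the toy example receives exactly one positive label, which is the point of stratifying: every validation set keeps roughly the class balance of the full dataset.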
Requirements¶
To run, you’ll need the following:
A labelled, processed dataset (nulls imputed, categorical features encoded).
Import packages¶
[1]:
from iguanas.rule_selection import GridSearchCV
from iguanas.rule_generation import RuleGeneratorDT
from iguanas.rule_optimisation import BayesianOptimiser
from iguanas.metrics.classification import FScore
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from hyperopt import tpe, anneal
Read in data¶
Let’s read in some labelled, processed dummy data.
[2]:
X_train = pd.read_csv(
    'dummy_data/X_train_gs.csv',
    index_col='eid'
)
y_train = pd.read_csv(
    'dummy_data/y_train_gs.csv',
    index_col='eid'
).squeeze()
X_test = pd.read_csv(
    'dummy_data/X_test_gs.csv',
    index_col='eid'
)
y_test = pd.read_csv(
    'dummy_data/y_test_gs.csv',
    index_col='eid'
).squeeze()
Generate rules using GridSearchCV¶
We can use the GridSearchCV class to implement stratified k-fold cross validation when searching for the best rule generation parameters. This allows us to find the best rule generation parameters whilst also reducing the likelihood of overfitting.
Set up class parameters¶
We first need to define the search values for each parameter in the provided rule generation class. We define these in a dictionary, where the dictionary keys are the relevant rule generation parameters and the dictionary values are lists of search values for each parameter. The GridSearchCV class will then calculate each unique combination of parameter values, and find the set of parameters that produces the best mean overall rule performance across the folds.
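To see how many combinations a grid produces, the expansion can be reproduced with `itertools.product`. The sketch below mirrors the shape of the grid used in this notebook but replaces the real objects with placeholder strings (an assumption made so that it runs without Iguanas or scikit-learn installed); only the number of values per key matters for the count.

```python
from itertools import product

# Placeholder strings stand in for f1.fit and the two RandomForestClassifier
# instances; they are illustrative only.
param_grid = {
    "opt_func": ["f1.fit"],
    "n_total_conditions": [1, 4],
    "tree_ensemble": ["rf_5_estimators", "rf_15_estimators"],
    "target_feat_corr_types": [None, "Infer"],
}

keys = list(param_grid)
param_sets = [dict(zip(keys, combo)) for combo in product(*param_grid.values())]

n_param_sets = len(param_sets)   # 1 * 2 * 2 * 2
n_folds = 3                      # the cv value used in this notebook
n_fits = n_param_sets * n_folds  # one fit per parameter set per fold

print(n_param_sets, n_fits)  # prints: 8 24
```

These counts agree with the "8 unique parameter sets" message and the 24-step progress bar shown when the class is instantiated and fitted. With `product`, the last key varies fastest, which appears to match the ParamSetIndex ordering in the results tables, although that ordering is not guaranteed by this sketch.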
Note that if you’re using the FScore, Precision or Recall score as the optimisation function, use the FScore, Precision or Recall classes in the metrics.classification module rather than the same functions from Sklearn’s metrics module, as the former are ~100 times faster on larger datasets.
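For reference, the F-beta score that the FScore class is named after can be written in a few lines. This is a plain-Python sketch of the standard formula, not the Iguanas implementation; the function name `f_beta` and the confusion-matrix counts are hypothetical.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """Standard F-beta score from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical counts for a rule with high precision and low recall.
tp, fp, fn = 39, 7, 77
print(f_beta(tp, fp, fn, beta=1))    # balances precision and recall
print(f_beta(tp, fp, fn, beta=0.5))  # weights precision more heavily
```

A beta below 1 rewards precise rules, so the same rule scores higher under beta=0.5 than under beta=1 when its precision exceeds its recall.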
[3]:
f1 = FScore(beta=1)
[4]:
param_grid = {
    'opt_func': [f1.fit],
    'n_total_conditions': [1, 4],
    'tree_ensemble': [
        RandomForestClassifier(n_estimators=5, random_state=0),
        RandomForestClassifier(n_estimators=15, random_state=0),
    ],
    'target_feat_corr_types': [None, 'Infer']
}
Now that we have our search values, we can define the rest of the GridSearchCV class parameters. Note here that we are splitting the data into 3 folds for training/validation.
Please see the class docstring for more information on each parameter.
[5]:
params = {
    'rule_class': RuleGeneratorDT,
    'param_grid': param_grid,
    'greedy_filter_opt_func': f1.fit,
    'cv': 3,
    'num_cores': 4,
    'verbose': 1
}
Instantiate class and run fit method¶
Once the parameters have been set, we can run the .fit() method to search for the best rule generation parameters.
[6]:
gs_cv = GridSearchCV(**params)
8 unique parameter sets
[7]:
gs_cv.fit(
    X=X_train,
    y=y_train
)
--- Fitting and validating rules using folds ---
100%|██████████| 24/24 [00:04<00:00, 5.31it/s]
--- Re-fitting rules using best parameters on full dataset ---
--- Filtering rules to give best combined performance ---
Outputs¶
The .fit() method does not return any objects. However, it does generate the following useful attributes:
rule_strings (Dict[str, str]): The rules which achieved the best combined performance, defined using the standard Iguanas string format (values) and their names (keys).
param_results_per_fold (pd.DataFrame): Shows the best combined rule performance observed for each parameter set and fold.
param_results_aggregated (pd.DataFrame): Shows the mean and the standard deviation of the best combined rule performance, calculated across the folds, for each parameter set.
best_perf (float): The best combined rule performance achieved, i.e. the highest mean performance across the folds.
best_params (dict): The parameter set that achieved the best combined rule performance.
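The "best combined performance" mentioned in these attributes comes from the greedy filtering step. The sketch below shows one way such a filter could work: rank the rules by their individual score, OR together the top n rules for increasing n, and keep the n that gives the highest combined score. It is an assumption-laden illustration in plain Python (the function names and toy data are hypothetical), and the real GreedyFilter class may differ in its details.

```python
def f1_score(y_true, y_pred):
    """F1 score for binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def greedy_filter(rule_flags, y_true, metric):
    """Return the top-n rules (and their combined score) that maximise the metric."""
    # Rank rules by their individual score, best first.
    ranked = sorted(
        rule_flags, key=lambda name: metric(y_true, rule_flags[name]), reverse=True
    )
    best_score, best_n = -1.0, 0
    combined = [0] * len(y_true)
    for n, name in enumerate(ranked, start=1):
        # A record is flagged by the rule set if any of the top-n rules flags it.
        combined = [c | f for c, f in zip(combined, rule_flags[name])]
        score = metric(y_true, combined)
        if score > best_score:
            best_score, best_n = score, n
    return ranked[:best_n], best_score


# Toy data: three hypothetical rules applied to eight records.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
rule_flags = {
    "rule_a": [1, 1, 0, 0, 0, 0, 0, 0],
    "rule_b": [0, 0, 1, 0, 0, 0, 0, 0],
    "rule_c": [0, 0, 0, 1, 1, 1, 1, 0],
}
kept_rules, combined_score = greedy_filter(rule_flags, y_true, f1_score)
print(kept_rules, combined_score)
```

In the toy data, adding the third rule catches one more positive but also three negatives, so the filter stops at the first two rules.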
[8]:
gs_cv.param_results_per_fold.head()
[8]:
Fold | ParamSetIndex | opt_func | n_total_conditions | tree_ensemble | target_feat_corr_types | Performance |
---|---|---|---|---|---|---|
0 | 0 | <bound method FScore.fit of FScore with beta=1> | 1 | RandomForestClassifier(n_estimators=5, random_... | None | 0.511278 |
0 | 1 | <bound method FScore.fit of FScore with beta=1> | 1 | RandomForestClassifier(n_estimators=5, random_... | Infer | 0.511278 |
0 | 2 | <bound method FScore.fit of FScore with beta=1> | 1 | (DecisionTreeClassifier(max_depth=4, max_featu... | None | 0.532258 |
0 | 3 | <bound method FScore.fit of FScore with beta=1> | 1 | (DecisionTreeClassifier(max_depth=4, max_featu... | Infer | 0.532258 |
0 | 4 | <bound method FScore.fit of FScore with beta=1> | 4 | RandomForestClassifier(n_estimators=5, random_... | None | 0.614458 |
[9]:
gs_cv.param_results_aggregated.head()
[9]:
ParamSetIndex | opt_func | n_total_conditions | tree_ensemble | target_feat_corr_types | PerformancePerFold | MeanPerformance | StdDevPerformance |
---|---|---|---|---|---|---|---|
0 | <bound method FScore.fit of FScore with beta=1> | 1 | RandomForestClassifier(n_estimators=5, random_... | None | [0.5112781954887218, 0.35185185185185186, 0.38... | 0.417464 | 0.068072 |
1 | <bound method FScore.fit of FScore with beta=1> | 1 | RandomForestClassifier(n_estimators=5, random_... | Infer | [0.5112781954887218, 0.35185185185185186, 0.38... | 0.417464 | 0.068072 |
2 | <bound method FScore.fit of FScore with beta=1> | 1 | (DecisionTreeClassifier(max_depth=4, max_featu... | None | [0.532258064516129, 0.38961038961038963, 0.402... | 0.441582 | 0.064346 |
3 | <bound method FScore.fit of FScore with beta=1> | 1 | (DecisionTreeClassifier(max_depth=4, max_featu... | Infer | [0.532258064516129, 0.38961038961038963, 0.402... | 0.441582 | 0.064346 |
4 | <bound method FScore.fit of FScore with beta=1> | 4 | RandomForestClassifier(n_estimators=5, random_... | None | [0.6144578313253012, 0.538860103626943, 0.5] | 0.551106 | 0.047523 |
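The aggregated columns can be reproduced from the per-fold values. The sketch below uses the row for ParamSetIndex 4, the only row in the table above whose PerformancePerFold list is shown in full. Using a population standard deviation (rather than a sample one) is an assumption, chosen because it is consistent with the displayed value for this row.

```python
from statistics import fmean, pstdev

# Per-fold performance for ParamSetIndex 4, copied from the table above.
per_fold = [0.6144578313253012, 0.538860103626943, 0.5]

mean_performance = fmean(per_fold)
std_performance = pstdev(per_fold)  # population standard deviation (ddof=0)

print(round(mean_performance, 6), round(std_performance, 6))
```

The parameter set with the highest mean is the one reported by the search, so a set that does well on one fold but poorly on the others is penalised.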
[10]:
gs_cv.best_perf
[10]:
0.5829592533675241
[11]:
gs_cv.best_params
[11]:
{'opt_func': <bound method FScore.fit of FScore with beta=1>,
'n_total_conditions': 4,
'tree_ensemble': RandomForestClassifier(max_depth=4, n_estimators=15, random_state=0),
'target_feat_corr_types': 'Infer'}
We can also plot the overall performance of each set of rules (generated using each parameter set) on each of the validation datasets:
[12]:
gs_cv.plot_top_n_performance_by_fold(figsize=(8, 4))
Apply rules to a separate dataset¶
Use the .transform() method to apply the best performing rules overall to a separate dataset.
[13]:
X_rules_test = gs_cv.transform(
    X=X_test,
    y=y_test,
    sample_weight=None
)
Outputs¶
The .transform() method returns a dataframe giving the binary columns of the rules as applied to the given dataset.
A useful attribute created by running the .transform() method is:
rule_descriptions: A dataframe showing the logic of the generated rules and their performance metrics as applied to the given dataset.
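As a rough guide to reading this table, the sketch below computes three of its columns for a single binary rule column. The definitions are the usual ones (precision, recall, and the fraction of records flagged) and are assumed here rather than taken from the Iguanas source; the function name `describe_rule` and the toy data are hypothetical.

```python
def describe_rule(flags, y_true):
    """Summarise one binary rule column against binary labels."""
    flagged = sum(flags)
    positives = sum(y_true)
    true_positives = sum(1 for f, t in zip(flags, y_true) if f == 1 and t == 1)
    return {
        "Precision": true_positives / flagged if flagged else 0.0,
        "Recall": true_positives / positives if positives else 0.0,
        "PercDataFlagged": flagged / len(flags),
    }


# Toy data: ten records, two of which are positive; the rule flags two records.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
flags = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
summary = describe_rule(flags, y_true)
print(summary)
```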
[14]:
gs_cv.rule_descriptions.head()
[14]:
Rule | Precision | Recall | PercDataFlagged | OptMetric | Logic | nConditions |
---|---|---|---|---|---|---|
RGDT_Rule_20211202_100 | 0.847826 | 0.336207 | 0.010497 | 0.481481 | (X['account_number_sum_order_total_per_account... | 3 |
RGDT_Rule_20211202_6 | 0.860465 | 0.318966 | 0.009813 | 0.465409 | (X['account_number_avg_order_total_per_account... | 2 |
RGDT_Rule_20211202_104 | 0.860465 | 0.318966 | 0.009813 | 0.465409 | (X['account_number_sum_order_total_per_account... | 2 |
RGDT_Rule_20211202_39 | 0.837838 | 0.267241 | 0.008444 | 0.405229 | (X['account_number_avg_order_total_per_account... | 2 |
RGDT_Rule_20211202_112 | 0.852941 | 0.250000 | 0.007759 | 0.386667 | (X['account_number_sum_order_total_per_account... | 2 |
[15]:
X_rules_test.head()
[15]:
eid | RGDT_Rule_20211202_100 | RGDT_Rule_20211202_6 | RGDT_Rule_20211202_104 | RGDT_Rule_20211202_39 | RGDT_Rule_20211202_112 | RGDT_Rule_20211202_46 | RGDT_Rule_20211202_97 | RGDT_Rule_20211202_121 | RGDT_Rule_20211202_45 | RGDT_Rule_20211202_23 | ... | RGDT_Rule_20211202_95 | RGDT_Rule_20211202_94 | RGDT_Rule_20211202_16 | RGDT_Rule_20211202_88 | RGDT_Rule_20211202_8 | RGDT_Rule_20211202_43 | RGDT_Rule_20211202_65 | RGDT_Rule_20211202_15 | RGDT_Rule_20211202_41 | RGDT_Rule_20211202_30 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
975-8351797-7122581 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
785-6259585-7858053 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
057-4039373-1790681 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
095-5263240-3834186 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
980-3802574-0009480 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 45 columns
Optimise rules using GridSearchCV¶
We can use the GridSearchCV class to implement stratified k-fold cross validation when searching for the best rule optimisation parameters. This allows us to find the best rule optimisation parameters whilst also reducing the likelihood of overfitting.
Set up class parameters¶
We first need to define the search values for each parameter in the provided rule optimisation class. We define these in a dictionary, where the dictionary keys are the relevant rule optimisation parameters and the dictionary values are lists of search values for each parameter. The GridSearchCV class will then calculate each unique combination of parameter values, and find the set of parameters that produces the best mean overall rule performance across the folds.
Note that if you’re using the FScore, Precision or Recall score as the optimisation function, use the FScore, Precision or Recall classes in the metrics.classification module rather than the same functions from Sklearn’s metrics module, as the former are ~100 times faster on larger datasets.
[16]:
f0dot5 = FScore(beta=0.5)
f1 = FScore(beta=1)
In this example, we’ll optimise the rules we generated in the previous Grid Search exercise. To do this, we need to convert the generated rules into the standard Iguanas lambda expression format:
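To give a feel for the lambda expression format, the sketch below builds one rule by hand. It assumes, based on the `**kwargs` signature and the structure of `lambda_kwargs` shown in the outputs further down, that each lambda takes feature thresholds as keyword arguments and returns the rule as a string. The rule name, feature name and threshold are hypothetical.

```python
# Hypothetical rule: the threshold is left as a named placeholder so that an
# optimiser can try different values without rebuilding the rule.
rule_lambdas = {
    "ExampleRule": lambda **kwargs: "(X['order_total_1day']>{order_total_1day})".format(**kwargs)
}

# One dict of keyword arguments per rule, keyed by rule name.
lambda_kwargs = {"ExampleRule": {"order_total_1day": 373.625}}

# Injecting the keyword arguments yields the rule in string form.
rule_strings = {
    name: rule_lambda(**lambda_kwargs[name])
    for name, rule_lambda in rule_lambdas.items()
}
print(rule_strings["ExampleRule"])  # prints: (X['order_total_1day']>373.625)
```

Separating the rule logic from its threshold values is what lets an optimiser search over the thresholds while the conditions themselves stay fixed.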
[17]:
rule_lambdas = gs_cv.as_rule_lambdas(
    as_numpy=False,
    with_kwargs=True
)
lambda_kwargs = gs_cv.lambda_kwargs
[18]:
param_grid = {
    'rule_lambdas': [rule_lambdas],
    'lambda_kwargs': [lambda_kwargs],
    'opt_func': [f0dot5.fit, f1.fit],
    'n_iter': [30],
    'algorithm': [tpe.suggest, anneal.suggest],
    'verbose': [0]
}
Now that we have our search values, we can define the rest of the GridSearchCV class parameters. Note here that we are splitting the data into 3 folds for training/validation.
Please see the class docstring for more information on each parameter.
[19]:
params = {
    'rule_class': BayesianOptimiser,
    'param_grid': param_grid,
    'greedy_filter_opt_func': f1.fit,
    'cv': 3,
    'num_cores': 4,
    'verbose': 1
}
Instantiate class and run fit method¶
Once the parameters have been set, we can run the .fit() method to search for the best rule optimisation parameters.
[20]:
gs_cv = GridSearchCV(**params)
4 unique parameter sets
[21]:
gs_cv.fit(
    X=X_train,
    y=y_train
)
--- Fitting and validating rules using folds ---
100%|██████████| 12/12 [00:07<00:00, 1.67it/s]
--- Re-fitting rules using best parameters on full dataset ---
--- Filtering rules to give best combined performance ---
Outputs¶
The .fit() method does not return any objects. However, it does generate the following useful attributes:
rule_strings (Dict[str, str]): The rules which achieved the best combined performance, defined using the standard Iguanas string format (values) and their names (keys).
param_results_per_fold (pd.DataFrame): Shows the best combined rule performance observed for each parameter set and fold.
param_results_aggregated (pd.DataFrame): Shows the mean and the standard deviation of the best combined rule performance, calculated across the folds, for each parameter set.
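best_perf (float): The best combined rule performance achieved, i.e. the highest mean performance across the folds (here it equals the top MeanPerformance value in param_results_aggregated).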
best_params (dict): The parameter set that achieved the best combined rule performance.
[22]:
gs_cv.param_results_per_fold.head()
[22]:
Fold | ParamSetIndex | rule_lambdas | lambda_kwargs | opt_func | n_iter | algorithm | verbose | Performance |
---|---|---|---|---|---|---|---|---|
0 | 0 | {'RGDT_Rule_20211202_100': <function ConvertRu... | {'RGDT_Rule_20211202_100': {'account_number_su... | <bound method FScore.fit of FScore with beta=0.5> | 30 | <function suggest at 0x7f9460945700> | 0 | 0.608108 |
0 | 1 | {'RGDT_Rule_20211202_100': <function ConvertRu... | {'RGDT_Rule_20211202_100': {'account_number_su... | <bound method FScore.fit of FScore with beta=0.5> | 30 | <function suggest at 0x7f9460954ee0> | 0 | 0.614379 |
0 | 2 | {'RGDT_Rule_20211202_100': <function ConvertRu... | {'RGDT_Rule_20211202_100': {'account_number_su... | <bound method FScore.fit of FScore with beta=1> | 30 | <function suggest at 0x7f9460945700> | 0 | 0.574586 |
0 | 3 | {'RGDT_Rule_20211202_100': <function ConvertRu... | {'RGDT_Rule_20211202_100': {'account_number_su... | <bound method FScore.fit of FScore with beta=1> | 30 | <function suggest at 0x7f9460954ee0> | 0 | 0.633803 |
1 | 0 | {'RGDT_Rule_20211202_100': <function ConvertRu... | {'RGDT_Rule_20211202_100': {'account_number_su... | <bound method FScore.fit of FScore with beta=0.5> | 30 | <function suggest at 0x7f9460945700> | 0 | 0.546763 |
[23]:
gs_cv.param_results_aggregated.head()
[23]:
ParamSetIndex | rule_lambdas | lambda_kwargs | opt_func | n_iter | algorithm | verbose | PerformancePerFold | MeanPerformance | StdDevPerformance |
---|---|---|---|---|---|---|---|---|---|
0 | {'RGDT_Rule_20211202_100': <function ConvertRu... | {'RGDT_Rule_20211202_100': {'account_number_su... | <bound method FScore.fit of FScore with beta=0.5> | 30 | <function suggest at 0x7f9460945700> | 0 | [0.608108108108108, 0.5467625899280575, 0.5350... | 0.563301 | 0.032043 |
1 | {'RGDT_Rule_20211202_100': <function ConvertRu... | {'RGDT_Rule_20211202_100': {'account_number_su... | <bound method FScore.fit of FScore with beta=0.5> | 30 | <function suggest at 0x7f9460954ee0> | 0 | [0.6143790849673203, 0.562962962962963, 0.5323... | 0.569905 | 0.033836 |
2 | {'RGDT_Rule_20211202_100': <function ConvertRu... | {'RGDT_Rule_20211202_100': {'account_number_su... | <bound method FScore.fit of FScore with beta=1> | 30 | <function suggest at 0x7f9460945700> | 0 | [0.574585635359116, 0.5454545454545454, 0.5362... | 0.552091 | 0.016346 |
3 | {'RGDT_Rule_20211202_100': <function ConvertRu... | {'RGDT_Rule_20211202_100': {'account_number_su... | <bound method FScore.fit of FScore with beta=1> | 30 | <function suggest at 0x7f9460954ee0> | 0 | [0.6338028169014086, 0.5735294117647058, 0.549... | 0.585451 | 0.035624 |
[24]:
gs_cv.best_perf
[24]:
0.5854506121697506
[25]:
gs_cv.best_params
[25]:
{'rule_lambdas': {'RGDT_Rule_20211202_100': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_104': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_6': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_39': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_97': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_112': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_121': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_46': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_45': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_23': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_11': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_22': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_128': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_114': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_127': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_113': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_150': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_141': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_44': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_60': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_52': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_10': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_59': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_140': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_117': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_55': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_36': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_78': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_77': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_64': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_18': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_95': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_94': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_96': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_16': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_79': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_75': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_9': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_88': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_43': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_8': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_15': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_41': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_65': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'RGDT_Rule_20211202_30': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>},
'lambda_kwargs': {'RGDT_Rule_20211202_100': {'account_number_sum_order_total_per_account_number_1day': 373.625,
'account_number_sum_order_total_per_account_number_90day': 259.27},
'RGDT_Rule_20211202_104': {'account_number_sum_order_total_per_account_number_1day': 421.565},
'RGDT_Rule_20211202_6': {'account_number_avg_order_total_per_account_number_1day': 421.565},
'RGDT_Rule_20211202_39': {'account_number_avg_order_total_per_account_number_7day': 572.83499},
'RGDT_Rule_20211202_97': {'account_number_sum_order_total_per_account_number_1day': 182.95},
'RGDT_Rule_20211202_112': {'account_number_sum_order_total_per_account_number_1day': 618.095},
'RGDT_Rule_20211202_121': {'account_number_sum_order_total_per_account_number_30day': 622.595},
'RGDT_Rule_20211202_46': {'account_number_avg_order_total_per_account_number_7day': 616.14749},
'RGDT_Rule_20211202_45': {'account_number_avg_order_total_per_account_number_7day': 616.14749},
'RGDT_Rule_20211202_23': {'account_number_avg_order_total_per_account_number_30day': 616.14749},
'RGDT_Rule_20211202_11': {'account_number_avg_order_total_per_account_number_1day': 916.82501},
'RGDT_Rule_20211202_22': {'account_number_avg_order_total_per_account_number_30day': 616.14749,
'account_number_sum_order_total_per_account_number_90day': 916.82501},
'RGDT_Rule_20211202_128': {'account_number_sum_order_total_per_account_number_30day': 916.82501},
'RGDT_Rule_20211202_114': {'account_number_sum_order_total_per_account_number_1day': 916.82501},
'RGDT_Rule_20211202_127': {'account_number_sum_order_total_per_account_number_30day': 916.82501},
'RGDT_Rule_20211202_113': {'account_number_sum_order_total_per_account_number_1day': 916.82501},
'RGDT_Rule_20211202_150': {'account_number_sum_order_total_per_account_number_90day': 916.82501},
'RGDT_Rule_20211202_141': {'account_number_sum_order_total_per_account_number_7day': 916.82501},
'RGDT_Rule_20211202_44': {'account_number_avg_order_total_per_account_number_7day': 616.14749,
'account_number_sum_order_total_per_account_number_1day': 927.45001},
'RGDT_Rule_20211202_60': {'account_number_avg_order_total_per_account_number_90day': 966.82501,
'account_number_sum_order_total_per_account_number_30day': 622.595},
'RGDT_Rule_20211202_52': {'account_number_avg_order_total_per_account_number_7day': 971.565,
'account_number_avg_order_total_per_account_number_90day': 966.82501,
'account_number_sum_order_total_per_account_number_30day': 622.595},
'RGDT_Rule_20211202_10': {'account_number_avg_order_total_per_account_number_1day': 916.82501},
'RGDT_Rule_20211202_59': {'account_number_avg_order_total_per_account_number_90day': 916.82501},
'RGDT_Rule_20211202_140': {'account_number_sum_order_total_per_account_number_7day': 916.82501,
'account_number_sum_order_total_per_account_number_90day': 971.565},
'RGDT_Rule_20211202_117': {'account_number_sum_order_total_per_account_number_1day': 971.565,
'account_number_sum_order_total_per_account_number_90day': 916.82501},
'RGDT_Rule_20211202_55': {'account_number_avg_order_total_per_account_number_90day': 225.18999,
'account_number_num_distinct_transaction_per_account_number_1day': 2.0,
'account_number_sum_order_total_per_account_number_30day': 957.375},
'RGDT_Rule_20211202_36': {'account_number_avg_order_total_per_account_number_7day': 265.78751,
'account_number_num_order_items_per_account_number_1day': 4.0},
'RGDT_Rule_20211202_78': {'account_number_num_order_items_per_account_number_1day': 4.0,
'account_number_sum_order_total_per_account_number_7day': 957.375},
'RGDT_Rule_20211202_77': {'account_number_num_order_items_per_account_number_1day': 4.0,
'account_number_sum_order_total_per_account_number_1day': 957.375},
'RGDT_Rule_20211202_64': {'account_number_num_distinct_transaction_per_account_number_1day': 3.0,
'account_number_sum_order_total_per_account_number_7day': 916.82501,
'account_number_sum_order_total_per_account_number_90day': 957.375},
'RGDT_Rule_20211202_18': {'account_number_avg_order_total_per_account_number_30day': 230.36,
'account_number_num_order_items_per_account_number_7day': 4.0,
'account_number_sum_order_total_per_account_number_7day': 1241.85498},
'RGDT_Rule_20211202_95': {'account_number_sum_order_total_per_account_number_1day': 1413.84998},
'RGDT_Rule_20211202_94': {'account_number_sum_order_total_per_account_number_1day': 1407.375,
'account_number_sum_order_total_per_account_number_7day': 957.375},
'RGDT_Rule_20211202_96': {'account_number_sum_order_total_per_account_number_1day': 1445.32501,
'account_number_sum_order_total_per_account_number_90day': 916.82501},
'RGDT_Rule_20211202_16': {'account_number_avg_order_total_per_account_number_30day': 230.36,
'account_number_num_distinct_transaction_per_account_number_1day': 3.0,
'account_number_num_order_items_per_account_number_7day': 4.0,
'account_number_sum_order_total_per_account_number_7day': 1241.85498},
'RGDT_Rule_20211202_79': {'account_number_num_order_items_per_account_number_1day': 5.0,
'account_number_num_order_items_per_account_number_lifetime': 5.0,
'account_number_sum_order_total_per_account_number_1day': 957.375,
'account_number_sum_order_total_per_account_number_30day': 916.82501},
'RGDT_Rule_20211202_75': {'account_number_num_distinct_transaction_per_account_number_90day': 4.0,
'account_number_num_order_items_per_account_number_lifetime': 4.0,
'account_number_sum_order_total_per_account_number_30day': 916.82501},
'RGDT_Rule_20211202_9': {'account_number_avg_order_total_per_account_number_1day': 916.82501,
'account_number_sum_order_total_per_account_number_30day': 1063.53003},
'RGDT_Rule_20211202_88': {'account_number_num_order_items_per_account_number_7day': 7.0},
'RGDT_Rule_20211202_43': {'account_number_avg_order_total_per_account_number_7day': 616.14749,
'account_number_num_distinct_transaction_per_account_number_30day': 2.0},
'RGDT_Rule_20211202_8': {'account_number_avg_order_total_per_account_number_1day': 916.82501,
'account_number_num_order_items_per_account_number_lifetime': 3.0},
'RGDT_Rule_20211202_15': {'account_number_avg_order_total_per_account_number_30day': 1562.48999,
'account_number_sum_order_total_per_account_number_7day': 623.125,
'num_order_items': 3.0},
'RGDT_Rule_20211202_41': {'account_number_avg_order_total_per_account_number_7day': 616.14749,
'account_number_avg_order_total_per_account_number_90day': 1123.36505,
'account_number_num_order_items_per_account_number_30day': 3.0},
'RGDT_Rule_20211202_65': {'account_number_num_distinct_transaction_per_account_number_1day': 4.0,
'account_number_sum_order_total_per_account_number_7day': 623.125,
'num_order_items': 3.0},
'RGDT_Rule_20211202_30': {'account_number_avg_order_total_per_account_number_7day': 1577.02002,
'account_number_sum_order_total_per_account_number_1day': 916.82501}},
'opt_func': <bound method FScore.fit of FScore with beta=1>,
'n_iter': 30,
'algorithm': <function hyperopt.anneal.suggest(new_ids, domain, trials, seed, *args, **kwargs)>,
'verbose': 0}
We can also plot the overall performance of each set of rules (optimised using each parameter set) on each of the validation datasets:
[26]:
gs_cv.plot_top_n_performance_by_fold(figsize=(8, 4))
Apply rules to a separate dataset¶
Use the .transform() method to apply the best performing rules overall to a separate dataset.
[27]:
X_rules_test = gs_cv.transform(
    X=X_test,
    y=y_test,
    sample_weight=None
)
Outputs¶
The .transform() method returns a dataframe giving the binary columns of the rules as applied to the given dataset.
A useful attribute created by running the .transform() method is:
rule_descriptions: A dataframe showing the logic of the optimised rules and their performance metrics as applied to the given dataset.
[28]:
gs_cv.rule_descriptions.head()
[28]:
Rule | Precision | Recall | PercDataFlagged | OptMetric | Logic | nConditions |
---|---|---|---|---|---|---|
RGDT_Rule_20211202_100 | 0.847826 | 0.336207 | 0.010497 | 0.481481 | (X['account_number_sum_order_total_per_account... | 3 |
RGDT_Rule_20211202_22 | 0.857143 | 0.310345 | 0.009585 | 0.455696 | (X['account_number_avg_order_total_per_account... | 3 |
RGDT_Rule_20211202_44 | 0.857143 | 0.310345 | 0.009585 | 0.455696 | (X['account_number_avg_order_total_per_account... | 3 |
RGDT_Rule_20211202_60 | 0.857143 | 0.310345 | 0.009585 | 0.455696 | (X['account_number_avg_order_total_per_account... | 3 |
RGDT_Rule_20211202_9 | 0.857143 | 0.310345 | 0.009585 | 0.455696 | (X['account_number_avg_order_total_per_account... | 3 |
[29]:
X_rules_test.head()
[29]:
eid | RGDT_Rule_20211202_100 | RGDT_Rule_20211202_22 | RGDT_Rule_20211202_44 | RGDT_Rule_20211202_60 | RGDT_Rule_20211202_9 | RGDT_Rule_20211202_52 | RGDT_Rule_20211202_97 | RGDT_Rule_20211202_141 | RGDT_Rule_20211202_140 | RGDT_Rule_20211202_150 | RGDT_Rule_20211202_113 | RGDT_Rule_20211202_117 | RGDT_Rule_20211202_41 | RGDT_Rule_20211202_64 | RGDT_Rule_20211202_16 | RGDT_Rule_20211202_43 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
975-8351797-7122581 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
785-6259585-7858053 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
057-4039373-1790681 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
095-5263240-3834186 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
980-3802574-0009480 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |