Grid Search CV Example

This notebook contains an example of how the Grid Search CV module can be used to search for the best parameters when generating or optimising a set of rules.

The Grid Search CV process is as follows:

  • Generate k-fold stratified cross validation datasets.

  • Generate unique sets of rule generation/optimisation parameters from the given search space.

  • For each of the training and validation datasets:

      • Train/optimise a set of rules using each unique set of parameters (using the training dataset).

      • Use the GreedyFilter class (in the rule_selection module) to calculate the maximum overall performance of each rule set on the validation dataset. See the class’ example notebook for more information.

  • Return the parameters which generated the highest mean overall performance across the validation datasets.
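
The process above can be sketched with generic scikit-learn pieces. This is a minimal illustration, not the Iguanas implementation: a DecisionTreeClassifier stands in for a rule generator, the search space and toy data are made up, and the GreedyFilter step is replaced by a plain F1 score on the validation fold.

```python
from itertools import product

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Toy, imbalanced data standing in for a labelled, processed dataset.
X, y = make_classification(
    n_samples=600, n_features=6, weights=[0.9, 0.1], random_state=0
)

# Hypothetical search space; every combination becomes one parameter set.
search_space = {'max_depth': [1, 3], 'min_samples_leaf': [1, 10]}
param_sets = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]

# Stratified folds keep the class ratio similar in each train/validation split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = np.zeros((len(param_sets), cv.get_n_splits()))
for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
    for i, params in enumerate(param_sets):
        # Fit on the training part of the fold, score on the validation part.
        model = DecisionTreeClassifier(random_state=0, **params)
        model.fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[val_idx])
        scores[i, fold] = f1_score(y[val_idx], y_pred, zero_division=0)

# Keep the parameter set with the highest mean validation score.
mean_scores = scores.mean(axis=1)
best_params = param_sets[int(mean_scores.argmax())]
print(best_params, mean_scores.max())
```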

Requirements

To run, you’ll need the following:

  • A labelled, processed dataset (nulls imputed, categorical features encoded).


Import packages

[1]:
from iguanas.rule_selection import GridSearchCV
from iguanas.rule_generation import RuleGeneratorDT
from iguanas.rule_optimisation import BayesianOptimiser
from iguanas.metrics.classification import FScore

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from hyperopt import tpe, anneal

Read in data

Let’s read in some labelled, processed dummy data.

[2]:
X_train = pd.read_csv(
    'dummy_data/X_train_gs.csv',
    index_col='eid'
)
y_train = pd.read_csv(
    'dummy_data/y_train_gs.csv',
    index_col='eid'
).squeeze()
X_test = pd.read_csv(
    'dummy_data/X_test_gs.csv',
    index_col='eid'
)
y_test = pd.read_csv(
    'dummy_data/y_test_gs.csv',
    index_col='eid'
).squeeze()

Generate rules using GridSearchCV

We can use the GridSearchCV class to implement stratified k-fold cross validation when searching for the best rule generation parameters. Because each parameter set is scored on data that was held out during training, this reduces the likelihood of choosing parameters that overfit.

Set up class parameters

We first need to define the search values for each parameter in the provided rule generation class. We define these in a dictionary, where the dictionary keys are the relevant rule generation parameters and the dictionary values are lists of search values for each parameter. The GridSearchCV class will then calculate each unique combination of parameter values, and find the set of parameters that produce the best mean overall rule performance across the folds.

Note that if you’re using the F-score, precision or recall as the optimisation function, use the FScore, Precision or Recall classes in the metrics.classification module rather than the equivalent functions from scikit-learn’s metrics module, as the former are ~100 times faster on larger datasets.
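
For reference, the quantity an F-score measures can be written in a few lines of NumPy. This is a sketch of the standard F-beta formula only; the function name f_beta is made up here and this is not how the Iguanas FScore class is implemented.

```python
import numpy as np


def f_beta(y_true, y_pred, beta=1.0):
    """F-beta score for binary labels: (1+b^2)*TP / ((1+b^2)*TP + b^2*FN + FP)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)    # flagged and actually positive
    fp = np.sum(~y_true & y_pred)   # flagged but actually negative
    fn = np.sum(y_true & ~y_pred)   # positive but not flagged
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return float((1 + beta**2) * tp / denom) if denom else 0.0


y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0]  # precision = 1, recall = 1/3

f1_value = f_beta(y_true, y_pred, beta=1)        # 0.5
f0dot5_value = f_beta(y_true, y_pred, beta=0.5)  # about 0.714
print(f1_value, f0dot5_value)
```

A beta below 1 weights precision more heavily than recall, which is why the high-precision, low-recall prediction above scores better under beta=0.5 than under beta=1.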

[3]:
f1 = FScore(beta=1)
[4]:
param_grid = {
    'opt_func': [f1.fit],
    'n_total_conditions': [1, 4],
    'tree_ensemble': [
        RandomForestClassifier(n_estimators=5, random_state=0),
        RandomForestClassifier(n_estimators=15, random_state=0),
    ],
    'target_feat_corr_types': [None, 'Infer']
}
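
To see where the number of parameter sets comes from, the grid can be expanded by hand with itertools.product. The sketch below uses placeholder strings in place of the metric and classifier objects, and takes the 3 folds used in this notebook; it only illustrates the arithmetic and makes no claim about how GridSearchCV expands the grid internally.

```python
from itertools import product

# Same keys and list lengths as param_grid; strings stand in for the objects.
grid = {
    'opt_func': ['f1'],
    'n_total_conditions': [1, 4],
    'tree_ensemble': ['rf_5_trees', 'rf_15_trees'],
    'target_feat_corr_types': [None, 'Infer'],
}

# One dictionary per unique combination of values.
param_sets = [dict(zip(grid, combo)) for combo in product(*grid.values())]

n_folds = 3
n_fits = len(param_sets) * n_folds

print(len(param_sets))  # 8 = 1 * 2 * 2 * 2
print(n_fits)           # 24 = 8 parameter sets * 3 folds
```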

Now that we have our search values, we can define the rest of the GridSearchCV class parameters. Note here that we are splitting the data into 3 folds for training/validation.

Please see the class docstring for more information on each parameter.

[5]:
params = {
    'rule_class': RuleGeneratorDT,
    'param_grid': param_grid,
    'greedy_filter_opt_func': f1.fit,
    'cv': 3,
    'num_cores': 4,
    'verbose': 1
}

Instantiate class and run fit method

Once the parameters have been set, we can run the .fit() method to search for the best rule generation parameters.

[6]:
gs_cv = GridSearchCV(**params)
8 unique parameter sets
[7]:
gs_cv.fit(
    X=X_train,
    y=y_train
)
--- Fitting and validating rules using folds ---
100%|██████████| 24/24 [00:04<00:00,  5.31it/s]
--- Re-fitting rules using best parameters on full dataset ---
--- Filtering rules to give best combined performance ---
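
The last log line refers to filtering rules to give the best combined performance. One plausible reading of that step, assumed here rather than taken from the GreedyFilter source, is: rank the rules by individual performance, combine the top n rules with a logical OR for increasing n, and keep the n with the best combined score. The function greedy_top_n and the toy rules below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score


def greedy_top_n(X_rules, y, metric=f1_score):
    """Return the top-n rules (and score) whose OR-combination scores best."""
    # Rank rules by their individual performance, best first.
    individual = {rule: metric(y, X_rules[rule]) for rule in X_rules.columns}
    ranked = sorted(individual, key=individual.get, reverse=True)

    # Add rules one at a time; a record is flagged if any rule so far flags it.
    flagged = np.zeros(len(y), dtype=int)
    combined = []
    for rule in ranked:
        flagged = np.maximum(flagged, X_rules[rule].to_numpy())
        combined.append(metric(y, flagged))

    best_n = int(np.argmax(combined)) + 1
    return ranked[:best_n], combined[best_n - 1]


y = [1, 1, 1, 1, 0, 0, 0, 0]
X_rules = pd.DataFrame({
    'rule_a': [1, 1, 0, 0, 0, 0, 0, 0],  # precise, catches two positives
    'rule_b': [0, 0, 1, 0, 0, 0, 0, 0],  # precise, catches one more
    'rule_c': [0, 0, 0, 1, 1, 1, 1, 0],  # catches the last positive, but noisy
})

selected, score = greedy_top_n(X_rules, y)
print(selected, round(score, 4))  # ['rule_a', 'rule_b'] 0.8571
```

In this toy case adding rule_c would raise recall but lower the combined F1, so only the first two rules are kept.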

Outputs

The .fit() method does not return any objects. However, it does generate the following useful attributes:

  • rule_strings (Dict[str, str]): The rules which achieved the best combined performance, defined using the standard Iguanas string format (values) and their names (keys).

  • param_results_per_fold (pd.DataFrame): Shows the best combined rule performance observed for each parameter set and fold.

  • param_results_aggregated (pd.DataFrame): Shows the mean and the standard deviation of the best combined rule performance, calculated across the folds, for each parameter set.

  • best_perf (float): The best combined rule performance achieved.

  • best_params (dict): The parameter set that achieved the best combined rule performance.

[8]:
gs_cv.param_results_per_fold.head()
[8]:
opt_func n_total_conditions tree_ensemble target_feat_corr_types Performance
Fold ParamSetIndex
0 0 <bound method FScore.fit of FScore with beta=1> 1 RandomForestClassifier(n_estimators=5, random_... None 0.511278
1 <bound method FScore.fit of FScore with beta=1> 1 RandomForestClassifier(n_estimators=5, random_... Infer 0.511278
2 <bound method FScore.fit of FScore with beta=1> 1 (DecisionTreeClassifier(max_depth=4, max_featu... None 0.532258
3 <bound method FScore.fit of FScore with beta=1> 1 (DecisionTreeClassifier(max_depth=4, max_featu... Infer 0.532258
4 <bound method FScore.fit of FScore with beta=1> 4 RandomForestClassifier(n_estimators=5, random_... None 0.614458
[9]:
gs_cv.param_results_aggregated.head()
[9]:
opt_func n_total_conditions tree_ensemble target_feat_corr_types PerformancePerFold MeanPerformance StdDevPerformance
ParamSetIndex
0 <bound method FScore.fit of FScore with beta=1> 1 RandomForestClassifier(n_estimators=5, random_... None [0.5112781954887218, 0.35185185185185186, 0.38... 0.417464 0.068072
1 <bound method FScore.fit of FScore with beta=1> 1 RandomForestClassifier(n_estimators=5, random_... Infer [0.5112781954887218, 0.35185185185185186, 0.38... 0.417464 0.068072
2 <bound method FScore.fit of FScore with beta=1> 1 (DecisionTreeClassifier(max_depth=4, max_featu... None [0.532258064516129, 0.38961038961038963, 0.402... 0.441582 0.064346
3 <bound method FScore.fit of FScore with beta=1> 1 (DecisionTreeClassifier(max_depth=4, max_featu... Infer [0.532258064516129, 0.38961038961038963, 0.402... 0.441582 0.064346
4 <bound method FScore.fit of FScore with beta=1> 4 RandomForestClassifier(n_estimators=5, random_... None [0.6144578313253012, 0.538860103626943, 0.5] 0.551106 0.047523
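
The aggregated columns can be checked by hand for the one row whose PerformancePerFold list is shown in full (ParamSetIndex 4). The sketch below copies those three values from the output above; the reported StdDevPerformance appears to be consistent with a population standard deviation (ddof=0), although that is inferred from the numbers rather than from the library source.

```python
import numpy as np

# PerformancePerFold for ParamSetIndex 4, copied from the output above.
per_fold = [0.6144578313253012, 0.538860103626943, 0.5]

mean_perf = np.mean(per_fold)
std_population = np.std(per_fold)       # ddof=0, NumPy's default
std_sample = np.std(per_fold, ddof=1)   # shown only for comparison

print(round(mean_perf, 6))       # 0.551106, matches MeanPerformance
print(round(std_population, 6))  # 0.047523, matches StdDevPerformance
```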
[10]:
gs_cv.best_perf
[10]:
0.5829592533675241
[11]:
gs_cv.best_params
[11]:
{'opt_func': <bound method FScore.fit of FScore with beta=1>,
 'n_total_conditions': 4,
 'tree_ensemble': RandomForestClassifier(max_depth=4, n_estimators=15, random_state=0),
 'target_feat_corr_types': 'Infer'}

We can also plot the overall performance of each set of rules (generated using each parameter set) on each of the validation datasets:

[12]:
gs_cv.plot_top_n_performance_by_fold(figsize=(8, 4))
[Output: 8 performance plots (images not included)]

Apply rules to a separate dataset

Use the .transform() method to apply the best performing rules overall to a separate dataset.

[13]:
X_rules_test = gs_cv.transform(
    X=X_test,
    y=y_test,
    sample_weight=None
)

Outputs

The .transform() method returns a dataframe containing one binary column per rule, showing which records in the given dataset each rule flags.

A useful attribute created by running the .transform() method is:

  • rule_descriptions: A dataframe showing the logic of the generated rules and their performance metrics as applied to the given dataset.
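
To illustrate what "binary columns of the rules" means, the sketch below evaluates two made-up rule strings against a small dataframe. The string form, conditions on X['feature'] joined with &, is inferred from the truncated Logic column in this notebook; the feature names, thresholds and use of eval are assumptions for illustration and not the Iguanas implementation.

```python
import pandas as pd

# Toy dataset; the index plays the role of the 'eid' column.
X = pd.DataFrame(
    {'order_total': [50.0, 400.0, 950.0], 'num_order_items': [1, 5, 2]},
    index=['eid_a', 'eid_b', 'eid_c'],
)

# Hypothetical rules written as strings that reference the dataframe X.
rule_strings = {
    'Rule_1': "(X['order_total']>373.625)",
    'Rule_2': "(X['order_total']>373.625)&(X['num_order_items']>=3)",
}

# Evaluate each rule (trusted strings only) and store the flags as 0/1.
X_rules = pd.DataFrame({
    name: eval(logic, {'X': X}).astype(int)
    for name, logic in rule_strings.items()
})
print(X_rules)
```

Each column holds 1 where the rule's conditions are met and 0 otherwise, indexed by the same record identifiers as the input.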

[14]:
gs_cv.rule_descriptions.head()
[14]:
Rule                    Precision  Recall    PercDataFlagged  OptMetric  Logic                                               nConditions
RGDT_Rule_20211202_100  0.847826   0.336207  0.010497         0.481481   (X['account_number_sum_order_total_per_account...  3
RGDT_Rule_20211202_6    0.860465   0.318966  0.009813         0.465409   (X['account_number_avg_order_total_per_account...  2
RGDT_Rule_20211202_104  0.860465   0.318966  0.009813         0.465409   (X['account_number_sum_order_total_per_account...  2
RGDT_Rule_20211202_39   0.837838   0.267241  0.008444         0.405229   (X['account_number_avg_order_total_per_account...  2
RGDT_Rule_20211202_112  0.852941   0.250000  0.007759         0.386667   (X['account_number_sum_order_total_per_account...  2
[15]:
X_rules_test.head()
[15]:
Rule RGDT_Rule_20211202_100 RGDT_Rule_20211202_6 RGDT_Rule_20211202_104 RGDT_Rule_20211202_39 RGDT_Rule_20211202_112 RGDT_Rule_20211202_46 RGDT_Rule_20211202_97 RGDT_Rule_20211202_121 RGDT_Rule_20211202_45 RGDT_Rule_20211202_23 ... RGDT_Rule_20211202_95 RGDT_Rule_20211202_94 RGDT_Rule_20211202_16 RGDT_Rule_20211202_88 RGDT_Rule_20211202_8 RGDT_Rule_20211202_43 RGDT_Rule_20211202_65 RGDT_Rule_20211202_15 RGDT_Rule_20211202_41 RGDT_Rule_20211202_30
eid
975-8351797-7122581 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
785-6259585-7858053 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
057-4039373-1790681 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
095-5263240-3834186 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
980-3802574-0009480 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 45 columns


Optimise rules using GridSearchCV

We can use the GridSearchCV class to implement stratified k-fold cross validation when searching for the best rule optimisation parameters. Because each parameter set is scored on data that was held out during optimisation, this reduces the likelihood of choosing parameters that overfit.

Set up class parameters

We first need to define the search values for each parameter in the provided rule optimisation class. We define these in a dictionary, where the dictionary keys are the relevant rule optimisation parameters and the dictionary values are lists of search values for each parameter. The GridSearchCV class will then calculate each unique combination of parameter values, and find the set of parameters that produces the best mean overall rule performance across the folds.

Note that if you’re using the F-score, precision or recall as the optimisation function, use the FScore, Precision or Recall classes in the metrics.classification module rather than the equivalent functions from scikit-learn’s metrics module, as the former are ~100 times faster on larger datasets.

[16]:
f0dot5 = FScore(beta=0.5)
f1 = FScore(beta=1)

In this example, we’ll optimise the rules we generated in the previous Grid Search exercise. To do this, we need to convert the generated rules into the standard Iguanas lambda expression format:

[17]:
rule_lambdas = gs_cv.as_rule_lambdas(
    as_numpy=False,
    with_kwargs=True
)
lambda_kwargs = gs_cv.lambda_kwargs
[18]:
param_grid = {
    'rule_lambdas': [rule_lambdas],
    'lambda_kwargs': [lambda_kwargs],
    'opt_func': [f0dot5.fit, f1.fit],
    'n_iter': [30],
    'algorithm': [tpe.suggest, anneal.suggest],
    'verbose': [0]
}
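
The best_params output in this notebook shows that each rule lambda accepts keyword arguments and that lambda_kwargs maps each rule name to a dictionary of feature names and values. A minimal sketch of that idea follows. It assumes that calling a lambda with its keyword arguments yields the rule as a string; the rule name, feature name and lambda body are hypothetical and are not what as_rule_lambdas() actually generates.

```python
# Hypothetical rule: the threshold is left as a named placeholder.
rule_lambdas = {
    'Rule_1': lambda **kwargs: "(X['order_total']>{order_total})".format(**kwargs),
}

# Values to plug into the placeholders of each rule.
lambda_kwargs = {
    'Rule_1': {'order_total': 373.625},
}

# Injecting the stored values yields a concrete rule string.
rule_strings = {
    name: rule_lambda(**lambda_kwargs[name])
    for name, rule_lambda in rule_lambdas.items()
}
print(rule_strings['Rule_1'])  # (X['order_total']>373.625)

# An optimiser can try a different threshold without rewriting the rule.
candidate = rule_lambdas['Rule_1'](order_total=500.0)
print(candidate)  # (X['order_total']>500.0)
```

Keeping the thresholds separate from the rule logic is what lets an optimiser search over the values while the structure of each rule stays fixed.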

Now that we have our search values, we can define the rest of the GridSearchCV class parameters. Note here that we are splitting the data into 3 folds for training/validation.

Please see the class docstring for more information on each parameter.

[19]:
params = {
    'rule_class': BayesianOptimiser,
    'param_grid': param_grid,
    'greedy_filter_opt_func': f1.fit,
    'cv': 3,
    'num_cores': 4,
    'verbose': 1
}

Instantiate class and run fit method

Once the parameters have been set, we can run the .fit() method to search for the best rule optimisation parameters.

[20]:
gs_cv = GridSearchCV(**params)
4 unique parameter sets
[21]:
gs_cv.fit(
    X=X_train,
    y=y_train
)
--- Fitting and validating rules using folds ---
100%|██████████| 12/12 [00:07<00:00,  1.67it/s]
--- Re-fitting rules using best parameters on full dataset ---
--- Filtering rules to give best combined performance ---

Outputs

The .fit() method does not return any objects. However, it does generate the following useful attributes:

  • rule_strings (Dict[str, str]): The rules which achieved the best combined performance, defined using the standard Iguanas string format (values) and their names (keys).

  • param_results_per_fold (pd.DataFrame): Shows the best combined rule performance observed for each parameter set and fold.

  • param_results_aggregated (pd.DataFrame): Shows the mean and the standard deviation of the best combined rule performance, calculated across the folds, for each parameter set.

  • best_perf (float): The best combined rule performance achieved.

  • best_params (dict): The parameter set that achieved the best combined rule performance.

[22]:
gs_cv.param_results_per_fold.head()
[22]:
rule_lambdas lambda_kwargs opt_func n_iter algorithm verbose Performance
Fold ParamSetIndex
0 0 {'RGDT_Rule_20211202_100': <function ConvertRu... {'RGDT_Rule_20211202_100': {'account_number_su... <bound method FScore.fit of FScore with beta=0.5> 30 <function suggest at 0x7f9460945700> 0 0.608108
1 {'RGDT_Rule_20211202_100': <function ConvertRu... {'RGDT_Rule_20211202_100': {'account_number_su... <bound method FScore.fit of FScore with beta=0.5> 30 <function suggest at 0x7f9460954ee0> 0 0.614379
2 {'RGDT_Rule_20211202_100': <function ConvertRu... {'RGDT_Rule_20211202_100': {'account_number_su... <bound method FScore.fit of FScore with beta=1> 30 <function suggest at 0x7f9460945700> 0 0.574586
3 {'RGDT_Rule_20211202_100': <function ConvertRu... {'RGDT_Rule_20211202_100': {'account_number_su... <bound method FScore.fit of FScore with beta=1> 30 <function suggest at 0x7f9460954ee0> 0 0.633803
1 0 {'RGDT_Rule_20211202_100': <function ConvertRu... {'RGDT_Rule_20211202_100': {'account_number_su... <bound method FScore.fit of FScore with beta=0.5> 30 <function suggest at 0x7f9460945700> 0 0.546763
[23]:
gs_cv.param_results_aggregated.head()
[23]:
rule_lambdas lambda_kwargs opt_func n_iter algorithm verbose PerformancePerFold MeanPerformance StdDevPerformance
ParamSetIndex
0 {'RGDT_Rule_20211202_100': <function ConvertRu... {'RGDT_Rule_20211202_100': {'account_number_su... <bound method FScore.fit of FScore with beta=0.5> 30 <function suggest at 0x7f9460945700> 0 [0.608108108108108, 0.5467625899280575, 0.5350... 0.563301 0.032043
1 {'RGDT_Rule_20211202_100': <function ConvertRu... {'RGDT_Rule_20211202_100': {'account_number_su... <bound method FScore.fit of FScore with beta=0.5> 30 <function suggest at 0x7f9460954ee0> 0 [0.6143790849673203, 0.562962962962963, 0.5323... 0.569905 0.033836
2 {'RGDT_Rule_20211202_100': <function ConvertRu... {'RGDT_Rule_20211202_100': {'account_number_su... <bound method FScore.fit of FScore with beta=1> 30 <function suggest at 0x7f9460945700> 0 [0.574585635359116, 0.5454545454545454, 0.5362... 0.552091 0.016346
3 {'RGDT_Rule_20211202_100': <function ConvertRu... {'RGDT_Rule_20211202_100': {'account_number_su... <bound method FScore.fit of FScore with beta=1> 30 <function suggest at 0x7f9460954ee0> 0 [0.6338028169014086, 0.5735294117647058, 0.549... 0.585451 0.035624
[24]:
gs_cv.best_perf
[24]:
0.5854506121697506
[25]:
gs_cv.best_params
[25]:
{'rule_lambdas': {'RGDT_Rule_20211202_100': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_104': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_6': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_39': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_97': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_112': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_121': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_46': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_45': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_23': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_11': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_22': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_128': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_114': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_127': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_113': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_150': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_141': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_44': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_60': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_52': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_10': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_59': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_140': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_117': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_55': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_36': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_78': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_77': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_64': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_18': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_95': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_94': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_96': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_16': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_79': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_75': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_9': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_88': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_43': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_8': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_15': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_41': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_65': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211202_30': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>},
 'lambda_kwargs': {'RGDT_Rule_20211202_100': {'account_number_sum_order_total_per_account_number_1day': 373.625,
   'account_number_sum_order_total_per_account_number_90day': 259.27},
  'RGDT_Rule_20211202_104': {'account_number_sum_order_total_per_account_number_1day': 421.565},
  'RGDT_Rule_20211202_6': {'account_number_avg_order_total_per_account_number_1day': 421.565},
  'RGDT_Rule_20211202_39': {'account_number_avg_order_total_per_account_number_7day': 572.83499},
  'RGDT_Rule_20211202_97': {'account_number_sum_order_total_per_account_number_1day': 182.95},
  'RGDT_Rule_20211202_112': {'account_number_sum_order_total_per_account_number_1day': 618.095},
  'RGDT_Rule_20211202_121': {'account_number_sum_order_total_per_account_number_30day': 622.595},
  'RGDT_Rule_20211202_46': {'account_number_avg_order_total_per_account_number_7day': 616.14749},
  'RGDT_Rule_20211202_45': {'account_number_avg_order_total_per_account_number_7day': 616.14749},
  'RGDT_Rule_20211202_23': {'account_number_avg_order_total_per_account_number_30day': 616.14749},
  'RGDT_Rule_20211202_11': {'account_number_avg_order_total_per_account_number_1day': 916.82501},
  'RGDT_Rule_20211202_22': {'account_number_avg_order_total_per_account_number_30day': 616.14749,
   'account_number_sum_order_total_per_account_number_90day': 916.82501},
  'RGDT_Rule_20211202_128': {'account_number_sum_order_total_per_account_number_30day': 916.82501},
  'RGDT_Rule_20211202_114': {'account_number_sum_order_total_per_account_number_1day': 916.82501},
  'RGDT_Rule_20211202_127': {'account_number_sum_order_total_per_account_number_30day': 916.82501},
  'RGDT_Rule_20211202_113': {'account_number_sum_order_total_per_account_number_1day': 916.82501},
  'RGDT_Rule_20211202_150': {'account_number_sum_order_total_per_account_number_90day': 916.82501},
  'RGDT_Rule_20211202_141': {'account_number_sum_order_total_per_account_number_7day': 916.82501},
  'RGDT_Rule_20211202_44': {'account_number_avg_order_total_per_account_number_7day': 616.14749,
   'account_number_sum_order_total_per_account_number_1day': 927.45001},
  'RGDT_Rule_20211202_60': {'account_number_avg_order_total_per_account_number_90day': 966.82501,
   'account_number_sum_order_total_per_account_number_30day': 622.595},
  'RGDT_Rule_20211202_52': {'account_number_avg_order_total_per_account_number_7day': 971.565,
   'account_number_avg_order_total_per_account_number_90day': 966.82501,
   'account_number_sum_order_total_per_account_number_30day': 622.595},
  'RGDT_Rule_20211202_10': {'account_number_avg_order_total_per_account_number_1day': 916.82501},
  'RGDT_Rule_20211202_59': {'account_number_avg_order_total_per_account_number_90day': 916.82501},
  'RGDT_Rule_20211202_140': {'account_number_sum_order_total_per_account_number_7day': 916.82501,
   'account_number_sum_order_total_per_account_number_90day': 971.565},
  'RGDT_Rule_20211202_117': {'account_number_sum_order_total_per_account_number_1day': 971.565,
   'account_number_sum_order_total_per_account_number_90day': 916.82501},
  'RGDT_Rule_20211202_55': {'account_number_avg_order_total_per_account_number_90day': 225.18999,
   'account_number_num_distinct_transaction_per_account_number_1day': 2.0,
   'account_number_sum_order_total_per_account_number_30day': 957.375},
  'RGDT_Rule_20211202_36': {'account_number_avg_order_total_per_account_number_7day': 265.78751,
   'account_number_num_order_items_per_account_number_1day': 4.0},
  'RGDT_Rule_20211202_78': {'account_number_num_order_items_per_account_number_1day': 4.0,
   'account_number_sum_order_total_per_account_number_7day': 957.375},
  'RGDT_Rule_20211202_77': {'account_number_num_order_items_per_account_number_1day': 4.0,
   'account_number_sum_order_total_per_account_number_1day': 957.375},
  'RGDT_Rule_20211202_64': {'account_number_num_distinct_transaction_per_account_number_1day': 3.0,
   'account_number_sum_order_total_per_account_number_7day': 916.82501,
   'account_number_sum_order_total_per_account_number_90day': 957.375},
  'RGDT_Rule_20211202_18': {'account_number_avg_order_total_per_account_number_30day': 230.36,
   'account_number_num_order_items_per_account_number_7day': 4.0,
   'account_number_sum_order_total_per_account_number_7day': 1241.85498},
  'RGDT_Rule_20211202_95': {'account_number_sum_order_total_per_account_number_1day': 1413.84998},
  'RGDT_Rule_20211202_94': {'account_number_sum_order_total_per_account_number_1day': 1407.375,
   'account_number_sum_order_total_per_account_number_7day': 957.375},
  'RGDT_Rule_20211202_96': {'account_number_sum_order_total_per_account_number_1day': 1445.32501,
   'account_number_sum_order_total_per_account_number_90day': 916.82501},
  'RGDT_Rule_20211202_16': {'account_number_avg_order_total_per_account_number_30day': 230.36,
   'account_number_num_distinct_transaction_per_account_number_1day': 3.0,
   'account_number_num_order_items_per_account_number_7day': 4.0,
   'account_number_sum_order_total_per_account_number_7day': 1241.85498},
  'RGDT_Rule_20211202_79': {'account_number_num_order_items_per_account_number_1day': 5.0,
   'account_number_num_order_items_per_account_number_lifetime': 5.0,
   'account_number_sum_order_total_per_account_number_1day': 957.375,
   'account_number_sum_order_total_per_account_number_30day': 916.82501},
  'RGDT_Rule_20211202_75': {'account_number_num_distinct_transaction_per_account_number_90day': 4.0,
   'account_number_num_order_items_per_account_number_lifetime': 4.0,
   'account_number_sum_order_total_per_account_number_30day': 916.82501},
  'RGDT_Rule_20211202_9': {'account_number_avg_order_total_per_account_number_1day': 916.82501,
   'account_number_sum_order_total_per_account_number_30day': 1063.53003},
  'RGDT_Rule_20211202_88': {'account_number_num_order_items_per_account_number_7day': 7.0},
  'RGDT_Rule_20211202_43': {'account_number_avg_order_total_per_account_number_7day': 616.14749,
   'account_number_num_distinct_transaction_per_account_number_30day': 2.0},
  'RGDT_Rule_20211202_8': {'account_number_avg_order_total_per_account_number_1day': 916.82501,
   'account_number_num_order_items_per_account_number_lifetime': 3.0},
  'RGDT_Rule_20211202_15': {'account_number_avg_order_total_per_account_number_30day': 1562.48999,
   'account_number_sum_order_total_per_account_number_7day': 623.125,
   'num_order_items': 3.0},
  'RGDT_Rule_20211202_41': {'account_number_avg_order_total_per_account_number_7day': 616.14749,
   'account_number_avg_order_total_per_account_number_90day': 1123.36505,
   'account_number_num_order_items_per_account_number_30day': 3.0},
  'RGDT_Rule_20211202_65': {'account_number_num_distinct_transaction_per_account_number_1day': 4.0,
   'account_number_sum_order_total_per_account_number_7day': 623.125,
   'num_order_items': 3.0},
  'RGDT_Rule_20211202_30': {'account_number_avg_order_total_per_account_number_7day': 1577.02002,
   'account_number_sum_order_total_per_account_number_1day': 916.82501}},
 'opt_func': <bound method FScore.fit of FScore with beta=1>,
 'n_iter': 30,
 'algorithm': <function hyperopt.anneal.suggest(new_ids, domain, trials, seed, *args, **kwargs)>,
 'verbose': 0}

We can also plot the overall performance of each set of rules (optimised using each parameter set) on each of the validation datasets:

[26]:
gs_cv.plot_top_n_performance_by_fold(figsize=(8, 4))
[Output: 4 performance plots (images not included)]

Apply rules to a separate dataset

Use the .transform() method to apply the best performing rules overall to a separate dataset.

[27]:
X_rules_test = gs_cv.transform(
    X=X_test,
    y=y_test,
    sample_weight=None
)

Outputs

The .transform() method returns a dataframe containing one binary column per rule, showing which records in the given dataset each rule flags.

A useful attribute created by running the .transform() method is:

  • rule_descriptions: A dataframe showing the logic of the optimised rules and their performance metrics as applied to the given dataset.

[28]:
gs_cv.rule_descriptions.head()
[28]:
Rule                    Precision  Recall    PercDataFlagged  OptMetric  Logic                                               nConditions
RGDT_Rule_20211202_100  0.847826   0.336207  0.010497         0.481481   (X['account_number_sum_order_total_per_account...  3
RGDT_Rule_20211202_22   0.857143   0.310345  0.009585         0.455696   (X['account_number_avg_order_total_per_account...  3
RGDT_Rule_20211202_44   0.857143   0.310345  0.009585         0.455696   (X['account_number_avg_order_total_per_account...  3
RGDT_Rule_20211202_60   0.857143   0.310345  0.009585         0.455696   (X['account_number_avg_order_total_per_account...  3
RGDT_Rule_20211202_9    0.857143   0.310345  0.009585         0.455696   (X['account_number_avg_order_total_per_account...  3
[29]:
X_rules_test.head()
[29]:
Rule RGDT_Rule_20211202_100 RGDT_Rule_20211202_22 RGDT_Rule_20211202_44 RGDT_Rule_20211202_60 RGDT_Rule_20211202_9 RGDT_Rule_20211202_52 RGDT_Rule_20211202_97 RGDT_Rule_20211202_141 RGDT_Rule_20211202_140 RGDT_Rule_20211202_150 RGDT_Rule_20211202_113 RGDT_Rule_20211202_117 RGDT_Rule_20211202_41 RGDT_Rule_20211202_64 RGDT_Rule_20211202_16 RGDT_Rule_20211202_43
eid
975-8351797-7122581 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
785-6259585-7858053 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
057-4039373-1790681 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
095-5263240-3834186 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
980-3802574-0009480 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0