Direct Search Optimiser Example

This notebook shows how to use the Direct Search Optimiser module to optimise the thresholds of an existing set of rules against a labelled dataset, using direct search-type optimisation algorithms.

Requirements

To run, you’ll need the following:

  • A rule set stored in the standard Iguanas lambda expression format, along with the keyword arguments for each lambda expression (the "Read in the rules" section below shows how to generate both).

  • A labelled dataset containing the features present in the rule set.


Import packages

[1]:
from iguanas.rule_optimisation import DirectSearchOptimiser
from iguanas.rules import Rules
from iguanas.rule_application import RuleApplier
from iguanas.metrics.classification import FScore

import pandas as pd
from sklearn.model_selection import train_test_split

Read in data

Firstly, we need to read in the raw data containing the features and the fraud label:

[2]:
data = pd.read_csv(
    'dummy_data/dummy_pipeline_output_data.csv',
    index_col='eid'
)

Then we need to split out the dataset into the features (X) and the target column (y):

[3]:
fraud_column = 'sim_is_fraud'
X = data.drop(fraud_column, axis=1)
y = data[fraud_column]

Finally, we can split the features and target column into training and test sets:

[4]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.33,
    random_state=0
)

Read in the rules

In this example, we’ll read in the rule conditions from a pickle file, where they are stored in the standard Iguanas string format. However, you can use any Iguanas-ready rule format - see the example notebook in the rules module.

[5]:
import pickle
with open('dummy_data/rule_strings.pkl', 'rb') as f:
    rule_strings = pickle.load(f)

Now we can instantiate the Rules class with these rules:

[6]:
rules = Rules(rule_strings=rule_strings)

We now need to convert the rules into the standard Iguanas lambda expression format. This format allows new threshold values to be injected into the rule condition before it is evaluated, which is how the Direct Search Optimiser finds the optimal threshold values:

[7]:
rule_lambdas = rules.as_rule_lambdas(
    as_numpy=False,
    with_kwargs=True
)

By converting the rule conditions to the standard Iguanas lambda expression format, we also generate a dictionary which gives the keyword arguments to each lambda expression (this dictionary is saved as the class attribute lambda_kwargs). Using these keyword arguments as inputs to the lambda expressions will convert them into the standard Iguanas string format.
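To make the threshold-injection idea concrete, here is a minimal, dependency-free sketch. The lambda and keyword arguments are hand-written stand-ins for one entry of `rule_lambdas` and `lambda_kwargs`; the feature name and threshold values are made up for illustration and are not taken from the rule set used in this notebook.

```python
# Hand-written stand-in for one rule in the lambda expression format.
# The feature name and thresholds below are hypothetical.
rule_lambda = lambda **kwargs: "(X['order_total']>{order_total})".format(**kwargs)

# Stand-in for the matching keyword arguments (the original threshold).
original_kwargs = {'order_total': 500.0}

# Passing the original keyword arguments reproduces the rule as a string.
original_rule = rule_lambda(**original_kwargs)

# Passing a different threshold gives a new candidate rule string,
# which an optimiser could then evaluate against the labelled data.
candidate_rule = rule_lambda(order_total=622.07)

print(original_rule)
print(candidate_rule)
```

An optimiser only needs to vary the keyword argument values; the structure of the rule condition stays fixed.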


Optimise rules

Set up class parameters

Now we can set our class parameters for the Direct Search Optimiser.

Here we’re using the F1 score as the optimisation function (you can choose a different function from the metrics module or create your own - see the classification.ipynb example notebook in the metrics module).

Note that if you’re using the FScore, Precision or Recall score as the optimisation function, use the FScore, Precision or Recall classes in the metrics.classification module rather than the same functions from Sklearn’s metrics module, as the former are ~100 times faster on larger datasets.
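For readers who want to see what such an optimisation function computes, the sketch below is a plain-Python F-beta score for a binary rule column. The function name `f_beta` and its argument order are assumptions made for this illustration; the metrics module notebook describes the interface a custom function actually needs to follow, and the built-in FScore class should be preferred in practice.

```python
def f_beta(y_preds, y_true, beta=1.0):
    """F-beta score of binary rule flags against binary labels (illustrative)."""
    pairs = list(zip(y_preds, y_true))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)  # flagged and fraud
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)  # flagged, not fraud
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)  # missed fraud
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Tiny made-up example: 3 fraud cases, the rule catches 2 and raises 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0]
y_preds = [1, 1, 0, 1, 0, 0]
score = f_beta(y_preds, y_true, beta=1.0)
print(score)
```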

We’re also using the Nelder-Mead algorithm, which often benefits from setting the optional initial_simplex parameter. This is set through the options keyword argument. Here, we’ll generate the initial simplex of each rule using the create_initial_simplexes() method (but you can create your own if required).
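As background, Nelder-Mead searches over N thresholds using a simplex of N + 1 points; SciPy's Nelder-Mead solver, for instance, accepts an `initial_simplex` array of shape (N + 1, N) through its `options` argument. The sketch below shows one plausible way to build a simplex anchored at the minimum of each feature. It is not the Iguanas implementation of create_initial_simplexes(): the helper name `minimum_anchored_simplex` and the feature ranges are hypothetical.

```python
def minimum_anchored_simplex(feature_ranges):
    """Build N + 1 vertices for N thresholds from (min, max) pairs.

    The first vertex sits at every feature's minimum; each further
    vertex moves exactly one coordinate to that feature's maximum.
    """
    mins = [lo for lo, _ in feature_ranges]
    simplex = [list(mins)]
    for i, (_, hi) in enumerate(feature_ranges):
        vertex = list(mins)
        vertex[i] = hi
        simplex.append(vertex)
    return simplex


# Hypothetical (min, max) of the two features used by a two-condition rule.
feature_ranges = [(0.0, 5.0), (10.0, 2000.0)]
simplex = minimum_anchored_simplex(feature_ranges)

# Shape an options dictionary in the style a Nelder-Mead solver expects.
options = {'initial_simplex': simplex}
print(options)
```

A simplex that spans the observed range of each feature gives the search a chance to move thresholds across the whole data distribution rather than only near their starting values.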

Please see the class docstring for more information on each parameter.

[8]:
f1 = FScore(beta=1)
[9]:
initial_simplexes = DirectSearchOptimiser.create_initial_simplexes(
    X=X_train,
    lambda_kwargs=rules.lambda_kwargs,
    shape='Minimum-based'
)
[10]:
params = {
    'rule_lambdas': rule_lambdas,
    'lambda_kwargs': rules.lambda_kwargs,
    'opt_func': f1.fit,
    'method': 'Nelder-Mead',
    'options': initial_simplexes,
    'verbose': 1,
}

Instantiate class and run fit method

Once the parameters have been set, we can run the .fit() method to optimise the rules.

[11]:
ro = DirectSearchOptimiser(**params)
[12]:
X_rules = ro.fit(
    X=X_train,
    y=y_train,
    sample_weight=None
)
--- Checking for rules with features that are missing in `X` ---
100%|██████████| 19/19 [00:00<00:00, 44922.08it/s]
--- Checking for rules that exclusively contain non-optimisable conditions ---
100%|██████████| 19/19 [00:00<00:00, 51747.91it/s]
--- Checking for rules that exclusively contain zero-variance features ---
100%|██████████| 19/19 [00:00<00:00, 162305.04it/s]
--- Optimising rules ---
100%|██████████| 19/19 [00:06<00:00,  2.90it/s]

Outputs

The .fit() method returns a dataframe giving the binary columns of the optimised rules as applied to the training dataset. It also creates the following attributes:

  • rule_strings (dict): The optimised rules stored in the standard Iguanas string format (values) and their names (keys).

  • rule_descriptions (pd.DataFrame): A dataframe showing the logic of the rules and their performance metrics on the training dataset.

  • rule_names_missing_features (list): Names of rules which use features that are not present in the dataset (and therefore can’t be optimised or applied).

  • rule_names_no_opt_conditions (list): Names of rules which have no optimisable conditions (e.g. rules that only contain string-based conditions).

  • rule_names_zero_var_features (list): Names of rules which exclusively contain zero variance features (based on X), so cannot be optimised.

  • opt_rule_performances (dict): The optimisation metric (values) calculated for each optimised rule (keys) on the training set.

  • orig_rule_performances (dict): The optimisation metric (values) calculated for each original rule (keys) on the training set.

[13]:
X_rules.head()
[13]:
Rule HighFraudTxnPerAccountNum RGDT_Rule195 RGDT_Rule263 RGDT_Rule272 RGDT_Rule45 RGDT_Rule193 RGDT_Rule112 RGDT_Rule137 RGDT_Rule24 RGDT_Rule153 RGDT_Rule313 RGDT_Rule35 RGDT_Rule241 RGDT_Rule162 RGDT_Rule65 ComplicatedRule RGDT_Rule256 RGDT_Rule2 RGDT_Rule81
eid
503-0182982-0226911 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0
516-2441570-6696110 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
475-5982298-4197297 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
935-3613661-7862154 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
936-1684183-4550418 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[14]:
ro.rule_strings
[14]:
{'RGDT_Rule137': "((X['account_number_avg_order_total_per_account_number_7day']<=1319.4598177719115)|(X['account_number_avg_order_total_per_account_number_7day'].isna()))&(X['account_number_num_fraud_transactions_per_account_number_1day']>=0.05990791320800781)&(X['account_number_num_fraud_transactions_per_account_number_90day']>=-4.349979400634766)&((X['account_number_sum_order_total_per_account_number_1day']<=1828.5315874099724)|(X['account_number_sum_order_total_per_account_number_1day'].isna()))",
 'RGDT_Rule81': "(X['account_number_avg_order_total_per_account_number_1day']>617.69)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=1.0)&((X['account_number_num_order_items_per_account_number_30day']<=3.0)|(X['account_number_num_order_items_per_account_number_30day'].isna()))&((X['account_number_sum_order_total_per_account_number_1day']<=627.51999)|(X['account_number_sum_order_total_per_account_number_1day'].isna()))",
 'HighFraudTxnPerAccountNum': "(X['account_number_num_fraud_transactions_per_account_number_1day']>=1.0)",
 'RGDT_Rule256': "(X['account_number_num_fraud_transactions_per_account_number_1day']>=1.0)&((X['account_number_sum_order_total_per_account_number_90day']<=950.89999)|(X['account_number_sum_order_total_per_account_number_90day'].isna()))&(X['is_billing_shipping_city_same']==False)",
 'RGDT_Rule35': "(X['account_number_num_fraud_transactions_per_account_number_7day']>=0.5185185185185187)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=-7.777777777777779)&((X['account_number_num_order_items_per_account_number_lifetime']<=27.777777777777782)|(X['account_number_num_order_items_per_account_number_lifetime'].isna()))&(X['is_existing_user']==True)",
 'RGDT_Rule193': "(X['account_number_num_distinct_transaction_per_account_number_30day']>=-2.515625)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=0.8203125)",
 'RGDT_Rule241': "(X['account_number_avg_order_total_per_account_number_90day']>322.34253616608135)&(X['account_number_num_fraud_transactions_per_account_number_1day']>=-1.3512841817262844)&((X['account_number_sum_order_total_per_account_number_1day']<=2569.17933672239)|(X['account_number_sum_order_total_per_account_number_1day'].isna()))&(X['account_number_sum_order_total_per_account_number_30day']>570.010761076064)",
 'RGDT_Rule263': "(X['account_number_num_fraud_transactions_per_account_number_30day']>=0.1640625)&((X['account_number_sum_order_total_per_account_number_1day']<=4216.703906250001)|(X['account_number_sum_order_total_per_account_number_1day'].isna()))&(X['account_number_sum_order_total_per_account_number_7day']>-4464.745312499999)&((X['account_number_sum_order_total_per_account_number_90day']<=6697.117968749999)|(X['account_number_sum_order_total_per_account_number_90day'].isna()))",
 'RGDT_Rule313': "(X['account_number_num_fraud_transactions_per_account_number_1day']>=0.9541441152207949)&(X['account_number_sum_order_total_per_account_number_30day']>-18.92841299541044)&(X['account_number_sum_order_total_per_account_number_90day']>-854.424793523272)&(X['is_existing_user']==True)",
 'RGDT_Rule195': "(X['account_number_num_fraud_transactions_per_account_number_1day']>=0.45969962351955473)&(X['account_number_num_fraud_transactions_per_account_number_30day']>=0.697719914605841)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=0.7697551276069134)&(X['account_number_num_order_items_per_account_number_7day']>=-0.7995512732304633)",
 'RGDT_Rule153': "(X['account_number_avg_order_total_per_account_number_90day']>-369.66274632096327)&(X['account_number_num_fraud_transactions_per_account_number_1day']>=0.7715035974979401)&(X['account_number_num_order_items_per_account_number_lifetime']>=1.7915962636470795)&((X['order_total']<=1479.2942406350376)|(X['order_total'].isna()))",
 'RGDT_Rule112': "(X['account_number_num_distinct_transaction_per_account_number_90day']>=-1.7096405029296875)&(X['account_number_num_fraud_transactions_per_account_number_30day']>=-3.7865753173828125)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=0.07733154296875)&((X['account_number_num_order_items_per_account_number_90day']<=15.665817260742188)|(X['account_number_num_order_items_per_account_number_90day'].isna()))",
 'RGDT_Rule2': "((X['account_number_num_distinct_transaction_per_account_number_1day']<=2.0)|(X['account_number_num_distinct_transaction_per_account_number_1day'].isna()))&(X['account_number_num_fraud_transactions_per_account_number_7day']>=1.0)&((X['account_number_num_fraud_transactions_per_account_number_lifetime']<=1.0)|(X['account_number_num_fraud_transactions_per_account_number_lifetime'].isna()))&(X['is_existing_user']==True)",
 'RGDT_Rule65': "(X['account_number_num_fraud_transactions_per_account_number_1day']>=-2.031659691858139)&((X['account_number_sum_order_total_per_account_number_1day']<=3245.308425296317)|(X['account_number_sum_order_total_per_account_number_1day'].isna()))&(X['account_number_sum_order_total_per_account_number_30day']>421.71244916248713)&(X['is_billing_shipping_city_same']==False)",
 'RGDT_Rule45': "((X['account_number_avg_order_total_per_account_number_1day']<=2698.668671875)|(X['account_number_avg_order_total_per_account_number_1day'].isna()))&(X['account_number_num_fraud_transactions_per_account_number_90day']>=0.90234375)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=0.08203125)&(X['account_number_sum_order_total_per_account_number_90day']>-1956.7710937499999)",
 'RGDT_Rule162': "(X['account_number_num_fraud_transactions_per_account_number_90day']>=-2.23265606791473)&(X['num_order_items']>=0.5197764325298163)&(X['order_total']>622.0727832008924)",
 'RGDT_Rule272': "((X['account_number_avg_order_total_per_account_number_90day']<=2728.116280963719)|(X['account_number_avg_order_total_per_account_number_90day'].isna()))&(X['account_number_num_distinct_transaction_per_account_number_30day']>=-7.8742640763521194)&(X['account_number_num_fraud_transactions_per_account_number_30day']>=0.719363197684288)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=0.980542927980423)",
 'RGDT_Rule24': "(X['account_number_avg_order_total_per_account_number_90day']>-1078.024769393904)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=0.6513598936071503)&((X['account_number_num_order_items_per_account_number_lifetime']<=13.459071105062321)|(X['account_number_num_order_items_per_account_number_lifetime'].isna()))&((X['account_number_sum_order_total_per_account_number_7day']<=3794.9691484239675)|(X['account_number_sum_order_total_per_account_number_7day'].isna()))",
 'ComplicatedRule': "(X['num_order_items']>0.0)&(((X['order_total']>2025.98)&(X['is_existing_user']==True))|((X['order_total']>5.0)&(X['is_existing_user']==False))|(X['billing_city'].str.startswith('B', na=False)))"}
[15]:
ro.rule_descriptions.head()
[15]:
Precision Recall PercDataFlagged OptMetric Logic nConditions
Rule
HighFraudTxnPerAccountNum 0.991903 1.000000 0.027772 0.995935 (X['account_number_num_fraud_transactions_per_... 1
RGDT_Rule195 0.991870 0.995918 0.027659 0.993890 (X['account_number_num_fraud_transactions_per_... 4
RGDT_Rule263 0.968379 1.000000 0.028446 0.983936 (X['account_number_num_fraud_transactions_per_... 6
RGDT_Rule272 0.968254 0.995918 0.028334 0.981891 ((X['account_number_avg_order_total_per_accoun... 5
RGDT_Rule45 0.953125 0.995918 0.028783 0.974052 ((X['account_number_avg_order_total_per_accoun... 5
[16]:
ro.opt_rule_performances
[16]:
{'HighFraudTxnPerAccountNum': 0.9959349593495934,
 'RGDT_Rule195': 0.9938900203665988,
 'RGDT_Rule263': 0.9839357429718876,
 'RGDT_Rule272': 0.9818913480885312,
 'RGDT_Rule45': 0.9740518962075848,
 'RGDT_Rule193': 0.9740518962075848,
 'RGDT_Rule112': 0.9740518962075848,
 'RGDT_Rule137': 0.972972972972973,
 'RGDT_Rule24': 0.972,
 'RGDT_Rule153': 0.8376470588235295,
 'RGDT_Rule313': 0.4217252396166134,
 'RGDT_Rule35': 0.4217252396166134,
 'RGDT_Rule241': 0.40888888888888886,
 'RGDT_Rule162': 0.39378238341968913,
 'RGDT_Rule65': 0.33532934131736525,
 'ComplicatedRule': 0.20890688259109313,
 'RGDT_Rule256': 0.40909090909090906,
 'RGDT_Rule2': 0.12260536398467435,
 'RGDT_Rule81': 0.0632411067193676}

Apply rules to a separate dataset

Use the .transform() method to apply the optimised rules to a separate dataset:

[17]:
X_rules_test = ro.transform(
    X=X_test,
    y=y_test,
    sample_weight=None
)

Outputs

The .transform() method returns a dataframe giving the binary columns of the rules as applied to the given dataset.

A useful attribute created by running the .transform() method is:

  • rule_descriptions: A dataframe showing the logic of the optimised rules and their performance metrics as applied to the given dataset.

[18]:
ro.rule_descriptions.head()
[18]:
Precision Recall PercDataFlagged OptMetric Logic nConditions
Rule
HighFraudTxnPerAccountNum 0.991304 1.000000 0.026244 0.995633 (X['account_number_num_fraud_transactions_per_... 1
RGDT_Rule195 0.991228 0.991228 0.026016 0.991228 (X['account_number_num_fraud_transactions_per_... 4
RGDT_Rule263 0.966102 1.000000 0.026928 0.982759 (X['account_number_num_fraud_transactions_per_... 6
RGDT_Rule137 0.991071 0.973684 0.025559 0.982301 ((X['account_number_avg_order_total_per_accoun... 6
RGDT_Rule272 0.965812 0.991228 0.026700 0.978355 ((X['account_number_avg_order_total_per_accoun... 5
[19]:
X_rules_test.head()
[19]:
Rule HighFraudTxnPerAccountNum RGDT_Rule195 RGDT_Rule263 RGDT_Rule137 RGDT_Rule272 RGDT_Rule193 RGDT_Rule112 RGDT_Rule24 RGDT_Rule45 RGDT_Rule153 RGDT_Rule313 RGDT_Rule35 RGDT_Rule256 RGDT_Rule241 RGDT_Rule65 RGDT_Rule162 RGDT_Rule2 ComplicatedRule RGDT_Rule81
eid
533-3553708-0918604 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
455-3498977-3144749 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
585-6596459-3918216 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
685-6642742-5806657 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
956-2823525-9957253 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Plotting the performance uplift

We can visualise the performance uplift of the optimised rules using the .plot_performance_uplift() and .plot_performance_uplift_distribution() methods:

  • .plot_performance_uplift(): Generates a scatterplot showing the performance of each rule before and after optimisation.

  • .plot_performance_uplift_distribution(): Generates a boxplot showing the distribution of performance uplifts (original rules vs optimised rules).
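If you only need the numbers behind these plots, the uplift can be computed directly from two performance dictionaries. The sketch below assumes uplift means "optimised metric minus original metric" per rule, which may differ from how the plotting methods define it; the rule names and scores are made up.

```python
# Hypothetical performance dictionaries (rule name -> optimisation metric).
orig_performances = {'RuleA': 0.40, 'RuleB': 0.90, 'RuleC': 0.10}
opt_performances = {'RuleA': 0.65, 'RuleB': 0.95, 'RuleC': 0.10}

# Per-rule uplift, restricted to rules present in both dictionaries.
uplift = {
    name: round(opt_performances[name] - orig_performances[name], 4)
    for name in orig_performances
    if name in opt_performances
}

# Rules whose metric improved after optimisation.
improved = sorted(name for name, delta in uplift.items() if delta > 0)

print(uplift)
print(improved)
```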

On the training set

To visualise the uplift on the training set, we can use the class attributes orig_rule_performances and opt_rule_performances in the plotting methods, as these were generated as part of the optimisation process:

[20]:
ro.plot_performance_uplift(
    orig_rule_performances=ro.orig_rule_performances,
    opt_rule_performances=ro.opt_rule_performances,
    figsize=(10, 5)
)
[Figure: scatterplot of each rule's performance before and after optimisation, training set]
[21]:
ro.plot_performance_uplift_distribution(
    orig_rule_performances=ro.orig_rule_performances,
    opt_rule_performances=ro.opt_rule_performances,
    figsize=(3, 7)
)
[Figure: boxplot of the distribution of performance uplifts, training set]

On the test set

To visualise the uplift on the test set, we first need to generate the orig_rule_performances and opt_rule_performances arguments for the plotting methods, since the optimisation process only calculates them for the training set. To do this, we apply both the original rules and the optimised rules to the test set. Note that before applying the original rules, we need to remove those that have no optimisable conditions, that exclusively contain zero-variance features, or that use features missing from X_train:

[22]:
# Original rules
rules_to_exclude = (
    ro.rule_names_missing_features
    + ro.rule_names_no_opt_conditions
    + ro.rule_names_zero_var_features
)
rules.filter_rules(exclude=rules_to_exclude)
orig_sys_rule_strings = rules.as_rule_strings(as_numpy=False)
orig_ra = RuleApplier(
    rule_strings=orig_sys_rule_strings,
    opt_func=f1.fit
)
_ = orig_ra.transform(X=X_test, y=y_test)
orig_rule_performances_test = orig_ra.rule_descriptions['OptMetric']
[23]:
# Optimised rules
_ = ro.transform(X=X_test, y=y_test)
opt_rule_performances_test = ro.rule_descriptions['OptMetric']
[24]:
ro.plot_performance_uplift(
    orig_rule_performances=orig_rule_performances_test,
    opt_rule_performances=opt_rule_performances_test,
    figsize=(10, 5)
)
[Figure: scatterplot of each rule's performance before and after optimisation, test set]
[25]:
ro.plot_performance_uplift_distribution(
    orig_rule_performances=orig_rule_performances_test,
    opt_rule_performances=opt_rule_performances_test,
    figsize=(3, 7)
)
[Figure: boxplot of the distribution of performance uplifts, test set]