Direct Search Optimiser Example¶
This notebook contains an example of how the Direct Search Optimiser module can be used to optimise the thresholds of an existing set of rules, given a labelled dataset, using Direct Search-type Optimisation algorithms.
Requirements¶
To run, you’ll need the following:
A rule set stored in the standard Iguanas lambda expression format, along with the keyword arguments for each lambda expression (more information on how to do this later)
A labelled dataset containing the features present in the rule set.
Import packages¶
[1]:
from iguanas.rule_optimisation import DirectSearchOptimiser
from iguanas.rules import Rules
from iguanas.rule_application import RuleApplier
from iguanas.metrics.classification import FScore
import pandas as pd
from sklearn.model_selection import train_test_split
Read in data¶
Firstly, we need to read in the raw data containing the features and the fraud label:
[2]:
data = pd.read_csv(
'dummy_data/dummy_pipeline_output_data.csv',
index_col='eid'
)
Then we need to split out the dataset into the features (X) and the target column (y):
[3]:
fraud_column = 'sim_is_fraud'
X = data.drop(fraud_column, axis=1)
y = data[fraud_column]
Finally, we can split the features and target column into training and test sets:
[4]:
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.33,
random_state=0
)
Read in the rules¶
In this example, we’ll read in the rule conditions from a pickle file, where they are stored in the standard Iguanas string format. However, you can use any Iguanas-ready rule format - see the example notebook in the rules module.
[5]:
import pickle
with open('dummy_data/rule_strings.pkl', 'rb') as f:
rule_strings = pickle.load(f)
Now we can instantiate the Rules class with these rules:
[6]:
rules = Rules(rule_strings=rule_strings)
We now need to convert the rules into the standard Iguanas lambda expression format. This format allows new threshold values to be injected into the rule condition before it is evaluated, which is how the Direct Search Optimiser finds the optimal threshold values:
[7]:
rule_lambdas = rules.as_rule_lambdas(
as_numpy=False,
with_kwargs=True
)
By converting the rule conditions to the standard Iguanas lambda expression format, we also generate a dictionary which gives the keyword arguments to each lambda expression (this dictionary is saved as the class attribute lambda_kwargs). Using these keyword arguments as inputs to the lambda expressions will convert them into the standard Iguanas string format.
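As a rough illustration of the idea (a sketch, not the actual Iguanas internals), a rule in lambda expression format can be thought of as a function that takes threshold values as keyword arguments and returns the rule in string format. The feature name, keyword name and threshold values below are hypothetical:

```python
# Hypothetical single-condition rule: the threshold is left as a placeholder
# that is filled in from the keyword arguments.
rule_lambda = lambda **kwargs: "(X['order_total']>{order_total})".format(**kwargs)

# Keyword arguments holding the current threshold of the condition
# (the key name is an assumption made for this sketch).
lambda_kwargs = {'order_total': 100.0}

# Injecting the original threshold reproduces the rule in string format.
original_rule_string = rule_lambda(**lambda_kwargs)

# An optimiser can inject a candidate threshold in exactly the same way.
candidate_rule_string = rule_lambda(order_total=250.5)

print(original_rule_string)
print(candidate_rule_string)
```

Keeping the thresholds outside the rule logic means the optimiser only has to propose new numbers; the structure of the rule never changes.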
Optimise rules¶
Set up class parameters¶
Now we can set our class parameters for the Direct Search Optimiser.
Here we’re using the F1 score as the optimisation function (you can choose a different function from the metrics module or create your own - see the classification.ipynb example notebook in the metrics module).
Note that if you’re using the FScore, Precision or Recall score as the optimisation function, use the FScore, Precision or Recall classes in the metrics.classification module rather than the same functions from Sklearn’s metrics module, as the former are ~100 times faster on larger datasets.
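For reference, the quantity being optimised can be sketched in plain Python. This is the textbook F-beta definition for binary labels, written for illustration only; it is not the Iguanas FScore implementation, and the labels and predictions are made up:

```python
def f_beta(y_true, y_pred, beta=1.0):
    """F-beta score for binary labels (1 = fraud, 0 = non-fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    b2 = beta ** 2
    denominator = (1 + b2) * tp + b2 * fn + fp
    # Convention assumed here: return 0 when the score is undefined.
    return (1 + b2) * tp / denominator if denominator else 0.0


# Made-up labels and rule outputs: 2 true positives, 1 false positive,
# 1 false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

f1_score = f_beta(y_true, y_pred, beta=1)
print(round(f1_score, 4))
```

With beta=1 precision and recall are weighted equally; a beta above 1 favours recall and a beta below 1 favours precision.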
We’re also using the Nelder-Mead algorithm, which often benefits from setting the optional initial_simplex parameter. This is set through the options keyword argument. Here, we’ll generate the initial simplex of each rule using the create_initial_simplexes() method (but you can create your own if required).
Please see the class docstring for more information on each parameter.
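To make the initial_simplex idea concrete: for a rule with n optimisable thresholds, Nelder-Mead starts from n + 1 points (vertices), each a vector of n threshold values. The sketch below builds one such simplex from per-feature minimum and maximum values. The construction (start at the minimums, then move one threshold at a time to its maximum) and the feature ranges are assumptions for illustration; they are not taken from the implementation of create_initial_simplexes():

```python
def build_simplex(feature_mins, feature_maxs):
    """Return n + 1 vertices for a rule with n thresholds."""
    n = len(feature_mins)
    # Vertex 0: every threshold sits at the minimum of its feature.
    vertices = [list(feature_mins)]
    for i in range(n):
        vertex = list(feature_mins)
        # Vertex i + 1: only threshold i is moved to the feature maximum.
        vertex[i] = feature_maxs[i]
        vertices.append(vertex)
    return vertices


# Hypothetical rule with two thresholds and made-up feature ranges.
simplex = build_simplex(feature_mins=[0.0, 10.0], feature_maxs=[5.0, 2000.0])
for vertex in simplex:
    print(vertex)
```

SciPy's Nelder-Mead implementation accepts an array of this shape, (n + 1, n), through its initial_simplex option. Spreading the starting vertices across the observed range of each feature helps the search explore thresholds far from the original values.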
[8]:
f1 = FScore(beta=1)
[9]:
initial_simplexes = DirectSearchOptimiser.create_initial_simplexes(
X=X_train,
lambda_kwargs=rules.lambda_kwargs,
shape='Minimum-based'
)
[10]:
params = {
'rule_lambdas': rule_lambdas,
'lambda_kwargs': rules.lambda_kwargs,
'opt_func': f1.fit,
'method': 'Nelder-Mead',
'options': initial_simplexes,
'verbose': 1,
}
Instantiate class and run fit method¶
Once the parameters have been set, we can run the .fit() method to optimise the rules.
[11]:
ro = DirectSearchOptimiser(**params)
[12]:
X_rules = ro.fit(
X=X_train,
y=y_train,
sample_weight=None
)
--- Checking for rules with features that are missing in `X` ---
100%|██████████| 19/19 [00:00<00:00, 44922.08it/s]
--- Checking for rules that exclusively contain non-optimisable conditions ---
100%|██████████| 19/19 [00:00<00:00, 51747.91it/s]
--- Checking for rules that exclusively contain zero-variance features ---
100%|██████████| 19/19 [00:00<00:00, 162305.04it/s]
--- Optimising rules ---
100%|██████████| 19/19 [00:06<00:00, 2.90it/s]
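The optimisation step can be pictured with a toy example. The sketch below tunes a single threshold with a compass search, a very simple direct search method used here in place of Nelder-Mead to keep the code short. Like Nelder-Mead, it only compares metric values and never uses gradients. The data, the starting threshold and the helper names are all made up for illustration:

```python
def f1_for_threshold(threshold, values, labels):
    """F1 score of the rule `value >= threshold` against binary labels."""
    preds = [1 if v >= threshold else 0 for v in values]
    tp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, labels) if p == 0 and t == 1)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def compass_search(score, start, step=4.0, min_step=0.01):
    """Maximise `score` over one threshold by probing left and right."""
    best_x, best_score = start, score(start)
    while step > min_step:
        improved = False
        for candidate in (best_x - step, best_x + step):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best_x, best_score = candidate, candidate_score
                improved = True
        if not improved:
            step /= 2  # no better neighbour: shrink the step and retry
    return best_x, best_score


# Made-up feature values and fraud labels: fraud occurs for values >= 6.
values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

start_threshold = 2.0
start_score = f1_for_threshold(start_threshold, values, labels)
best_threshold, best_score = compass_search(
    lambda t: f1_for_threshold(t, values, labels), start=start_threshold
)
print(start_score, best_threshold, best_score)
```

Because the metric is a step function of the threshold (it only changes when a record crosses the threshold), gradient-based optimisers have nothing to work with, which is one reason direct search methods suit this problem.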
Outputs¶
The .fit() method returns a dataframe giving the binary columns of the optimised rules as applied to the training dataset. It also creates the following attributes:
rule_strings (dict): The optimised rules stored in the standard Iguanas string format (values) and their names (keys).
rule_descriptions (pd.DataFrame): A dataframe showing the logic of the rules and their performance metrics on the training dataset.
rule_names_missing_features (list): Names of rules which use features that are not present in the dataset (and therefore can’t be optimised or applied).
rule_names_no_opt_conditions (list): Names of rules which have no optimisable conditions (e.g. rules that only contain string-based conditions).
rule_names_zero_var_features (list): Names of rules which exclusively contain zero variance features (based on X), so cannot be optimised.
opt_rule_performances (dict): The optimisation metric (values) calculated for each optimised rule (keys) on the training set.
orig_rule_performances (dict): The optimisation metric (values) calculated for each original rule (keys) on the training set.
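The performance columns shown in rule_descriptions further down (Precision, Recall, PercDataFlagged) can be read as follows. This sketch uses the standard definitions, which are assumed here to match what Iguanas reports, applied to a made-up binary rule column and label column:

```python
def describe_rule(flags, labels):
    """Precision, recall and fraction of records flagged for one binary rule."""
    tp = sum(1 for f, t in zip(flags, labels) if f == 1 and t == 1)
    n_flagged = sum(flags)
    n_fraud = sum(labels)
    return {
        # Share of flagged records that are actually fraud.
        'Precision': tp / n_flagged if n_flagged else 0.0,
        # Share of fraud records that the rule flags.
        'Recall': tp / n_fraud if n_fraud else 0.0,
        # Share of all records that the rule flags.
        'PercDataFlagged': n_flagged / len(flags),
    }


# Made-up rule output (1 = rule triggered) and fraud labels (1 = fraud).
flags = [1, 1, 0, 0, 1, 0, 0, 0]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

metrics = describe_rule(flags, labels)
print(metrics)
```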
[13]:
X_rules.head()
[13]:
Rule | HighFraudTxnPerAccountNum | RGDT_Rule195 | RGDT_Rule263 | RGDT_Rule272 | RGDT_Rule45 | RGDT_Rule193 | RGDT_Rule112 | RGDT_Rule137 | RGDT_Rule24 | RGDT_Rule153 | RGDT_Rule313 | RGDT_Rule35 | RGDT_Rule241 | RGDT_Rule162 | RGDT_Rule65 | ComplicatedRule | RGDT_Rule256 | RGDT_Rule2 | RGDT_Rule81 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
eid | |||||||||||||||||||
503-0182982-0226911 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
516-2441570-6696110 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
475-5982298-4197297 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
935-3613661-7862154 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
936-1684183-4550418 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
[14]:
ro.rule_strings
[14]:
{'RGDT_Rule137': "((X['account_number_avg_order_total_per_account_number_7day']<=1319.4598177719115)|(X['account_number_avg_order_total_per_account_number_7day'].isna()))&(X['account_number_num_fraud_transactions_per_account_number_1day']>=0.05990791320800781)&(X['account_number_num_fraud_transactions_per_account_number_90day']>=-4.349979400634766)&((X['account_number_sum_order_total_per_account_number_1day']<=1828.5315874099724)|(X['account_number_sum_order_total_per_account_number_1day'].isna()))",
'RGDT_Rule81': "(X['account_number_avg_order_total_per_account_number_1day']>617.69)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=1.0)&((X['account_number_num_order_items_per_account_number_30day']<=3.0)|(X['account_number_num_order_items_per_account_number_30day'].isna()))&((X['account_number_sum_order_total_per_account_number_1day']<=627.51999)|(X['account_number_sum_order_total_per_account_number_1day'].isna()))",
'HighFraudTxnPerAccountNum': "(X['account_number_num_fraud_transactions_per_account_number_1day']>=1.0)",
'RGDT_Rule256': "(X['account_number_num_fraud_transactions_per_account_number_1day']>=1.0)&((X['account_number_sum_order_total_per_account_number_90day']<=950.89999)|(X['account_number_sum_order_total_per_account_number_90day'].isna()))&(X['is_billing_shipping_city_same']==False)",
'RGDT_Rule35': "(X['account_number_num_fraud_transactions_per_account_number_7day']>=0.5185185185185187)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=-7.777777777777779)&((X['account_number_num_order_items_per_account_number_lifetime']<=27.777777777777782)|(X['account_number_num_order_items_per_account_number_lifetime'].isna()))&(X['is_existing_user']==True)",
'RGDT_Rule193': "(X['account_number_num_distinct_transaction_per_account_number_30day']>=-2.515625)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=0.8203125)",
'RGDT_Rule241': "(X['account_number_avg_order_total_per_account_number_90day']>322.34253616608135)&(X['account_number_num_fraud_transactions_per_account_number_1day']>=-1.3512841817262844)&((X['account_number_sum_order_total_per_account_number_1day']<=2569.17933672239)|(X['account_number_sum_order_total_per_account_number_1day'].isna()))&(X['account_number_sum_order_total_per_account_number_30day']>570.010761076064)",
'RGDT_Rule263': "(X['account_number_num_fraud_transactions_per_account_number_30day']>=0.1640625)&((X['account_number_sum_order_total_per_account_number_1day']<=4216.703906250001)|(X['account_number_sum_order_total_per_account_number_1day'].isna()))&(X['account_number_sum_order_total_per_account_number_7day']>-4464.745312499999)&((X['account_number_sum_order_total_per_account_number_90day']<=6697.117968749999)|(X['account_number_sum_order_total_per_account_number_90day'].isna()))",
'RGDT_Rule313': "(X['account_number_num_fraud_transactions_per_account_number_1day']>=0.9541441152207949)&(X['account_number_sum_order_total_per_account_number_30day']>-18.92841299541044)&(X['account_number_sum_order_total_per_account_number_90day']>-854.424793523272)&(X['is_existing_user']==True)",
'RGDT_Rule195': "(X['account_number_num_fraud_transactions_per_account_number_1day']>=0.45969962351955473)&(X['account_number_num_fraud_transactions_per_account_number_30day']>=0.697719914605841)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=0.7697551276069134)&(X['account_number_num_order_items_per_account_number_7day']>=-0.7995512732304633)",
'RGDT_Rule153': "(X['account_number_avg_order_total_per_account_number_90day']>-369.66274632096327)&(X['account_number_num_fraud_transactions_per_account_number_1day']>=0.7715035974979401)&(X['account_number_num_order_items_per_account_number_lifetime']>=1.7915962636470795)&((X['order_total']<=1479.2942406350376)|(X['order_total'].isna()))",
'RGDT_Rule112': "(X['account_number_num_distinct_transaction_per_account_number_90day']>=-1.7096405029296875)&(X['account_number_num_fraud_transactions_per_account_number_30day']>=-3.7865753173828125)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=0.07733154296875)&((X['account_number_num_order_items_per_account_number_90day']<=15.665817260742188)|(X['account_number_num_order_items_per_account_number_90day'].isna()))",
'RGDT_Rule2': "((X['account_number_num_distinct_transaction_per_account_number_1day']<=2.0)|(X['account_number_num_distinct_transaction_per_account_number_1day'].isna()))&(X['account_number_num_fraud_transactions_per_account_number_7day']>=1.0)&((X['account_number_num_fraud_transactions_per_account_number_lifetime']<=1.0)|(X['account_number_num_fraud_transactions_per_account_number_lifetime'].isna()))&(X['is_existing_user']==True)",
'RGDT_Rule65': "(X['account_number_num_fraud_transactions_per_account_number_1day']>=-2.031659691858139)&((X['account_number_sum_order_total_per_account_number_1day']<=3245.308425296317)|(X['account_number_sum_order_total_per_account_number_1day'].isna()))&(X['account_number_sum_order_total_per_account_number_30day']>421.71244916248713)&(X['is_billing_shipping_city_same']==False)",
'RGDT_Rule45': "((X['account_number_avg_order_total_per_account_number_1day']<=2698.668671875)|(X['account_number_avg_order_total_per_account_number_1day'].isna()))&(X['account_number_num_fraud_transactions_per_account_number_90day']>=0.90234375)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=0.08203125)&(X['account_number_sum_order_total_per_account_number_90day']>-1956.7710937499999)",
'RGDT_Rule162': "(X['account_number_num_fraud_transactions_per_account_number_90day']>=-2.23265606791473)&(X['num_order_items']>=0.5197764325298163)&(X['order_total']>622.0727832008924)",
'RGDT_Rule272': "((X['account_number_avg_order_total_per_account_number_90day']<=2728.116280963719)|(X['account_number_avg_order_total_per_account_number_90day'].isna()))&(X['account_number_num_distinct_transaction_per_account_number_30day']>=-7.8742640763521194)&(X['account_number_num_fraud_transactions_per_account_number_30day']>=0.719363197684288)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=0.980542927980423)",
'RGDT_Rule24': "(X['account_number_avg_order_total_per_account_number_90day']>-1078.024769393904)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=0.6513598936071503)&((X['account_number_num_order_items_per_account_number_lifetime']<=13.459071105062321)|(X['account_number_num_order_items_per_account_number_lifetime'].isna()))&((X['account_number_sum_order_total_per_account_number_7day']<=3794.9691484239675)|(X['account_number_sum_order_total_per_account_number_7day'].isna()))",
'ComplicatedRule': "(X['num_order_items']>0.0)&(((X['order_total']>2025.98)&(X['is_existing_user']==True))|((X['order_total']>5.0)&(X['is_existing_user']==False))|(X['billing_city'].str.startswith('B', na=False)))"}
[15]:
ro.rule_descriptions.head()
[15]:
Rule | Precision | Recall | PercDataFlagged | OptMetric | Logic | nConditions |
---|---|---|---|---|---|---|
HighFraudTxnPerAccountNum | 0.991903 | 1.000000 | 0.027772 | 0.995935 | (X['account_number_num_fraud_transactions_per_... | 1 |
RGDT_Rule195 | 0.991870 | 0.995918 | 0.027659 | 0.993890 | (X['account_number_num_fraud_transactions_per_... | 4 |
RGDT_Rule263 | 0.968379 | 1.000000 | 0.028446 | 0.983936 | (X['account_number_num_fraud_transactions_per_... | 6 |
RGDT_Rule272 | 0.968254 | 0.995918 | 0.028334 | 0.981891 | ((X['account_number_avg_order_total_per_accoun... | 5 |
RGDT_Rule45 | 0.953125 | 0.995918 | 0.028783 | 0.974052 | ((X['account_number_avg_order_total_per_accoun... | 5 |
[16]:
ro.opt_rule_performances
[16]:
{'HighFraudTxnPerAccountNum': 0.9959349593495934,
'RGDT_Rule195': 0.9938900203665988,
'RGDT_Rule263': 0.9839357429718876,
'RGDT_Rule272': 0.9818913480885312,
'RGDT_Rule45': 0.9740518962075848,
'RGDT_Rule193': 0.9740518962075848,
'RGDT_Rule112': 0.9740518962075848,
'RGDT_Rule137': 0.972972972972973,
'RGDT_Rule24': 0.972,
'RGDT_Rule153': 0.8376470588235295,
'RGDT_Rule313': 0.4217252396166134,
'RGDT_Rule35': 0.4217252396166134,
'RGDT_Rule241': 0.40888888888888886,
'RGDT_Rule162': 0.39378238341968913,
'RGDT_Rule65': 0.33532934131736525,
'ComplicatedRule': 0.20890688259109313,
'RGDT_Rule256': 0.40909090909090906,
'RGDT_Rule2': 0.12260536398467435,
'RGDT_Rule81': 0.0632411067193676}
Apply rules to a separate dataset¶
Use the .transform() method to apply the optimised rules to a separate dataset:
[17]:
X_rules_test = ro.transform(
X=X_test,
y=y_test,
sample_weight=None
)
Outputs¶
The .transform() method returns a dataframe giving the binary columns of the rules as applied to the given dataset.
A useful attribute created by running the .transform() method is:
rule_descriptions: A dataframe showing the logic of the optimised rules and their performance metrics as applied to the given dataset.
[18]:
ro.rule_descriptions.head()
[18]:
Rule | Precision | Recall | PercDataFlagged | OptMetric | Logic | nConditions |
---|---|---|---|---|---|---|
HighFraudTxnPerAccountNum | 0.991304 | 1.000000 | 0.026244 | 0.995633 | (X['account_number_num_fraud_transactions_per_... | 1 |
RGDT_Rule195 | 0.991228 | 0.991228 | 0.026016 | 0.991228 | (X['account_number_num_fraud_transactions_per_... | 4 |
RGDT_Rule263 | 0.966102 | 1.000000 | 0.026928 | 0.982759 | (X['account_number_num_fraud_transactions_per_... | 6 |
RGDT_Rule137 | 0.991071 | 0.973684 | 0.025559 | 0.982301 | ((X['account_number_avg_order_total_per_accoun... | 6 |
RGDT_Rule272 | 0.965812 | 0.991228 | 0.026700 | 0.978355 | ((X['account_number_avg_order_total_per_accoun... | 5 |
[19]:
X_rules_test.head()
[19]:
Rule | HighFraudTxnPerAccountNum | RGDT_Rule195 | RGDT_Rule263 | RGDT_Rule137 | RGDT_Rule272 | RGDT_Rule193 | RGDT_Rule112 | RGDT_Rule24 | RGDT_Rule45 | RGDT_Rule153 | RGDT_Rule313 | RGDT_Rule35 | RGDT_Rule256 | RGDT_Rule241 | RGDT_Rule65 | RGDT_Rule162 | RGDT_Rule2 | ComplicatedRule | RGDT_Rule81 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
eid | |||||||||||||||||||
533-3553708-0918604 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
455-3498977-3144749 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
585-6596459-3918216 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
685-6642742-5806657 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
956-2823525-9957253 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Plotting the performance uplift¶
We can visualise the performance uplift of the optimised rules using the .plot_performance_uplift() and .plot_performance_uplift_distribution() methods:
.plot_performance_uplift(): Generates a scatterplot showing the performance of each rule before and after optimisation.
.plot_performance_uplift_distribution(): Generates a boxplot showing the distribution of performance uplifts (original rules vs optimised rules).
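Both plots are built from two dictionaries that map rule names to metric values. As a minimal sketch of the underlying comparison, the per-rule uplift can be computed directly from such dictionaries. The rule names and values below are made up, and taking the absolute difference is an assumption for illustration; the plotting methods may present the uplift differently:

```python
# Hypothetical metric values before and after optimisation.
orig_rule_performances = {'RuleA': 0.40, 'RuleB': 0.75, 'RuleC': 0.10}
opt_rule_performances = {'RuleA': 0.65, 'RuleB': 0.75, 'RuleC': 0.30}

# Absolute uplift per rule: optimised metric minus original metric.
uplift = {
    name: opt_rule_performances[name] - orig_rule_performances[name]
    for name in orig_rule_performances
}

# Rule that gained the most from optimisation.
best_rule = max(uplift, key=uplift.get)

for name, value in uplift.items():
    print(name, round(value, 2))
print('Largest uplift:', best_rule)
```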
On the training set¶
To visualise the uplift on the training set, we can use the class attributes orig_rule_performances and opt_rule_performances in the plotting methods, as these were generated as part of the optimisation process:
[20]:
ro.plot_performance_uplift(
orig_rule_performances=ro.orig_rule_performances,
opt_rule_performances=ro.opt_rule_performances,
figsize=(10, 5)
)
[21]:
ro.plot_performance_uplift_distribution(
orig_rule_performances=ro.orig_rule_performances,
opt_rule_performances=ro.opt_rule_performances,
figsize=(3, 7)
)
On the test set¶
To visualise the uplift on the test set, we first need to generate the orig_rule_performances and opt_rule_performances parameters used in the plotting methods as these aren’t created as part of the optimisation process. To do this, we need to apply both the original rules and the optimised rules to the test set. Note that before we apply the original rules, we need to remove those that either have no optimisable conditions, have zero variance features or have features that are missing in X_train:
[22]:
# Original rules
rules_to_exclude = ro.rule_names_missing_features + ro.rule_names_no_opt_conditions + ro.rule_names_zero_var_features
rules.filter_rules(exclude=rules_to_exclude)
orig_sys_rule_strings = rules.as_rule_strings(as_numpy=False)
orig_ra = RuleApplier(
rule_strings=orig_sys_rule_strings,
opt_func=f1.fit
)
_ = orig_ra.transform(X=X_test, y=y_test)
orig_rule_performances_test = orig_ra.rule_descriptions['OptMetric']
[23]:
# Optimised rules
_ = ro.transform(X=X_test, y=y_test)
opt_rule_performances_test = ro.rule_descriptions['OptMetric']
[24]:
ro.plot_performance_uplift(
orig_rule_performances=orig_rule_performances_test,
opt_rule_performances=opt_rule_performances_test,
figsize=(10, 5)
)
[25]:
ro.plot_performance_uplift_distribution(
orig_rule_performances=orig_rule_performances_test,
opt_rule_performances=opt_rule_performances_test,
figsize=(3, 7)
)