Direct Search Optimiser (unlabelled) Example

This notebook contains an example of how the Direct Search Optimiser module can be used to optimise the thresholds of an existing set of rules, given an unlabelled dataset, using Direct Search-type Optimisation algorithms.

Requirements

To run, you’ll need the following:

  • A rule set stored in the standard Iguanas lambda expression format, along with the keyword arguments for each lambda expression (how to generate both is shown later in this notebook)

  • An unlabelled dataset containing the features present in the rule set.


Import packages

[1]:
from iguanas.rule_optimisation import DirectSearchOptimiser
from iguanas.rules import Rules
from iguanas.rule_application import RuleApplier
from iguanas.metrics.unsupervised import AlertsPerDay

import pandas as pd
from sklearn.model_selection import train_test_split

Read in data

Firstly, we need to read in the raw data containing the features:

[ ]:
data = pd.read_csv(
    'dummy_data/dummy_pipeline_output_data.csv',
    index_col='eid'
)

Then we can split the dataset into training and test sets:

[ ]:
X_train, X_test = train_test_split(
    data,
    test_size=0.33,
    random_state=0
)

Read in the rules

In this example, we’ll read in the rule conditions from a pickle file, where they are stored in the standard Iguanas string format. However, you can use any Iguanas-ready rule format - see the example notebook in the rules module.

[ ]:
import pickle
with open('dummy_data/rule_strings.pkl', 'rb') as f:
    rule_strings = pickle.load(f)

Now we can instantiate the Rules class with these rules:

[ ]:
rules = Rules(rule_strings=rule_strings)

We now need to convert the rules into the standard Iguanas lambda expression format. This format allows new threshold values to be injected into the rule condition before being evaluated - this is how the Direct Search Optimiser finds the optimal threshold values:

[ ]:
rule_lambdas = rules.as_rule_lambdas(
    as_numpy=False,
    with_kwargs=True
)

By converting the rule conditions to the standard Iguanas lambda expression format, we also generate a dictionary which gives the keyword arguments to each lambda expression (this dictionary is saved as the class attribute lambda_kwargs). Using these keyword arguments as inputs to the lambda expressions will convert them into the standard Iguanas string format.
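
As a rough illustration of the idea (not the exact objects Iguanas produces), a rule lambda can be thought of as a function that takes threshold values as keyword arguments and returns the rule in string format. The rule, feature name and keyword name below are hypothetical:

```python
# Hypothetical sketch of the lambda expression format; the real Iguanas
# objects (and their keyword names) may differ.
rule_lambda = lambda **kwargs: "(X['order_total']>{order_total})".format(**kwargs)

# Keyword arguments holding the rule's current threshold value.
lambda_kwargs = {'order_total': 100.0}

# Injecting the original threshold reproduces the original rule string.
original_rule = rule_lambda(**lambda_kwargs)

# Injecting a different threshold gives a new candidate rule to evaluate.
candidate_rule = rule_lambda(order_total=250.5)

print(original_rule)
print(candidate_rule)
```

An optimiser only needs to vary the keyword values and re-evaluate the resulting rule, which is why the thresholds are kept separate from the rule logic.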


Optimise rules

Set up class parameters

Now we can set our class parameters for the Direct Search Optimiser. Here we’re using the .fit() method from the AlertsPerDay class, which calculates the negative squared difference between the daily number of records a rule flags and the targeted daily number of records flagged. This means that when the Direct Search Optimiser maximises this metric, it minimises the difference between the actual and targeted numbers of records flagged per day.

See the metrics.unsupervised module for more information on additional optimisation functions that can be used on unlabelled datasets.
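
The alerts-per-day metric described above can be sketched in a few lines. This is a minimal stand-in written from the description in this notebook, not the AlertsPerDay implementation itself; the function name and the example counts are assumptions:

```python
def alerts_per_day_score(n_flagged, no_of_days_in_file, n_alerts_expected_per_day):
    """Negative squared difference between actual and expected daily alerts."""
    actual_per_day = n_flagged / no_of_days_in_file
    return -((actual_per_day - n_alerts_expected_per_day) ** 2)

# A rule flagging 100 records over 10 days hits a target of 10 alerts per day.
on_target = alerts_per_day_score(100, 10, 10)

# A rule flagging 50 records over 10 days averages 5 alerts per day.
too_quiet = alerts_per_day_score(50, 10, 10)

print(on_target, too_quiet)
```

Under this definition the best possible score is zero, and the score becomes more negative the further a rule’s daily alert volume is from the target, in either direction.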

We’re also using the Nelder-Mead algorithm, which often benefits from setting the optional initial_simplex parameter. This is set through the options keyword argument. Here, we’ll generate the initial simplex of each rule using the create_initial_simplexes() method (but you can create your own if required).

Please see the class docstring for more information on each parameter.
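
For context, the method name and the initial_simplex option match those of SciPy’s scipy.optimize.minimize function. Assuming the options are passed through in that way (an assumption; check the class docstring), the sketch below shows what an initial simplex looks like for a toy objective with two thresholds. The objective and the simplex values are made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def toy_metric(thresholds):
    """Toy metric to maximise; its best value (0) is at thresholds (3, -1)."""
    a, b = thresholds
    return -((a - 3.0) ** 2 + (b + 1.0) ** 2)

def objective(thresholds):
    """SciPy minimises, so negate the metric we want to maximise."""
    return -toy_metric(thresholds)

# For N parameters, the simplex has N + 1 vertices, each of length N.
initial_simplex = np.array([
    [0.0, 0.0],
    [5.0, 0.0],
    [0.0, 5.0],
])

result = minimize(
    objective,
    x0=initial_simplex[0],  # overridden by initial_simplex when it is given
    method='Nelder-Mead',
    options={'initial_simplex': initial_simplex},
)
best_thresholds = result.x
print(best_thresholds)
```

Choosing vertices that span a sensible range of each threshold gives the algorithm a better starting region than the small default simplex built around a single starting point.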

[ ]:
apd = AlertsPerDay(
    n_alerts_expected_per_day=10,
    no_of_days_in_file=10
)
[ ]:
initial_simplexes = DirectSearchOptimiser.create_initial_simplexes(
    X=X_train,
    lambda_kwargs=rules.lambda_kwargs,
    shape='Minimum-based'
)
[ ]:
params = {
    'rule_lambdas': rule_lambdas,
    'lambda_kwargs': rules.lambda_kwargs,
    'opt_func': apd.fit,
    'method': 'Nelder-Mead',
    'options': initial_simplexes,
    'verbose': 1,
}

Instantiate class and run fit method

Once the parameters have been set, we can run the .fit() method to optimise the rules.

[ ]:
ro = DirectSearchOptimiser(**params)
X_rules = ro.fit(X=X_train)
--- Checking for rules with features that are missing in `X` ---
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 104445.32it/s]
--- Checking for rules that exclusively contain non-optimisable conditions ---
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 187952.30it/s]
--- Checking for rules that exclusively contain zero-variance features ---
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 188396.63it/s]
--- Optimising rules ---
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:04<00:00,  4.50it/s]
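
To make the optimisation step more concrete, here is a self-contained sketch of a direct search over a single threshold. It uses compass search, a much simpler direct search method than Nelder-Mead, and it is not the Iguanas implementation; the data, rule, target and function names are all hypothetical:

```python
# Synthetic feature values: 100 records spread over 10 days (hypothetical data).
values = list(range(1, 101))
NO_OF_DAYS = 10
TARGET_ALERTS_PER_DAY = 2

def score(threshold):
    """Alerts-per-day style metric for the rule `value > threshold`."""
    n_flagged = sum(v > threshold for v in values)
    return -((n_flagged / NO_OF_DAYS - TARGET_ALERTS_PER_DAY) ** 2)

def compass_search(func, x0, step, tol=1e-3):
    """Maximise `func` by probing x +/- step, halving the step when stuck."""
    x, best = x0, func(x0)
    while step > tol:
        # Evaluate both neighbours and keep the better of the two.
        cand_score, cand_x = max((func(c), c) for c in (x + step, x - step))
        if cand_score > best:
            x, best = cand_x, cand_score
        else:
            step /= 2
    return x, best

threshold, best_score = compass_search(score, x0=50, step=16)
n_flagged = sum(v > threshold for v in values)
print(threshold, n_flagged, best_score)
```

Like Nelder-Mead, this only ever compares metric values at candidate thresholds and needs no gradients, which suits a metric that changes in steps as records cross a threshold.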

Outputs

The .fit() method returns a dataframe giving the binary columns of the optimised rules as applied to the training dataset. It also creates the following attributes:

  • rule_strings (dict): The optimised rules stored in the standard Iguanas string format (values) and their names (keys).

  • rule_descriptions (pd.DataFrame): A dataframe showing the logic of the rules and their performance metrics on the training dataset.

  • rule_names_missing_features (list): Names of rules which use features that are not present in the dataset (and therefore can’t be optimised or applied).

  • rule_names_no_opt_conditions (list): Names of rules which have no optimisable conditions (e.g. rules that only contain string-based conditions).

  • rule_names_zero_var_features (list): Names of rules which exclusively contain zero variance features (based on X), so cannot be optimised.

  • opt_rule_performances (dict): The optimisation metric (values) calculated for each optimised rule (keys) on the training set.

  • orig_rule_performances (dict): The optimisation metric (values) calculated for each original rule (keys) on the training set.

[ ]:
X_rules.head()
Rule ComplicatedRule RGDT_Rule24 RGDT_Rule162 RGDT_Rule45 RGDT_Rule241 RGDT_Rule65 RGDT_Rule153 RGDT_Rule81 RGDT_Rule313 RGDT_Rule195 RGDT_Rule263 HighFraudTxnPerAccountNum RGDT_Rule193 RGDT_Rule112 RGDT_Rule272 RGDT_Rule2 RGDT_Rule35 RGDT_Rule256 RGDT_Rule137
eid
503-0182982-0226911 0 0 1 1 1 0 0 1 1 0 0 0 0 0 0 1 1 0 0
516-2441570-6696110 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
475-5982298-4197297 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
935-3613661-7862154 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
936-1684183-4550418 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[ ]:
ro.rule_strings
{'RGDT_Rule137': "((X['account_number_avg_order_total_per_account_number_7day']<=317.69)|(X['account_number_avg_order_total_per_account_number_7day'].isna()))&(X['account_number_num_fraud_transactions_per_account_number_1day']>=1.0)&(X['account_number_num_fraud_transactions_per_account_number_90day']>=1.0)&((X['account_number_sum_order_total_per_account_number_1day']<=782.375)|(X['account_number_sum_order_total_per_account_number_1day'].isna()))",
 'RGDT_Rule81': "(X['account_number_avg_order_total_per_account_number_1day']>738.732981198933)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=-4.163015047088265)&((X['account_number_num_order_items_per_account_number_30day']<=6.128187007270753)|(X['account_number_num_order_items_per_account_number_30day'].isna()))&((X['account_number_sum_order_total_per_account_number_1day']<=1373.4543524413368)|(X['account_number_sum_order_total_per_account_number_1day'].isna()))",
 'HighFraudTxnPerAccountNum': "(X['account_number_num_fraud_transactions_per_account_number_1day']>=1.75)",
 'RGDT_Rule256': "(X['account_number_num_fraud_transactions_per_account_number_1day']>=1.0)&((X['account_number_sum_order_total_per_account_number_90day']<=950.89999)|(X['account_number_sum_order_total_per_account_number_90day'].isna()))&(X['is_billing_shipping_city_same']==False)",
 'RGDT_Rule35': "(X['account_number_num_fraud_transactions_per_account_number_7day']>=1.0)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=1.0)&((X['account_number_num_order_items_per_account_number_lifetime']<=5.0)|(X['account_number_num_order_items_per_account_number_lifetime'].isna()))&(X['is_existing_user']==True)",
 'RGDT_Rule193': "(X['account_number_num_distinct_transaction_per_account_number_30day']>=2.0)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=2.0)",
 'RGDT_Rule241': "(X['account_number_avg_order_total_per_account_number_90day']>372.08092163085934)&(X['account_number_num_fraud_transactions_per_account_number_1day']>=0.48236083984375)&((X['account_number_sum_order_total_per_account_number_1day']<=1675.1407470703125)|(X['account_number_sum_order_total_per_account_number_1day'].isna()))&(X['account_number_sum_order_total_per_account_number_30day']>279.26189575195326)",
 'RGDT_Rule263': "(X['account_number_num_fraud_transactions_per_account_number_30day']>=1.8207588855320864)&((X['account_number_sum_order_total_per_account_number_1day']<=3132.383938328854)|(X['account_number_sum_order_total_per_account_number_1day'].isna()))&(X['account_number_sum_order_total_per_account_number_7day']>-1254.7089763169802)&((X['account_number_sum_order_total_per_account_number_90day']<=3722.8056488471575)|(X['account_number_sum_order_total_per_account_number_90day'].isna()))",
 'RGDT_Rule313': "(X['account_number_num_fraud_transactions_per_account_number_1day']>=-0.8894707830305035)&(X['account_number_sum_order_total_per_account_number_30day']>112.08837892750427)&(X['account_number_sum_order_total_per_account_number_90day']>724.723801522922)&(X['is_existing_user']==True)",
 'RGDT_Rule195': "(X['account_number_num_fraud_transactions_per_account_number_1day']>=0.2994370311498642)&(X['account_number_num_fraud_transactions_per_account_number_30day']>=0.05196182429790497)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=1.2668327540159225)&(X['account_number_num_order_items_per_account_number_7day']>=-6.78845040500164)",
 'RGDT_Rule153': "(X['account_number_avg_order_total_per_account_number_90day']>189.93562499999996)&(X['account_number_num_fraud_transactions_per_account_number_1day']>=0.65625)&(X['account_number_num_order_items_per_account_number_lifetime']>=1.40625)&((X['order_total']<=762.8675000000001)|(X['order_total'].isna()))",
 'RGDT_Rule112': "(X['account_number_num_distinct_transaction_per_account_number_90day']>=2.0)&(X['account_number_num_fraud_transactions_per_account_number_30day']>=1.0)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=2.0)&((X['account_number_num_order_items_per_account_number_90day']<=3.0)|(X['account_number_num_order_items_per_account_number_90day'].isna()))",
 'RGDT_Rule2': "((X['account_number_num_distinct_transaction_per_account_number_1day']<=2.0)|(X['account_number_num_distinct_transaction_per_account_number_1day'].isna()))&(X['account_number_num_fraud_transactions_per_account_number_7day']>=1.0)&((X['account_number_num_fraud_transactions_per_account_number_lifetime']<=1.0)|(X['account_number_num_fraud_transactions_per_account_number_lifetime'].isna()))&(X['is_existing_user']==True)",
 'RGDT_Rule65': "(X['account_number_num_fraud_transactions_per_account_number_1day']>=-0.49147483710562434)&((X['account_number_sum_order_total_per_account_number_1day']<=774.8033227237656)|(X['account_number_sum_order_total_per_account_number_1day'].isna()))&(X['account_number_sum_order_total_per_account_number_30day']>269.9496511702673)&(X['is_billing_shipping_city_same']==False)",
 'RGDT_Rule45': "((X['account_number_avg_order_total_per_account_number_1day']<=1047.655216027734)|(X['account_number_avg_order_total_per_account_number_1day'].isna()))&(X['account_number_num_fraud_transactions_per_account_number_90day']>=0.8532665608288262)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=0.7877466345736625)&(X['account_number_sum_order_total_per_account_number_90day']>507.2614665028917)",
 'RGDT_Rule162': "(X['account_number_num_fraud_transactions_per_account_number_90day']>=0.5539376096811541)&(X['num_order_items']>=-0.520029032429635)&(X['order_total']>419.57446715300784)",
 'RGDT_Rule272': "((X['account_number_avg_order_total_per_account_number_90day']<=5.5)|(X['account_number_avg_order_total_per_account_number_90day'].isna()))&(X['account_number_num_distinct_transaction_per_account_number_30day']>=2.0)&(X['account_number_num_fraud_transactions_per_account_number_30day']>=1.0)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=2.0)",
 'RGDT_Rule24': "(X['account_number_avg_order_total_per_account_number_90day']>-99.96428798675537)&(X['account_number_num_fraud_transactions_per_account_number_lifetime']>=0.6391334533691406)&((X['account_number_num_order_items_per_account_number_lifetime']<=13.066034317016602)|(X['account_number_num_order_items_per_account_number_lifetime'].isna()))&((X['account_number_sum_order_total_per_account_number_7day']<=210.42206211090092)|(X['account_number_sum_order_total_per_account_number_7day'].isna()))",
 'ComplicatedRule': "(X['num_order_items']>2.7158826779454355)&(((X['order_total']>589.7197587067517)&(X['is_existing_user']==True))|((X['order_total']>403.7855943263223)&(X['is_existing_user']==False))|(X['billing_city'].str.startswith('B', na=False)))"}
[ ]:
ro.rule_descriptions.head()
Rule             PercDataFlagged  OptMetric  Logic                                              nConditions
ComplicatedRule  0.011244         -0.0       (X['num_order_items']>2.7158826779454355)&(((X...  6
RGDT_Rule24      0.011244         -0.0       (X['account_number_avg_order_total_per_account...  6
RGDT_Rule162     0.011244         -0.0       (X['account_number_num_fraud_transactions_per_...  3
RGDT_Rule45      0.011244         -0.0       ((X['account_number_avg_order_total_per_accoun...  5
RGDT_Rule241     0.011244         -0.0       (X['account_number_avg_order_total_per_account...  5
[ ]:
ro.opt_rule_performances
{'ComplicatedRule': -0.0,
 'RGDT_Rule24': -0.0,
 'RGDT_Rule162': -0.0,
 'RGDT_Rule45': -0.0,
 'RGDT_Rule241': -0.0,
 'RGDT_Rule65': -0.009999999999999929,
 'RGDT_Rule153': -0.009999999999999929,
 'RGDT_Rule81': -0.039999999999999716,
 'RGDT_Rule313': -0.6400000000000011,
 'RGDT_Rule195': -25.0,
 'RGDT_Rule263': -30.25,
 'HighFraudTxnPerAccountNum': -57.76,
 'RGDT_Rule193': -25.0,
 'RGDT_Rule112': -67.24,
 'RGDT_Rule272': -84.63999999999999,
 'RGDT_Rule2': -70.56,
 'RGDT_Rule35': -24.010000000000005,
 'RGDT_Rule256': -13.690000000000001,
 'RGDT_Rule137': -6.759999999999998}

Apply rules to a separate dataset

Use the .transform() method to apply the optimised rules to a separate dataset:

[ ]:
X_rules_applied = ro.transform(X=X_test)

Outputs

The .transform() method returns a dataframe giving the binary columns of the rules as applied to the given dataset.

A useful attribute created by running the .transform() method is:

  • rule_descriptions: A dataframe showing the logic of the optimised rules and their performance metrics as applied to the given dataset.
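
As a rough picture of what applying rules produces, the sketch below evaluates two rule strings, written in the style of the rule strings shown earlier, against a tiny dataframe. It is an illustration only and not necessarily how the .transform() method works internally; the data and rule names are hypothetical, and treating PercDataFlagged as the mean of each binary column is an assumption:

```python
import pandas as pd

# Hypothetical dataset with one missing value.
X = pd.DataFrame(
    {
        'order_total': [50.0, 900.0, float('nan')],
        'num_order_items': [1, 5, 3],
    },
    index=pd.Index(['a', 'b', 'c'], name='eid'),
)

# Hypothetical rules in string format.
rule_strings = {
    'ManyItems': "(X['num_order_items']>2)",
    'LowOrMissingTotal': "((X['order_total']<=762.8675)|(X['order_total'].isna()))",
}

# Evaluate each rule string against X to get one binary column per rule.
X_rules = pd.DataFrame(
    {name: eval(rule, {'X': X}).astype(int) for name, rule in rule_strings.items()}
)

# Share of records flagged by each rule.
perc_data_flagged = X_rules.mean()
print(X_rules)
print(perc_data_flagged)
```

Note how the .isna() condition decides whether records with missing values are flagged; without it, a comparison against a missing value evaluates to False.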

[ ]:
ro.rule_descriptions.head()
Rule          PercDataFlagged  OptMetric  Logic                                              nConditions
RGDT_Rule137  0.014833         -12.25     ((X['account_number_avg_order_total_per_accoun...  6
RGDT_Rule313  0.012551         -20.25     (X['account_number_num_fraud_transactions_per_...  4
RGDT_Rule153  0.012323         -21.16     (X['account_number_avg_order_total_per_account...  5
RGDT_Rule65   0.011639         -24.01     (X['account_number_num_fraud_transactions_per_...  5
RGDT_Rule81   0.011410         -25.00     (X['account_number_avg_order_total_per_account...  6
[ ]:
X_rules_applied.head()
Rule RGDT_Rule137 RGDT_Rule313 RGDT_Rule153 RGDT_Rule65 RGDT_Rule81 RGDT_Rule24 RGDT_Rule45 RGDT_Rule241 RGDT_Rule162 ComplicatedRule RGDT_Rule256 RGDT_Rule35 RGDT_Rule263 RGDT_Rule193 RGDT_Rule195 RGDT_Rule2 HighFraudTxnPerAccountNum RGDT_Rule112 RGDT_Rule272
eid
533-3553708-0918604 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
455-3498977-3144749 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
585-6596459-3918216 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
685-6642742-5806657 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
956-2823525-9957253 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Plotting the performance uplift

We can visualise the performance uplift of the optimised rules using the .plot_performance_uplift() and .plot_performance_uplift_distribution() methods:

  • .plot_performance_uplift(): Generates a scatterplot showing the performance of each rule before and after optimisation.

  • .plot_performance_uplift_distribution(): Generates a boxplot showing the distribution of performance uplifts (original rules vs optimised rules).
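
The quantity these plots summarise can also be inspected directly. The sketch below takes uplift to be the optimised metric minus the original metric for each rule; this is one simple definition and the plotting methods may compute it differently. The rule names and metric values are hypothetical:

```python
# Hypothetical metric values (higher is better; 0 is a perfect score).
orig_rule_performances = {'RuleA': -49.0, 'RuleB': -4.0, 'RuleC': -1.0}
opt_rule_performances = {'RuleA': -0.25, 'RuleB': -1.0, 'RuleC': -1.0}

# Uplift per rule: positive values mean the optimised rule scores better.
uplift = {
    name: opt_rule_performances[name] - orig_rule_performances[name]
    for name in orig_rule_performances
}

# Names of the rules whose metric improved after optimisation.
improved = sorted(name for name, delta in uplift.items() if delta > 0)
print(uplift)
print(improved)
```

A table like this is a useful sanity check alongside the plots, since it shows exactly which rules were left unchanged by the optimiser.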

On the training set

To visualise the uplift on the training set, we can use the class attributes orig_rule_performances and opt_rule_performances in the plotting methods, as these were generated as part of the optimisation process:

[ ]:
ro.plot_performance_uplift(
    orig_rule_performances=ro.orig_rule_performances,
    opt_rule_performances=ro.opt_rule_performances,
    figsize=(10, 5)
)
[Figure: scatterplot of each rule’s performance before and after optimisation (training set)]
[ ]:
ro.plot_performance_uplift_distribution(
    orig_rule_performances=ro.orig_rule_performances,
    opt_rule_performances=ro.opt_rule_performances,
    figsize=(3, 7)
)
[Figure: boxplot of the distribution of performance uplifts (training set)]

On the test set

To visualise the uplift on the test set, we first need to generate the orig_rule_performances and opt_rule_performances parameters used in the plotting methods, as these aren’t created for the test set as part of the optimisation process. To do this, we apply both the original rules and the optimised rules to the test set. Note that before applying the original rules, we need to remove those that have no optimisable conditions, that exclusively contain zero-variance features, or that use features missing from X_train:

[ ]:
# Original rules
rules_to_exclude = (
    ro.rule_names_missing_features
    + ro.rule_names_no_opt_conditions
    + ro.rule_names_zero_var_features
)
rules.filter_rules(exclude=rules_to_exclude)
orig_sys_rule_strings = rules.as_rule_strings(as_numpy=False)
orig_ra = RuleApplier(
    rule_strings=orig_sys_rule_strings,
    opt_func=apd.fit
)
_ = orig_ra.transform(X=X_test)
orig_rule_performances_test = orig_ra.rule_descriptions['OptMetric']
[ ]:
# Optimised rules
_ = ro.transform(X=X_test)
opt_rule_performances_test = ro.rule_descriptions['OptMetric']
[ ]:
ro.plot_performance_uplift(
    orig_rule_performances=orig_rule_performances_test,
    opt_rule_performances=opt_rule_performances_test,
    figsize=(10, 5)
)
[Figure: scatterplot of each rule’s performance before and after optimisation (test set)]
[ ]:
ro.plot_performance_uplift_distribution(
    orig_rule_performances=orig_rule_performances_test,
    opt_rule_performances=opt_rule_performances_test,
    figsize=(3, 7)
)
[Figure: boxplot of the distribution of performance uplifts (test set)]