Rules Example¶
This notebook contains an example of how the Rules class can be used to define a set of rules using one of two representations, and how we can switch between the different representations.
A rule set can be defined using one of the two following representations:
Dictionary representation - here each joining condition (defined using the key condition) and each rule condition (defined using the keys field, operator and value) of each rule are defined in a dictionary.
String representation - here each rule is defined using the Pandas syntax, stored as a string. Python’s built-in eval() function can be used to evaluate the rule using this representation on a dataset.
We can also convert either of the above representations to the lambda expression format. This format allows different values to be injected into the rule string, which can then be evaluated on a dataset. This is useful when optimising existing rules on a dataset.
Import packages¶
[1]:
from iguanas.rules import Rules
from iguanas.metrics.classification import FScore
import pandas as pd
import numpy as np
Create dummy dataset¶
[2]:
np.random.seed(0)
X = pd.DataFrame(
{
'payer_id_sum_approved_txn_amt_per_paypalid_1day': np.random.uniform(0, 1000, 1000),
'payer_id_sum_approved_txn_amt_per_paypalid_7day': np.random.uniform(0, 7000, 1000),
'payer_id_sum_approved_txn_amt_per_paypalid_30day': np.random.uniform(0, 30000, 1000),
'num_items': np.random.randint(0, 10, 1000),
'ml_cc_v0': np.random.uniform(0, 1, 1000),
'method_clean': ['checkout', 'login', 'bad_login', 'bad_checkout', 'fraud_login', 'fraud_checkout', 'signup', 'bad_signup', 'fraud_signup', np.nan] * 100,
'ip_address': ['192.168.0.1', np.nan] * 500,
'ip_isp': ['BT', np.nan, '', ''] * 250
}
)
y = pd.Series(np.random.randint(0, 2, 1000))
Instantiate Rules class¶
Using dictionary representation¶
As mentioned above, we can define a rule set using one of the two following representations - Dictionary or String.
Let’s first define a set of rules using the dictionary representation:
[3]:
rule_dicts = {
'Rule1': {
'condition': 'AND',
'rules': [{
'condition': 'OR',
'rules': [{
'field': 'payer_id_sum_approved_txn_amt_per_paypalid_1day',
'operator': 'greater_or_equal',
'value': 60.0
},
{
'field': 'payer_id_sum_approved_txn_amt_per_paypalid_7day',
'operator': 'greater',
'value': 120.0
},
{
'field': 'payer_id_sum_approved_txn_amt_per_paypalid_30day',
'operator': 'less_or_equal',
'value': 500.0
}
]},
{
'field': 'num_items',
'operator': 'equal',
'value': 1.0
}
]},
'Rule2': {
'condition': 'AND',
'rules': [{
'field': 'ml_cc_v0',
'operator': 'less',
'value': 0.315
},
{
'condition': 'OR',
'rules': [{
'field': 'method_clean',
'operator': 'equal',
'value': 'checkout'
},
{
'field': 'method_clean',
'operator': 'begins_with',
'value': 'checkout'
},
{
'field': 'method_clean',
'operator': 'ends_with',
'value': 'checkout'
},
{
'field': 'method_clean',
'operator': 'contains',
'value': 'checkout'
},
{
'field': 'ip_address',
'operator': 'is_not_null',
'value': None
},
{
'field': 'ip_isp',
'operator': 'is_not_empty',
'value': None
}]
}]
}
}
Now that we have defined our rule set using the dictionary representation, we can instantiate the Rules class.
[4]:
rules = Rules(rule_dicts=rule_dicts)
Once the class is instantiated, we can switch to the string representation using the as_rule_strings() method:
[5]:
rule_strings = rules.as_rule_strings(as_numpy=False)
Outputs¶
The .as_rule_strings() method returns a dictionary of the set of rules defined using the standard Iguanas string format (values) and their names (keys). It also saves this dictionary as the class attribute rule_strings.
[6]:
rule_strings
[6]:
{'Rule1': "((X['payer_id_sum_approved_txn_amt_per_paypalid_1day']>=60.0)|(X['payer_id_sum_approved_txn_amt_per_paypalid_7day']>120.0)|(X['payer_id_sum_approved_txn_amt_per_paypalid_30day']<=500.0))&(X['num_items']==1.0)",
'Rule2': "(X['ml_cc_v0']<0.315)&((X['method_clean']=='checkout')|(X['method_clean'].str.startswith('checkout', na=False))|(X['method_clean'].str.endswith('checkout', na=False))|(X['method_clean'].str.contains('checkout', na=False, regex=False))|(~X['ip_address'].isna())|(X['ip_isp'].fillna('')!=''))"}
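The mapping from the dictionary format to the string format shown above can be pictured as a recursive walk over the nested dictionary. The following is a minimal sketch of that idea, not the Iguanas implementation: the function `to_rule_string` and the lookup tables `OPERATORS` and `JOINERS` are hypothetical names, and only the simple comparison operators used in Rule1 are covered.

```python
# Hypothetical lookup tables: only the simple comparison operators used in
# Rule1 are covered here (string and null checks are left out).
OPERATORS = {
    'equal': '==',
    'greater': '>',
    'greater_or_equal': '>=',
    'less': '<',
    'less_or_equal': '<=',
}
JOINERS = {'AND': '&', 'OR': '|'}


def to_rule_string(node, top_level=True):
    """Recursively convert a nested rule dictionary into a rule string."""
    if 'condition' in node:
        # Joining node: convert each child, then join with & or |.
        joined = JOINERS[node['condition']].join(
            to_rule_string(child, top_level=False) for child in node['rules']
        )
        # Nested groups are wrapped in parentheses; the outermost is not.
        return joined if top_level else f"({joined})"
    # Leaf node: a single comparison on one column of X.
    symbol = OPERATORS[node['operator']]
    return f"(X['{node['field']}']{symbol}{node['value']!r})"


rule1 = {
    'condition': 'AND',
    'rules': [
        {
            'condition': 'OR',
            'rules': [
                {'field': 'payer_id_sum_approved_txn_amt_per_paypalid_1day',
                 'operator': 'greater_or_equal', 'value': 60.0},
                {'field': 'payer_id_sum_approved_txn_amt_per_paypalid_7day',
                 'operator': 'greater', 'value': 120.0},
                {'field': 'payer_id_sum_approved_txn_amt_per_paypalid_30day',
                 'operator': 'less_or_equal', 'value': 500.0},
            ],
        },
        {'field': 'num_items', 'operator': 'equal', 'value': 1.0},
    ],
}

rule1_string = to_rule_string(rule1)
print(rule1_string)
```

For this input the sketch produces the same string as the Rule1 entry in the output above.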
Using string representation¶
Now let’s instead define the same set of rules using the string representation:
[7]:
rule_strings = {
'Rule1': "((X['payer_id_sum_approved_txn_amt_per_paypalid_1day']>=60.0)|(X['payer_id_sum_approved_txn_amt_per_paypalid_7day']>120.0)|(X['payer_id_sum_approved_txn_amt_per_paypalid_30day']<=500.0))&(X['num_items']==1.0)",
'Rule2': "(X['ml_cc_v0']<0.315)&((X['method_clean']=='checkout')|(X['method_clean'].str.startswith('checkout', na=False))|(X['method_clean'].str.endswith('checkout', na=False))|(X['method_clean'].str.contains('checkout', na=False))|(~X['ip_address'].isna())|(X['ip_isp'].fillna('')!=''))"
}
Now that we have defined our rule set using the string representation, we can instantiate the Rules class.
[8]:
rules = Rules(rule_strings=rule_strings)
Once the class is instantiated, we can switch to the dictionary representation using the as_rule_dicts() method:
[9]:
rule_dicts = rules.as_rule_dicts()
Outputs¶
The .as_rule_dicts() method returns a dictionary of the set of rules defined using the standard Iguanas dictionary format (values) and their names (keys). It also saves this dictionary as the class attribute rule_dicts.
[10]:
rule_dicts
[10]:
{'Rule1': {'condition': 'AND',
'rules': [{'condition': 'OR',
'rules': [{'field': 'payer_id_sum_approved_txn_amt_per_paypalid_1day',
'operator': 'greater_or_equal',
'value': 60.0},
{'field': 'payer_id_sum_approved_txn_amt_per_paypalid_7day',
'operator': 'greater',
'value': 120.0},
{'field': 'payer_id_sum_approved_txn_amt_per_paypalid_30day',
'operator': 'less_or_equal',
'value': 500.0}]},
{'field': 'num_items', 'operator': 'equal', 'value': 1.0}]},
'Rule2': {'condition': 'AND',
'rules': [{'field': 'ml_cc_v0', 'operator': 'less', 'value': 0.315},
{'condition': 'OR',
'rules': [{'field': 'method_clean',
'operator': 'equal',
'value': 'checkout'},
{'field': 'method_clean', 'operator': 'begins_with', 'value': 'checkout'},
{'field': 'method_clean', 'operator': 'ends_with', 'value': 'checkout'},
{'field': 'method_clean', 'operator': 'contains', 'value': 'checkout'},
{'field': 'ip_address', 'operator': 'is_not_null', 'value': None},
{'field': 'ip_isp', 'operator': 'is_not_empty', 'value': None}]}]}}
Converting to lambda expressions¶
Once a rule set has been defined using one of the two representations, it can be converted to the lambda expression format. This format allows different values to be injected into the rule string, which can then be evaluated on a dataset. This is useful when optimising existing rules on a dataset.
We can use the above instantiated Rules class along with the .as_rule_lambdas() method to convert the rules to the lambda expression format. The lambda expressions can be created such that they receive either keyword arguments as inputs, or positional arguments as inputs.
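To make the idea of "injecting values into a rule string" concrete, here is a simplified stand-in built on Python's `str.format`. It is a sketch under stated assumptions, not how Iguanas builds its lambdas: `make_kwargs_lambda`, `make_args_lambda` and the placeholder templates are hypothetical. What it shares with the real format is that calling the lambda returns a rule string, which can then be evaluated on a dataset.

```python
def make_kwargs_lambda(template):
    """Return a function that fills named placeholders in `template`."""
    return lambda **kwargs: template.format(**kwargs)


def make_args_lambda(template):
    """Return a function that fills positional placeholders in `template`."""
    return lambda *args: template.format(*args)


# Named placeholders, filled by keyword.
kwargs_lambda = make_kwargs_lambda(
    "(X['num_items']=={num_items})&(X['ml_cc_v0']<{ml_cc_v0})"
)
# Anonymous placeholders, filled in order.
args_lambda = make_args_lambda("(X['num_items']=={})&(X['ml_cc_v0']<{})")

original = kwargs_lambda(num_items=1.0, ml_cc_v0=0.315)
candidate = args_lambda(2.0, 0.5)
print(original)
print(candidate)
```

The keyword form is easier to read and less error-prone, while the positional form suits optimisers that work with a flat list of values.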
with_kwargs = True¶
Let’s first convert the rule set to lambda expressions that receive keyword arguments as inputs:
[11]:
rule_lambdas = rules.as_rule_lambdas(
as_numpy=False,
with_kwargs=True
)
Outputs¶
The .as_rule_lambdas() method returns a dictionary of the set of rules defined using the standard Iguanas lambda expression format (values) and their names (keys). It also saves this dictionary as the class attribute rule_lambdas.
Three useful attributes created by running the .as_rule_lambdas() method are:
lambda_kwargs (dict): For each rule (keys), a dictionary containing the features used in the rule (keys) and the current values (values). Only populated when .as_rule_lambdas() is used with the keyword argument with_kwargs=True.
lambda_args (dict): For each rule (keys), a list containing the current values used in the rule. Only populated when .as_rule_lambdas() is used with the keyword argument with_kwargs=False.
rule_features (dict): For each rule (keys), a list containing the features used in the rule. Only populated when .as_rule_lambdas() is used with the keyword argument with_kwargs=False.
[12]:
rule_lambdas
[12]:
{'Rule1': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
'Rule2': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>}
[13]:
rules.lambda_kwargs
[13]:
{'Rule1': {'payer_id_sum_approved_txn_amt_per_paypalid_1day': 60.0,
'payer_id_sum_approved_txn_amt_per_paypalid_7day': 120.0,
'payer_id_sum_approved_txn_amt_per_paypalid_30day': 500.0,
'num_items': 1.0},
'Rule2': {'ml_cc_v0': 0.315}}
Across both rules, we have the following features: payer_id_sum_approved_txn_amt_per_paypalid_1day, payer_id_sum_approved_txn_amt_per_paypalid_7day, payer_id_sum_approved_txn_amt_per_paypalid_30day, num_items, ml_cc_v0, method_clean, ip_address and ip_isp.
A few points to note:
When the same feature is used more than once in a given rule, a suffix with the format ‘%<n>’ will be added, where n is a counter used to distinguish the conditions.
The values of some of these features cannot be changed, since the conditions related to these features do not have values - ‘ip_address’ is checked for nulls and ‘ip_isp’ is checked for empty cells. These are omitted from the lambda_kwargs class attribute (as seen above).
So we can construct a dictionary for each rule, for the features whose values can be changed. The keys of each dictionary are the features, with the values being the new values that we want to try in the rule:
[14]:
new_values = {
'Rule1': {
'payer_id_sum_approved_txn_amt_per_paypalid_1day': 100.0,
'payer_id_sum_approved_txn_amt_per_paypalid_7day': 200.0,
'payer_id_sum_approved_txn_amt_per_paypalid_30day': 600.0,
'num_items': 2.0
},
'Rule2': {
'ml_cc_v0': 0.5,
'method_clean': 'login',
'method_clean%0': 'bad_',
'method_clean%1': '_bad',
'method_clean%2': 'fraud'
}
}
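The keys used for Rule2 in the cell above follow a pattern: the first condition on a feature uses the plain feature name, and later conditions on the same feature get the suffixes %0, %1 and so on. The sketch below reproduces that pattern for a list of features given in condition order. The helper `kwarg_names` is hypothetical, and the numbering is inferred from the example keys rather than taken from the Iguanas source.

```python
from collections import Counter


def kwarg_names(features):
    """Name each condition's keyword, suffixing repeated features with %<n>."""
    seen = Counter()
    names = []
    for feature in features:
        count = seen[feature]
        # First occurrence keeps the plain name; repeats get %0, %1, ...
        names.append(feature if count == 0 else f"{feature}%{count - 1}")
        seen[feature] += 1
    return names


# Features of the value-carrying conditions in Rule2, in order.
rule2_features = ['ml_cc_v0'] + ['method_clean'] * 4
print(kwarg_names(rule2_features))
```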
Then we can loop through the rules, inject the new values into the lambda expression and evaluate it (with the new values) on the dataset:
[15]:
X_rules = {}
for rule_name, rule_lambda in rules.rule_lambdas.items():
    new_values_for_rule = new_values[rule_name]
    X_rules[rule_name] = eval(rule_lambda(**new_values_for_rule))
X_rules = pd.DataFrame(X_rules, index=X.index)
[16]:
X_rules.sum()
[16]:
Rule1 99
Rule2 352
dtype: int64
We can also use the lambda_kwargs class attribute to inject the original values into the lambda expression and evaluate it on the dataset:
[17]:
X_rules = {}
for rule_name, rule_lambda in rules.rule_lambdas.items():
    X_rules[rule_name] = eval(rule_lambda(**rules.lambda_kwargs[rule_name]))
X_rules = pd.DataFrame(X_rules, index=X.index)
[18]:
X_rules.sum()
[18]:
Rule1 96
Rule2 233
dtype: int64
with_kwargs = False¶
Now let’s convert the rule set to lambda expressions that receive positional arguments as inputs:
[19]:
rule_lambdas = rules.as_rule_lambdas(
as_numpy=False,
with_kwargs=False
)
Outputs¶
The .as_rule_lambdas() method returns a dictionary of the set of rules defined using the standard Iguanas lambda expression format (values) and their names (keys). It also saves this dictionary as the class attribute rule_lambdas.
Three useful attributes created by running the .as_rule_lambdas() method are:
lambda_kwargs (dict): For each rule (keys), a dictionary containing the features used in the rule (keys) and the current values (values). Only populated when .as_rule_lambdas() is used with the keyword argument with_kwargs=True.
lambda_args (dict): For each rule (keys), a list containing the current values used in the rule. Only populated when .as_rule_lambdas() is used with the keyword argument with_kwargs=False.
rule_features (dict): For each rule (keys), a list containing the features used in the rule. Only populated when .as_rule_lambdas() is used with the keyword argument with_kwargs=False.
[20]:
rule_lambdas
[20]:
{'Rule1': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(*args)>,
'Rule2': <function iguanas.rules.convert_rule_dicts_to_rule_strings.ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(*args)>}
[21]:
rules.lambda_args
[21]:
{'Rule1': [60.0, 120.0, 500.0, 1.0], 'Rule2': [0.315]}
[22]:
rules.rule_features
[22]:
{'Rule1': ['payer_id_sum_approved_txn_amt_per_paypalid_1day',
'payer_id_sum_approved_txn_amt_per_paypalid_7day',
'payer_id_sum_approved_txn_amt_per_paypalid_30day',
'num_items'],
'Rule2': ['ml_cc_v0']}
Across both rules, we have the following features: payer_id_sum_approved_txn_amt_per_paypalid_1day, payer_id_sum_approved_txn_amt_per_paypalid_7day, payer_id_sum_approved_txn_amt_per_paypalid_30day, num_items, ml_cc_v0, method_clean, ip_address and ip_isp.
Note that the values of some of these features cannot be changed, since the conditions related to these features do not have values - ‘ip_address’ is checked for nulls and ‘ip_isp’ is checked for empty cells. These are omitted from the lambda_args class attribute (as seen above).
So we can construct a list for each rule, for the features whose values can be changed. The values of the list are new values that we want to try in the rules. We can use the rule_features class attribute to ensure we use the correct order:
[23]:
new_values = {
'Rule1': [100.0, 200.0, 600.0, 2.0],
'Rule2': [0.5, 'login', 'bad_', '_bad', 'fraud']
}
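Rather than typing the positional values by hand, one option is to keep the new values keyed by feature name and let the feature order decide the position of each value. The sketch below does this for Rule1, using a plain dictionary that mirrors the rule_features output shown earlier. The variable names are hypothetical, and the approach assumes each feature appears only once in the rule.

```python
# Mirrors the Rule1 entry of the rule_features attribute shown above.
rule_features = {
    'Rule1': [
        'payer_id_sum_approved_txn_amt_per_paypalid_1day',
        'payer_id_sum_approved_txn_amt_per_paypalid_7day',
        'payer_id_sum_approved_txn_amt_per_paypalid_30day',
        'num_items',
    ],
}

# New values keyed by feature name; the insertion order here is irrelevant.
new_values_by_feature = {
    'Rule1': {
        'num_items': 2.0,
        'payer_id_sum_approved_txn_amt_per_paypalid_30day': 600.0,
        'payer_id_sum_approved_txn_amt_per_paypalid_1day': 100.0,
        'payer_id_sum_approved_txn_amt_per_paypalid_7day': 200.0,
    },
}

# Reorder each rule's values to match the order of its features.
new_args = {
    rule_name: [new_values_by_feature[rule_name][feature] for feature in features]
    for rule_name, features in rule_features.items()
}
print(new_args)
```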
Then we can loop through the rules, inject the new values into the lambda expression and evaluate it (with the new values) on the dataset:
[24]:
X_rules = {}
for rule_name, rule_lambda in rules.rule_lambdas.items():
    new_values_for_rule = new_values[rule_name]
    X_rules[rule_name] = eval(rule_lambda(*new_values_for_rule))
X_rules = pd.DataFrame(X_rules, index=X.index)
[25]:
X_rules.sum()
[25]:
Rule1 99
Rule2 352
dtype: int64
We can also use the lambda_args class attribute to inject the original values into the lambda expression and evaluate it on the dataset:
[26]:
X_rules = {}
for rule_name, rule_lambda in rules.rule_lambdas.items():
    X_rules[rule_name] = eval(rule_lambda(*rules.lambda_args[rule_name]))
X_rules = pd.DataFrame(X_rules, index=X.index)
[27]:
X_rules.sum()
[27]:
Rule1 96
Rule2 233
dtype: int64
Filtering rules¶
We can use the .filter_rules() method to filter a rule set by rule name. Let’s say we define the following rule set, which consists of three rules:
[28]:
rule_strings = {
'Rule1': "(X['payer_id_sum_approved_txn_amt_per_paypalid_1day']>=60.0)",
'Rule2': "(X['payer_id_sum_approved_txn_amt_per_paypalid_7day']>120.0)",
'Rule3': "(X['payer_id_sum_approved_txn_amt_per_paypalid_30day']<=500.0)"
}
[29]:
rules = Rules(rule_strings=rule_strings)
[30]:
rules.rule_strings
[30]:
{'Rule1': "(X['payer_id_sum_approved_txn_amt_per_paypalid_1day']>=60.0)",
'Rule2': "(X['payer_id_sum_approved_txn_amt_per_paypalid_7day']>120.0)",
'Rule3': "(X['payer_id_sum_approved_txn_amt_per_paypalid_30day']<=500.0)"}
Now we can filter the rule set to keep only the rules named in include, or to drop the rules named in exclude:
[31]:
rules.filter_rules(
include=['Rule1'],
exclude=None
)
Outputs¶
The .filter_rules() method does not return a value. Instead, it filters the rules stored in the class, based on the rules that were included or excluded:
[32]:
rules.rule_strings
[32]:
{'Rule1': "(X['payer_id_sum_approved_txn_amt_per_paypalid_1day']>=60.0)"}
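The include/exclude behaviour can be illustrated on a plain dictionary of rule strings. This is a minimal sketch, not the library's code: `filter_rule_dict` is a hypothetical helper, and unlike .filter_rules(), which updates the class in place, it returns a new dictionary.

```python
def filter_rule_dict(rule_strings, include=None, exclude=None):
    """Return a copy of `rule_strings` filtered by rule name."""
    filtered = dict(rule_strings)
    if include is not None:
        # Keep only the named rules.
        filtered = {name: rule for name, rule in filtered.items() if name in include}
    if exclude is not None:
        # Drop the named rules.
        filtered = {name: rule for name, rule in filtered.items() if name not in exclude}
    return filtered


rule_strings = {
    'Rule1': "(X['payer_id_sum_approved_txn_amt_per_paypalid_1day']>=60.0)",
    'Rule2': "(X['payer_id_sum_approved_txn_amt_per_paypalid_7day']>120.0)",
    'Rule3': "(X['payer_id_sum_approved_txn_amt_per_paypalid_30day']<=500.0)",
}

kept = filter_rule_dict(rule_strings, include=['Rule1'])
dropped = filter_rule_dict(rule_strings, exclude=['Rule1'])
print(list(kept))
print(list(dropped))
```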
Returning the features in each rule¶
We can use the .get_rule_features() method to return the unique set of features related to each rule. Let’s say we define the following rule set, which consists of three rules:
[33]:
rule_strings = {
'Rule1': "(X['payer_id_sum_approved_txn_amt_per_paypalid_1day']>=60.0)&(X['payer_id_sum_approved_txn_amt_per_paypalid_7day']>120.0)",
'Rule2': "(X['num_order_items']>20)|((X['payer_id_sum_approved_txn_amt_per_paypalid_7day']<100.0)&(X['num_order_items']>10))",
'Rule3': "(X['payer_id_sum_approved_txn_amt_per_paypalid_30day']<=500.0)"
}
[34]:
rules = Rules(rule_strings=rule_strings)
Now we can return the unique set of features related to each rule:
[35]:
rule_features = rules.get_rule_features()
Outputs¶
The .get_rule_features() method returns a dictionary of the unique set of features (values) related to each rule (keys):
[36]:
rule_features
[36]:
{'Rule1': {'payer_id_sum_approved_txn_amt_per_paypalid_1day',
'payer_id_sum_approved_txn_amt_per_paypalid_7day'},
'Rule2': {'num_order_items',
'payer_id_sum_approved_txn_amt_per_paypalid_7day'},
'Rule3': {'payer_id_sum_approved_txn_amt_per_paypalid_30day'}}
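For rules in the string format, the same information can be approximated with a regular expression that picks out every column name written as X['...']. The sketch below is one possible approach and makes no claim about how .get_rule_features() works internally; `extract_features` and `FEATURE_PATTERN` are hypothetical names, and the pattern assumes column names are single-quoted and contain no single quotes themselves.

```python
import re

# Matches X['column_name'] and captures the column name.
FEATURE_PATTERN = re.compile(r"X\['([^']+)'\]")


def extract_features(rule_string):
    """Return the unique set of column names referenced in a rule string."""
    return set(FEATURE_PATTERN.findall(rule_string))


rule2 = (
    "(X['num_order_items']>20)|"
    "((X['payer_id_sum_approved_txn_amt_per_paypalid_7day']<100.0)"
    "&(X['num_order_items']>10))"
)
rule2_features = extract_features(rule2)
print(sorted(rule2_features))
```

Returning a set means that num_order_items, which appears twice in Rule2, is reported once.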
Applying rules to a dataset¶
Use the .transform() method to apply the rules to a dataset:
[37]:
rule_strings = {
'Rule1': "(X['payer_id_sum_approved_txn_amt_per_paypalid_1day']>=60.0)&(X['payer_id_sum_approved_txn_amt_per_paypalid_7day']>120.0)",
'Rule2': "(X['num_items']>20)|((X['payer_id_sum_approved_txn_amt_per_paypalid_7day']<100.0)&(X['num_items']>10))",
'Rule3': "(X['payer_id_sum_approved_txn_amt_per_paypalid_30day']<=500.0)"
}
[38]:
f1 = FScore(beta=1)
[39]:
rules = Rules(rule_strings=rule_strings, opt_func=f1.fit)
[40]:
X_rules = rules.transform(X=X, y=y)
Outputs¶
The .transform() method returns a dataframe giving the binary columns of the rules as applied to the given dataset.
A useful attribute created by running the .transform() method is:
rule_descriptions: A dataframe showing the logic of the rules and their performance metrics as applied to the given dataset.
[41]:
X_rules.head()
[41]:
Rule | Rule1 | Rule3 | Rule2 |
---|---|---|---|
0 | 1 | 0 | 0 |
1 | 0 | 0 | 0 |
2 | 1 | 0 | 0 |
3 | 1 | 0 | 0 |
4 | 1 | 0 | 0 |
[42]:
rules.rule_descriptions
[42]:
Rule | Precision | Recall | PercDataFlagged | OptMetric | Logic | nConditions |
---|---|---|---|---|---|---|
Rule1 | 0.5 | 0.920477 | 0.926 | 0.648006 | (X['payer_id_sum_approved_txn_amt_per_paypalid... | 2 |
Rule3 | 0.4 | 0.011928 | 0.015 | 0.023166 | (X['payer_id_sum_approved_txn_amt_per_paypalid... | 1 |
Rule2 | 0.0 | 0.000000 | 0.000 | 0.000000 | (X['num_items']>20)|((X['payer_id_sum_approved... | 3 |
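Since the rule set was created with FScore(beta=1), the OptMetric column can be cross-checked against the Precision and Recall columns. The sketch below assumes OptMetric is the standard F-beta score; `f_beta` is a hypothetical helper, not the Iguanas FScore class, and the inputs are the rounded values from the table above.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; returns 0.0 when precision and recall are both 0."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# (precision, recall) pairs copied from the rule_descriptions table.
table = {
    'Rule1': (0.5, 0.920477),
    'Rule3': (0.4, 0.011928),
    'Rule2': (0.0, 0.0),
}
scores = {name: f_beta(p, r, beta=1) for name, (p, r) in table.items()}
for name, score in scores.items():
    print(name, round(score, 6))
```

Up to rounding of the inputs, the results agree with the OptMetric values in the table.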