Rule Applier Spark Example
This notebook contains an example of how the Rule Applier can be used to apply Iguanas-readable rules to a dataset stored in Spark.
Requirements
To run, you’ll need the following:
A dataset containing the same features used in the rules.
Import packages
[1]:
from iguanas.rule_application import RuleApplier
from iguanas.metrics.classification import FScore
import databricks.koalas as ks
Read in data
Let’s read in some dummy data using Koalas, which implements the Pandas DataFrame API on top of Apache Spark:
[4]:
X = ks.read_csv(
    'dummy_data/X_train.csv',
    index_col='eid'
)
y = ks.read_csv(
    'dummy_data/y_train.csv',
    index_col='eid'
).squeeze()
Apply rules
Set up class parameters
Now we can set our class parameters for the Rule Applier. Here we’re specifying an additional metric to calculate for each rule (the F1 score). However, you can omit this if you just need to calculate the standard results (Precision, Recall and PercDataFlagged).
Please see the class docstring for more information on each parameter.
[5]:
fs = FScore(beta=1)
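For reference, the F-beta score combines precision and recall into a single number, and beta=1 weights the two equally (the F1 score). The snippet below is a minimal pure-Python sketch of that standard formula. The function name f_beta is hypothetical and this is not Iguanas's implementation, which may treat edge cases differently.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score from precision and recall.

    Illustrative helper only; not part of Iguanas.
    """
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        # Assumption: treat the undefined 0/0 case as a score of 0.
        return 0.0
    return (1 + b2) * precision * recall / denominator


f1 = f_beta(0.5, 1.0)          # beta=1 weights precision and recall equally
f2 = f_beta(0.5, 1.0, beta=2)  # beta>1 weights recall more heavily
```

With precision 0.5 and recall 1.0, raising beta from 1 to 2 raises the score, because the higher recall counts for more.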
[6]:
params = {
    'rule_strings': {
        'Rule1': "(X['account_number_num_fraud_transactions_per_account_number_1day']>=1)",
        'Rule2': "(X['account_number_num_fraud_transactions_per_account_number_1day']>=1)&(X['account_number_num_fraud_transactions_per_account_number_30day']>=1)",
        'Rule3': "(X['account_number_num_fraud_transactions_per_account_number_1day']>=1)&(X['order_total']>50.87)"
    },
    'opt_func': fs.fit
}
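Each rule string above reads like a Python expression over a dataset named X: comparisons on columns, joined with &. To make that concrete, here is a toy pure-Python sketch that evaluates a string of the same shape. The Column class, the shortened column names and the use of eval are all assumptions made for illustration; this is not a description of how the Rule Applier works internally.

```python
class Column:
    """Tiny stand-in for a dataframe column (hypothetical, for illustration).

    Supports the element-wise '>=', '>' and '&' used in the rule strings.
    """

    def __init__(self, values):
        self.values = list(values)

    def __ge__(self, other):
        return Column(v >= other for v in self.values)

    def __gt__(self, other):
        return Column(v > other for v in self.values)

    def __and__(self, other):
        return Column(a and b for a, b in zip(self.values, other.values))


# Toy dataset: a plain dict of columns, with made-up names and values.
X = {
    'num_fraud_1day': Column([0, 1, 2, 0]),
    'order_total': Column([10.0, 75.5, 20.0, 99.0]),
}

rule = "(X['num_fraud_1day']>=1)&(X['order_total']>50.87)"

# Evaluate the string with X in scope, then convert booleans to 0/1 flags.
flags = [int(v) for v in eval(rule, {'X': X}).values]
```

Only the second record satisfies both conditions, so it is the only one flagged.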
Instantiate class and run
Once the parameters have been set, we can instantiate the class and run the .transform() method to apply the rules to the dataset. Note that you can omit the y parameter if you have unlabelled data. In that case, make sure that any optimisation function passed to opt_func does not expect a target column (see the optimisation_functions module for more information):
[7]:
ara = RuleApplier(**params)
X_rules = ara.transform(
    X=X,
    y=y,
    sample_weight=None
)
/Users/jlaidler/venvs/iguanas_os_dev/lib/python3.8/site-packages/databricks/koalas/frame.py:11847: UserWarning: Koalas doesn't allow columns to be created via a new attribute name
warnings.warn(msg, UserWarning)
Outputs
The .transform() method returns a dataframe of binary columns, one per rule, showing which records each rule flags in the dataset it was applied to (here, the training dataset).
A useful attribute created by running the .transform() method (when the y parameter is given) is:
rule_descriptions: A dataframe showing the logic of the rules and their performance metrics as applied to the dataset.
[8]:
ara.rule_descriptions.head()
[8]:
Rule | Precision | Recall | PercDataFlagged | OptMetric | Logic | nConditions |
---|---|---|---|---|---|---|
Rule1 | 0.991837 | 1.000000 | 0.027547 | 0.995902 | (X['account_number_num_fraud_transactions_per_... | 1 |
Rule2 | 0.991837 | 1.000000 | 0.027547 | 0.995902 | (X['account_number_num_fraud_transactions_per_... | 2 |
Rule3 | 0.995851 | 0.987654 | 0.027097 | 0.991736 | (X['account_number_num_fraud_transactions_per_... | 2 |
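To show what the Precision, Recall and PercDataFlagged columns mean, the sketch below computes them for a single binary rule column against binary labels, using plain Python lists. The helper name describe_rule and the toy data are hypothetical, and PercDataFlagged is assumed to be the fraction of records flagged, which matches the 0 to 1 scale in the table. Iguanas's own calculation may differ in detail, for example in how sample weights are handled.

```python
def describe_rule(flags, y):
    """Precision, recall and fraction flagged for one binary rule column.

    Illustrative helper only; not part of Iguanas.
    """
    n_flagged = sum(flags)
    n_positive = sum(y)
    # Records that the rule flags and that are labelled positive.
    true_pos = sum(1 for f, t in zip(flags, y) if f == 1 and t == 1)
    return {
        # Assumption: report 0 when a ratio is undefined (nothing flagged
        # or no positive labels).
        'Precision': true_pos / n_flagged if n_flagged else 0.0,
        'Recall': true_pos / n_positive if n_positive else 0.0,
        'PercDataFlagged': n_flagged / len(flags),
    }


# Toy data: 8 records, 3 flagged by the rule, 3 labelled positive.
flags = [1, 1, 0, 0, 0, 1, 0, 0]
y = [1, 0, 0, 0, 1, 1, 0, 0]
metrics = describe_rule(flags, y)
```

Here two of the three flagged records are positive and two of the three positives are flagged, so precision and recall are both 2/3, with 3/8 of the records flagged.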
[9]:
X_rules.head()
[9]:
eid | Rule1 | Rule2 | Rule3 |
---|---|---|---|
867-8837095-9305559 | 0 | 0 | 0 |
974-5306287-3527394 | 0 | 0 | 0 |
584-0112844-9158928 | 0 | 0 | 0 |
956-4190732-7014837 | 0 | 0 | 0 |
349-7005645-8862067 | 0 | 0 | 0 |