Unsupervised Metrics Spark Example¶
This notebook contains an example of how unsupervised metrics can be applied to a dataset, how they can be used in other Iguanas modules and how to create your own.
Requirements¶
To run, you’ll need the following:
A dataset containing binary predictor columns
Import packages¶
[1]:
from iguanas.metrics.unsupervised import AlertsPerDay, PercVolume
import numpy as np
import databricks.koalas as ks
Create data¶
Let’s create some dummy predictor columns. For this example, let’s assume the dummy predictor columns represent rules that have been applied to a dataset.
[3]:
np.random.seed(0)
# Single binary predictor (one rule)
y_pred_ks = ks.Series(np.random.randint(0, 2, 1000), name='A')
# Five binary predictors (one column per rule)
y_preds_ks = ks.DataFrame(np.random.randint(0, 2, (1000, 5)), columns=list('ABCDE'))
Apply optimisation functions¶
There are currently two unsupervised metrics available:
Alerts per day (calculates the negative squared difference between the number of records a rule flags per day and the targeted number of records flagged per day)
Percentage of volume (calculates the negative squared difference between the percentage of the overall volume that a rule flags and the targeted percentage of volume flagged)
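The two definitions above can be written out directly. The following is a minimal NumPy sketch of the formulas as described, not the Iguanas implementation; the function names alerts_per_day_score and perc_volume_score are hypothetical, and the input is an assumed binary predictor that flags 504 of 1,000 records.

```python
import numpy as np


def alerts_per_day_score(y_pred, n_alerts_expected_per_day, no_of_days_in_file):
    """Negative squared difference between actual and targeted daily alerts."""
    alerts_per_day = np.sum(y_pred) / no_of_days_in_file
    return -((alerts_per_day - n_alerts_expected_per_day) ** 2)


def perc_volume_score(y_pred, perc_vol_expected):
    """Negative squared difference between actual and targeted flagged share."""
    perc_vol = np.mean(y_pred)
    return -((perc_vol - perc_vol_expected) ** 2)


# Hypothetical binary predictor: 504 of 1,000 records flagged
y_pred = np.array([1] * 504 + [0] * 496)

apd_score = alerts_per_day_score(y_pred, n_alerts_expected_per_day=5, no_of_days_in_file=10)
pv_score = perc_volume_score(y_pred, perc_vol_expected=0.02)

# 50.4 alerts per day against a target of 5, and 50.4% of volume against a target of 2%
print(round(float(apd_score), 2), round(float(pv_score), 6))  # -2061.16 -0.234256
```

Both scores are at most zero, and a score of zero means the rule hits the target exactly, so maximising either metric pushes a rule towards the targeted alert volume.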
Instantiate class and run fit method¶
We can run the .fit() method to calculate the optimisation metric for each column in the dataset.
Alerts per day¶
[4]:
apd = AlertsPerDay(n_alerts_expected_per_day=5, no_of_days_in_file=10)
# Single predictor
rule_apd_ks = apd.fit(y_preds=y_pred_ks)
# Multiple predictors
rule_apds_ks = apd.fit(y_preds=y_preds_ks)
Percentage of volume¶
[5]:
pv = PercVolume(perc_vol_expected=0.02)
# Single predictor
rule_pv_ks = pv.fit(y_preds=y_pred_ks)
# Multiple predictors
rule_pvs_ks = pv.fit(y_preds=y_preds_ks)
Outputs¶
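The .fit() method returns the optimisation metric defined by the class: a single value for a single predictor, and an array with one value per column for multiple predictors: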
[6]:
rule_apd_ks, rule_apds_ks
[6]:
(-2061.16, array([-2237.29, -1738.89, -2313.61, -2227.84, -2034.01]))
[7]:
rule_pv_ks, rule_pvs_ks
[7]:
(-0.234256, array([-0.253009, -0.199809, -0.261121, -0.252004, -0.231361]))
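Because each metric is a negative squared difference, the score closest to zero belongs to the rule nearest the target. As a small sketch of how the array output might be mapped back to rule names, the snippet below reuses the alerts-per-day values printed above; it assumes the array order follows the column order A to E, which the output itself does not state.

```python
import numpy as np

# Scores copied from the alerts-per-day output above, assumed to be in column order A to E
rule_names = list('ABCDE')
rule_apds = np.array([-2237.29, -1738.89, -2313.61, -2227.84, -2034.01])

# Scores are never positive; the largest (closest to zero) is nearest the target
ranked = sorted(zip(rule_names, rule_apds.tolist()), key=lambda pair: pair[1], reverse=True)
best_rule = ranked[0][0]
print(best_rule)  # B
```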
The .fit() method can be passed as an argument to various Iguanas modules (wherever the opt_func parameter appears). For example, in the RuleGeneratorOpt module, passing it as opt_func sets the metric used to optimise the rules.
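The pattern described above, passing a bound .fit method as a callable, can be illustrated without the library. The sketch below is an assumption-laden stand-in: PercVolumeSketch and choose_threshold are hypothetical names invented for this example and do not reflect the RuleGeneratorOpt API or its internals.

```python
import numpy as np


class PercVolumeSketch:
    """Hypothetical stand-in for a metric class; not the Iguanas PercVolume."""

    def __init__(self, perc_vol_expected):
        self.perc_vol_expected = perc_vol_expected

    def fit(self, y_preds):
        # Negative squared distance between the flagged share and the target share
        return -((np.mean(y_preds, axis=0) - self.perc_vol_expected) ** 2)


def choose_threshold(values, thresholds, opt_func):
    """Hypothetical optimiser: keep the threshold whose rule scores highest."""
    scores = [opt_func((values > t).astype(int)) for t in thresholds]
    return thresholds[int(np.argmax(scores))]


pv = PercVolumeSketch(perc_vol_expected=0.02)
values = np.arange(100)  # hypothetical feature taking the values 0..99

# The bound method pv.fit is passed as the callable, without calling it
best_threshold = choose_threshold(values, thresholds=[50, 90, 97], opt_func=pv.fit)
print(best_threshold)  # 97
```

The rule values > 97 flags 2 of 100 records, exactly the 2% target, so it receives the highest score. The optimiser only needs a callable that maps binary predictions to a score, which is why the metric's configuration lives in the class instance rather than in the optimiser.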