Classification Metrics Spark Example

This notebook shows how classification metrics can be applied to a dataset stored in Spark, how they can be used in other Iguanas modules, and how to create your own.

Requirements

To run, you’ll need the following:

  • A dataset containing binary predictor columns and a binary target column.


Import packages

[1]:
from iguanas.metrics.classification import Precision, Recall, FScore, Revenue

import numpy as np
import databricks.koalas as ks
from typing import Union

Create data

Let’s create some dummy predictor columns and a binary target column. For this example, let’s assume the dummy predictor columns represent rules that have been applied to a dataset.

[2]:
np.random.seed(0)

y_pred_ks = ks.Series(np.random.randint(0, 2, 1000), name = 'A')
y_preds_ks = ks.DataFrame(np.random.randint(0, 2, (1000, 5)), columns=[i for i in 'ABCDE'])
y_ks = ks.Series(np.random.randint(0, 2, 1000), name = 'label')
amounts_ks = ks.Series(np.random.randint(0, 1000, 1000), name = 'amounts')

Apply optimisation functions

There are currently four classification metrics available:

  • Precision score

  • Recall score

  • Fbeta score

  • Revenue

Note that the FScore, Precision and Recall classes are roughly 100 times faster on larger datasets than the equivalent functions in Scikit-learn's metrics module. They also accept Koalas DataFrames, whereas the Scikit-learn functions do not.
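These metrics follow the standard definitions. As a reference point, here is a minimal pure-NumPy sketch of what they compute for a single binary predictor (illustrative only, not the Iguanas implementation):

```python
import numpy as np

def precision_score(y_true, y_pred):
    # TP / (TP + FP): fraction of flagged records that are truly positive
    tp = (y_true * y_pred).sum()
    return tp / y_pred.sum() if y_pred.sum() > 0 else 0.0

def recall_score(y_true, y_pred):
    # TP / (TP + FN): fraction of true positives that are flagged
    tp = (y_true * y_pred).sum()
    return tp / y_true.sum() if y_true.sum() > 0 else 0.0

def fbeta_score(y_true, y_pred, beta=1):
    # Weighted harmonic mean of precision and recall
    p, r = precision_score(y_true, y_pred), recall_score(y_true, y_pred)
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0])
print(precision_score(y_true, y_pred))  # 2 caught out of 3 flagged -> 0.666...
```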

Instantiate class and run fit method

We can run the .fit() method to calculate the optimisation metric for each column in the dataset.

Precision score

[3]:
precision = Precision()
# Single predictor
rule_precision_ks = precision.fit(y_true=y_ks, y_preds=y_pred_ks, sample_weight=None)
# Multiple predictors
rule_precisions_ks = precision.fit(y_true=y_ks, y_preds=y_preds_ks, sample_weight=None)

Recall score

[4]:
recall = Recall()
# Single predictor
rule_recall_ks = recall.fit(y_true=y_ks, y_preds=y_pred_ks, sample_weight=None)
# Multiple predictors
rule_recalls_ks = recall.fit(y_true=y_ks, y_preds=y_preds_ks, sample_weight=None)

Fbeta score (beta=1)

[5]:
f1 = FScore(beta=1)
# Single predictor
rule_f1_ks = f1.fit(y_true=y_ks, y_preds=y_pred_ks, sample_weight=None)
# Multiple predictors
rule_f1s_ks = f1.fit(y_true=y_ks, y_preds=y_preds_ks, sample_weight=None)

Revenue

[6]:
rev = Revenue(y_type='Fraud', chargeback_multiplier=2)
# Single predictor
rule_rev_ks = rev.fit(y_true=y_ks, y_preds=y_pred_ks, sample_weight=amounts_ks)
# Multiple predictors
rule_revs_ks = rev.fit(y_true=y_ks, y_preds=y_preds_ks, sample_weight=amounts_ks)

Outputs

The .fit() method returns the optimisation metric defined by the class:

[7]:
rule_precision_ks, rule_precisions_ks
[7]:
(0.48214285714285715,
 array([0.4875717 , 0.47109208, 0.47645951, 0.48850575, 0.4251497 ]))
[8]:
rule_recall_ks, rule_recalls_ks
[8]:
(0.5051975051975052,
 array([0.53014553, 0.45738046, 0.52598753, 0.53014553, 0.44282744]))
[9]:
rule_f1_ks, rule_f1s_ks
[9]:
(0.4934010152284264,
 array([0.50796813, 0.46413502, 0.5       , 0.50847458, 0.43380855]))
[10]:
rule_rev_ks, rule_revs_ks
[10]:
(1991, array([ 15119, -14481,  11721,  25063, -74931]))
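The exact formula behind Revenue is defined by Iguanas, but a fraud-revenue metric of this general shape typically credits the caught fraud (weighted by the chargeback multiplier) and debits the legitimate spend that was wrongly declined. A rough illustrative sketch under that assumption (not the actual Iguanas code, and not expected to reproduce the numbers above):

```python
import numpy as np

def revenue_sketch(y_true, y_pred, amounts, chargeback_multiplier=2):
    # Assumed shape of a fraud-revenue metric (illustrative only):
    # credit caught fraud (true positives) at the chargeback multiplier,
    # debit blocked legitimate spend (false positives) at face value.
    tp_amounts = (amounts * y_true * y_pred).sum()
    fp_amounts = (amounts * (1 - y_true) * y_pred).sum()
    return chargeback_multiplier * tp_amounts - fp_amounts

y_true = np.array([1, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0])
amounts = np.array([100, 50, 200, 10])
print(revenue_sketch(y_true, y_pred, amounts))  # 2*100 - 50 = 150
```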

The .fit() method can be passed to various Iguanas modules as an argument (wherever the opt_func parameter appears). For example, in the RuleGeneratorOpt module, passing a .fit() method as the opt_func argument sets the metric used to optimise the rules.
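In practice, any callable with the (y_true, y_preds, sample_weight) signature can be plugged in wherever opt_func appears. A sketch of that contract, using a hypothetical choose_best_rule function as a stand-in for an Iguanas optimiser:

```python
import numpy as np

def choose_best_rule(y_preds_by_rule, y_true, opt_func):
    # Stand-in for an Iguanas optimiser: score each rule's binary
    # predictions with the supplied metric and return the best rule name.
    scores = {name: opt_func(y_true=y_true, y_preds=pred, sample_weight=None)
              for name, pred in y_preds_by_rule.items()}
    return max(scores, key=scores.get)

# Any metric with the right signature plugs in, e.g. precision.fit or
# f1.fit from the cells above. Here we use a toy precision function.
def toy_precision(y_true, y_preds, sample_weight=None):
    return (y_true * y_preds).sum() / max(y_preds.sum(), 1)

y_true = np.array([1, 1, 0, 0])
rules = {'A': np.array([1, 0, 1, 0]), 'B': np.array([1, 1, 0, 0])}
print(choose_best_rule(rules, y_true, toy_precision))  # 'B' (1.0 vs 0.5)
```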


Creating your own optimisation function

Say we want to create a class which calculates the Positive likelihood ratio (TP rate/FP rate).

The main class structure involves having a .fit() method with three arguments: the binary target, the binary predictor(s) and any event-specific weights to apply. This method should return a numeric value (or, when a DataFrame of predictors is passed, an array with one value per predictor column).

[14]:
class PositiveLikelihoodRatio:

    def fit(self,
            y_true: ks.Series,
            y_preds: Union[ks.Series, ks.DataFrame],
            sample_weight: ks.Series) -> float:

        def _calc_plr(y_true, y_preds):
            # Calculate TPR
            tpr = (y_true * y_preds).sum() / y_true.sum()
            # Calculate FPR
            fpr = ((1 - y_true) * y_preds).sum()/(1 - y_true).sum()
            return 0 if tpr == 0 or fpr == 0 else tpr/fpr

        # Set this option to allow calc of TPR/FPR on Koalas dataframes
        with ks.option_context("compute.ops_on_diff_frames", True):
            if y_preds.ndim == 1:
                return _calc_plr(y_true, y_preds)
            else:
                plrs = np.empty(y_preds.shape[1])
                for i, col in enumerate(y_preds.columns):
                    plrs[i] = _calc_plr(y_true, y_preds[col])
                return plrs

We can then apply the .fit() method to the dataset to check it works:

[12]:
plr = PositiveLikelihoodRatio()
# Single predictor
rule_plr_ks = plr.fit(y_true=y_ks, y_preds=y_pred_ks, sample_weight=None)
# Multiple predictors
rule_plrs_ks = plr.fit(y_true=y_ks, y_preds=y_preds_ks, sample_weight=None)

[13]:
rule_plr_ks, rule_plrs_ks
[13]:
(1.004588142519177,
 array([1.02666243, 0.96105448, 0.98196952, 1.0305076 , 0.79801195]))

Finally, after instantiating the class, we can pass its .fit() method to a relevant Iguanas module (for example, to the opt_func parameter of the BayesianOptimiser class, so that rules are optimised to maximise the Positive Likelihood Ratio).