Convert Processed Conditions to General Example¶
This notebook shows how the Convert Processed Conditions to General module can be used to convert rule conditions that use processed features (either imputed values or one hot encoded (OHE) values) into rule conditions that use the original, unprocessed features.
Requirements¶
To run, you’ll need the following:
A rule set (stored in the standard Iguanas string format) that contains processed features.
Import packages¶
[3]:
from iguanas.rules import ConvertProcessedConditionsToGeneral, ReturnMappings
import pandas as pd
Read in dataset¶
Let’s first read in the data: X holds the raw, unprocessed features, while y holds the binary target:
[4]:
X = pd.read_csv(
'dummy_data/X.csv',
index_col='eid'
)
y = pd.read_csv(
'dummy_data/y.csv',
index_col='eid'
).squeeze()
Processing the data¶
Now we’ll apply the standard data cleaning steps that need to be carried out before feeding the data into one of the rule generator modules: imputing nulls and one hot encoding the categorical columns:
[5]:
imputed_values = {
'num_items': -1,
'country': 'missing'
}
X_processed = X.fillna(imputed_values)
X_processed = pd.get_dummies(X_processed)
[6]:
X_processed.head()
[6]:
eid | num_items | country_FR | country_GB | country_US | country_missing |
---|---|---|---|---|---|
0 | 1.0 | 0 | 1 | 0 | 0 |
1 | 2.0 | 0 | 0 | 1 | 0 |
2 | -1.0 | 1 | 0 | 0 | 0 |
3 | 3.0 | 0 | 0 | 0 | 1 |
4 | 1.0 | 0 | 1 | 0 | 0 |
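The processing step can be reproduced without the CSV files. The sketch below uses a small hypothetical frame, `X_demo`, whose values were chosen to match the table above; it is not the contents of `dummy_data/X.csv`. Recent pandas versions return boolean dummy columns by default, so `dtype=int` is passed to get the 0/1 values shown in the table.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dummy_data/X.csv (values chosen to match the table).
X_demo = pd.DataFrame(
    {
        'num_items': [1.0, 2.0, np.nan, 3.0, 1.0],
        'country': ['GB', 'US', 'FR', np.nan, 'GB'],
    },
    index=pd.Index(range(5), name='eid'),
)

# Impute nulls per column, then one hot encode the categorical column.
X_demo_processed = X_demo.fillna({'num_items': -1, 'country': 'missing'})
X_demo_processed = pd.get_dummies(X_demo_processed, dtype=int)

print(X_demo_processed)
```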
Generating rules¶
Now let’s say we ran one of the Iguanas rule generators on the processed dataset and generated the following rules:
[7]:
rule_strings = {
'Rule1': "(X['num_items']<2)",
'Rule2': "(X['country_missing']==True)",
'Rule3': "(X['country_US']==True)",
'Rule4': "(X['num_items']<0)&(X['country_missing']==True)"
}
These rule conditions all contain processed features - they have either been imputed or one hot encoded. If we applied them directly to the raw, unprocessed data, one of two problems would arise:
The rules would be misrepresented where they use imputed numeric values: a condition that captured the imputed nulls in the processed data would not capture the nulls themselves in the raw, unprocessed data.
Applying the rules would raise an error, since the one hot encoded variables don’t exist in the raw, unprocessed data.
Hence, to apply the rules to raw, unprocessed data, we need to convert the conditions that use processed features into conditions that use the original, unprocessed features.
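Both problems can be seen with plain pandas. This is a minimal sketch on hypothetical frames (`X_raw_demo` and `X_proc_demo` are illustrative names, with values mirroring the table earlier in the notebook), not output from Iguanas itself.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data and its processed counterpart.
X_raw_demo = pd.DataFrame(
    {
        'num_items': [1.0, 2.0, np.nan, 3.0, 1.0],
        'country': ['GB', 'US', 'FR', np.nan, 'GB'],
    },
    index=pd.Index(range(5), name='eid'),
)
X_proc_demo = pd.get_dummies(
    X_raw_demo.fillna({'num_items': -1, 'country': 'missing'}), dtype=int
)

# Problem 1: the same numeric condition flags different rows, because
# NaN < 2 evaluates to False while the imputed value -1 < 2 is True.
flags_processed = X_proc_demo['num_items'] < 2
flags_raw = X_raw_demo['num_items'] < 2
mismatched_rows = flags_processed[flags_processed != flags_raw].index.tolist()
print(mismatched_rows)

# Problem 2: the one hot encoded column is absent from the raw data.
try:
    X_raw_demo['country_missing'] == True
    ohe_error = None
except KeyError as exc:
    ohe_error = exc
print(repr(ohe_error))
```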
Converting rule conditions¶
First, let’s instantiate the ConvertProcessedConditionsToGeneral class. To do this, we need to provide the imputed values and the mapping of OHE columns to categories. For small datasets, this is relatively straightforward; however, for larger datasets where multiple imputed values have been used, or a large number of columns have been one hot encoded, this can be time-consuming to do manually. Instead, we can use the ReturnMappings class to calculate this information:
[8]:
rm = ReturnMappings()
Let’s first return the imputed values used for each field:
[9]:
imputed_values_mapping = rm.return_imputed_values_mapping(
[['num_items'], -1],
[['country'], 'missing']
)
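The notebook does not display what this call returns. As a rough illustration only, and assuming the mapping is a plain dictionary from column name to imputed value (an assumption, not confirmed here), the hypothetical helper `build_imputed_values_mapping` below shows how `[columns, value]` pairs of this shape could be flattened. It is not the Iguanas implementation.

```python
def build_imputed_values_mapping(*groups):
    """Hypothetical helper: flatten [columns, value] pairs into a dict.

    Assumes (not confirmed by the notebook) that the mapping is keyed by
    column name, with the imputed value as the dictionary value.
    """
    mapping = {}
    for columns, value in groups:
        for column in columns:
            mapping[column] = value
    return mapping


demo_imputed_values = build_imputed_values_mapping(
    [['num_items'], -1],
    [['country'], 'missing'],
)
print(demo_imputed_values)
```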
Now let’s return the category that relates to each OHE’d column:
[10]:
ohe_categories_mapping = rm.return_ohe_categories_mapping(
pre_ohe_cols=X.columns,
post_ohe_cols=X_processed.columns,
pre_ohe_dtypes=X.dtypes
)
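To make the idea of an OHE-to-category mapping concrete, here is a minimal sketch of how one could be derived by hand from the column names alone. The function `infer_ohe_categories` is hypothetical, the output shape (OHE column name mapped to category) is an assumption, and the approach relies on the `<column>_<category>` naming that `pd.get_dummies` uses by default; it is not how Iguanas is known to do it.

```python
def infer_ohe_categories(pre_ohe_cols, post_ohe_cols):
    """Hypothetical helper: map each OHE column to the category it encodes."""
    pre_ohe_cols = list(pre_ohe_cols)
    mapping = {}
    for post_col in post_ohe_cols:
        if post_col in pre_ohe_cols:
            continue  # Column was not one hot encoded.
        # Original columns whose name is a prefix of the OHE column name.
        candidates = [
            col for col in pre_ohe_cols if post_col.startswith(f'{col}_')
        ]
        if candidates:
            # Prefer the longest prefix in case column names overlap.
            original = max(candidates, key=len)
            mapping[post_col] = post_col[len(original) + 1:]
    return mapping


demo_ohe_categories = infer_ohe_categories(
    pre_ohe_cols=['num_items', 'country'],
    post_ohe_cols=[
        'num_items', 'country_FR', 'country_GB', 'country_US', 'country_missing'
    ],
)
print(demo_ohe_categories)
```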
Once we have these mappings, we can instantiate the ConvertProcessedConditionsToGeneral class:
[11]:
c = ConvertProcessedConditionsToGeneral(
imputed_values=imputed_values_mapping,
ohe_categories=ohe_categories_mapping
)
Now we can run the .convert() method to convert the conditions in the rules generated above from using the processed features to using the original, unprocessed features:
[12]:
general_rule_strings = c.convert(
rule_strings=rule_strings,
X=X_processed
)
[13]:
general_rule_strings
[13]:
{'Rule1': "((X['num_items']<2)|(X['num_items'].isna()))",
'Rule2': "(X['country'].isna())",
'Rule3': "(X['country']=='US')",
'Rule4': "(X['num_items'].isna())&(X['country'].isna())"}
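One way to sanity-check a conversion like this is to confirm that each converted rule flags the same rows on the raw data as the original rule does on the processed data. The sketch below does this with Python's built-in `eval` on hypothetical demo frames; the `flagged` helper and the demo data are illustrative assumptions, and the rule strings are copied from the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data and its processed counterpart.
X_raw_demo = pd.DataFrame(
    {
        'num_items': [1.0, 2.0, np.nan, 3.0, 1.0],
        'country': ['GB', 'US', 'FR', np.nan, 'GB'],
    },
    index=pd.Index(range(5), name='eid'),
)
X_proc_demo = pd.get_dummies(
    X_raw_demo.fillna({'num_items': -1, 'country': 'missing'}), dtype=int
)

processed_rules = {
    'Rule1': "(X['num_items']<2)",
    'Rule2': "(X['country_missing']==True)",
    'Rule3': "(X['country_US']==True)",
    'Rule4': "(X['num_items']<0)&(X['country_missing']==True)",
}
general_rules = {
    'Rule1': "((X['num_items']<2)|(X['num_items'].isna()))",
    'Rule2': "(X['country'].isna())",
    'Rule3': "(X['country']=='US')",
    'Rule4': "(X['num_items'].isna())&(X['country'].isna())",
}


def flagged(rule_string, data):
    # The rule strings refer to the dataset as X, so bind that name for eval.
    return eval(rule_string, {'X': data}).tolist()


rules_match = {
    name: flagged(processed_rules[name], X_proc_demo)
    == flagged(general_rules[name], X_raw_demo)
    for name in processed_rules
}
print(rules_match)
```

Since `eval` executes arbitrary code, this check is only suitable for rule strings from a trusted source.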
Outputs¶
The .convert() method returns a dictionary containing the set of rules which account for imputed/OHE variables, defined using the standard Iguanas string format (values) and their names (keys).
Note the following:
If a numeric rule condition initially had a threshold such that the imputed null values were included in the condition, the converted condition has an additional condition to check whether the feature is null.
E.g. Rule1 was initially (X['num_items']<2), which included the imputed value of -1. The converted rule is now ((X['num_items']<2)|(X['num_items'].isna())), with an additional condition to check for nulls.
If a categorical rule condition checks whether the value is the imputed null category, the converted condition explicitly checks for null values.
E.g. Rule2 was initially (X['country_missing']==True). The converted rule is now (X['country'].isna()), which explicitly checks for null values.
For other categorical rule conditions, the converted condition explicitly checks for the category.
E.g. Rule3 was initially (X['country_US']==True). The converted rule is now (X['country']=='US'), which explicitly checks whether the 'country' column is equal to the 'US' category.
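The numeric case in the notes above comes down to one test: does the imputed value itself satisfy the condition? The sketch below illustrates that decision with a hypothetical function, `generalise_numeric_condition`. It is an assumption about the logic rather than Iguanas' actual code, and it does not attempt the further simplification seen in Rule4, where the numeric condition was replaced by the null check alone.

```python
import operator

# Comparison operators that the sketch understands.
OPERATORS = {
    '<': operator.lt,
    '<=': operator.le,
    '>': operator.gt,
    '>=': operator.ge,
    '==': operator.eq,
}


def generalise_numeric_condition(feature, op, threshold, imputed_value):
    """Hypothetical sketch: add a null check when the imputed value passes."""
    condition = f"(X['{feature}']{op}{threshold})"
    if OPERATORS[op](imputed_value, threshold):
        # Imputed nulls were captured, so raw nulls must be captured too.
        return f"({condition}|(X['{feature}'].isna()))"
    return condition


with_null_check = generalise_numeric_condition('num_items', '<', 2, -1)
without_null_check = generalise_numeric_condition('num_items', '>', 2, -1)
print(with_null_check)
print(without_null_check)
```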
A useful attribute created by running the .convert() method is:
rules: Class containing the rules stored in the standard Iguanas string format. Methods from this class can be used to convert the rules into the standard Iguanas dictionary or lambda expression representations. See the rules module for more information.