iguanas.rule_generation.RuleGeneratorDT

class iguanas.rule_generation.RuleGeneratorDT(opt_func: Callable, n_total_conditions: int, tree_ensemble: Union[sklearn.ensemble._forest.RandomForestClassifier, sklearn.ensemble._forest.ExtraTreesClassifier], precision_threshold=0, num_cores=1, target_feat_corr_types=None, verbose=0, rule_name_prefix=None)

Generate rules by extracting the highest performing branches from a tree ensemble model.

Parameters
opt_func : Callable

A function/method which calculates the desired optimisation metric (e.g. Fbeta score).

n_total_conditions : int

The maximum number of conditions per generated rule.

tree_ensemble : Union[RandomForestClassifier, ExtraTreesClassifier]

Instantiated scikit-learn tree ensemble classifier object used to generate rules.

precision_threshold : float, optional

Precision threshold for the tree/branch to be used to create rules. If the overall precision of the tree/branch is less than or equal to this value, it will not be used in rule generation. Note that if bootstrap == True in the tree_ensemble class, the precision will be based on the bootstrapped sample used to create the tree. Defaults to 0.

num_cores : int, optional

The number of cores to use when iterating through the ensemble to generate rules. Defaults to 1.

target_feat_corr_types : Union[Dict[str, List[str]], str], optional

Limits the conditions of the rules based on the target-feature correlation (e.g. if a feature is positively correlated with the target, then only greater-than operators are used for conditions that utilise that feature). Can be either a dictionary listing the features that are positively correlated with the target (under the key PositiveCorr) and those that are negatively correlated with the target (under the key NegativeCorr), or 'Infer' (where each target-feature correlation type is inferred from the data). Defaults to None.

verbose : int, optional

Controls the verbosity - the higher, the more messages. >0 : gives the overall progress of the training of the ensemble model and the extraction of the rules from the trees; >1 : also shows the progress of the training of the individual trees in the ensemble model. Defaults to 0.

rule_name_prefix : str, optional

Prefix to use for each rule name. If None, the standard prefix is used. Defaults to None.
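As a rough illustration of the two less obvious arguments, the sketch below defines a hypothetical F-beta function that could play the role of opt_func, and a target_feat_corr_types dictionary using the PositiveCorr and NegativeCorr keys described above. The function name fbeta_score_sketch, the argument order (y_preds, y_true, sample_weight) and the feature names are assumptions made for illustration; check how opt_func is actually called before relying on this shape.

```python
from typing import Optional, Sequence


def fbeta_score_sketch(y_preds: Sequence[int], y_true: Sequence[int],
                       sample_weight: Optional[Sequence[float]] = None,
                       beta: float = 1.0) -> float:
    """Weighted F-beta score for binary predictions (hypothetical helper)."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    # Weighted true positives, predicted positives and actual positives.
    tp = sum(w for p, t, w in zip(y_preds, y_true, sample_weight) if p == 1 and t == 1)
    pred_pos = sum(w for p, w in zip(y_preds, sample_weight) if p == 1)
    actual_pos = sum(w for t, w in zip(y_true, sample_weight) if t == 1)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / pred_pos
    recall = tp / actual_pos
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


# Hypothetical feature names; only the two keys come from the parameter description.
target_feat_corr_types = {
    "PositiveCorr": ["num_failed_logins"],
    "NegativeCorr": ["account_age_days"],
}

# One true positive, one false positive, one false negative, one true negative.
score = fbeta_score_sketch(y_preds=[1, 1, 0, 0], y_true=[1, 0, 1, 0])
```

Any callable that returns a single number to be maximised should fit the same role; F-beta is used here only because the parameter description mentions it.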

Attributes
rule_strings : Dict[str, str]

The generated rules, defined using the standard Iguanas string format (values) and their names (keys).

rule_descriptions : PandasDataFrameType

A dataframe showing the logic of the rules and their performance metrics on the given dataset.

fit(X: iguanas.utils.typing.pandas.core.frame.DataFrame, y: iguanas.utils.typing.pandas.core.series.Series, sample_weight=None) -> iguanas.utils.typing.pandas.core.frame.DataFrame

Generates rules by extracting the highest performing branches in a tree ensemble model.

Parameters
X : PandasDataFrameType

The feature set used for training the model.

y : PandasSeriesType

The target column.

sample_weight : PandasSeriesType, optional

Record-wise weights to apply. Defaults to None.

Returns
PandasDataFrameType

The binary columns of the rules on the fitted dataset.
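The general idea of turning tree branches into rules can be sketched on a toy tree. This is not the Iguanas implementation: the nested-dict tree, the helper names, the Rule_0 naming and the rule string layout are all assumptions made for illustration, and the n_total_conditions limit is not modelled. Each root-to-leaf path becomes a conjunction of conditions, and a branch is kept only if its leaf precision is strictly greater than precision_threshold.

```python
from typing import Dict, List, Tuple

# A toy decision tree as nested dicts (hypothetical structure, not sklearn's).
# Internal nodes split on `feature <= threshold`; leaves store class counts.
toy_tree = {
    "feature": "amount", "threshold": 100.0,
    "left": {"leaf": True, "n_pos": 2, "n_neg": 18},
    "right": {
        "feature": "num_orders", "threshold": 3.0,
        "left": {"leaf": True, "n_pos": 9, "n_neg": 1},
        "right": {"leaf": True, "n_pos": 1, "n_neg": 9},
    },
}


def extract_branches(node: dict, path: Tuple[str, ...] = ()) -> List[Tuple[str, float]]:
    """Return (rule_string, precision) for every root-to-leaf branch."""
    if node.get("leaf"):
        total = node["n_pos"] + node["n_neg"]
        precision = node["n_pos"] / total if total else 0.0
        return [("&".join(path), precision)]
    feat, thr = node["feature"], node["threshold"]
    # The left child satisfies the split condition, the right child negates it.
    left = extract_branches(node["left"], path + (f"(X['{feat}']<={thr})",))
    right = extract_branches(node["right"], path + (f"(X['{feat}']>{thr})",))
    return left + right


def generate_rules(tree: dict, precision_threshold: float = 0.0) -> Dict[str, str]:
    """Keep branches whose precision is strictly above the threshold."""
    kept = [rule for rule, prec in extract_branches(tree) if prec > precision_threshold]
    return {f"Rule_{i}": rule for i, rule in enumerate(kept)}


rules = generate_rules(toy_tree, precision_threshold=0.5)
```

With a threshold of 0.5, only the middle branch (9 positives out of 10 records) survives; the other two branches have a precision of 0.1 and are dropped.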

as_rule_dicts() -> Dict[str, dict]

Converts rules into the standard Iguanas dictionary format.

Returns
Dict[str, dict]

Rules in the standard Iguanas dictionary format.

as_rule_lambdas(as_numpy: bool, with_kwargs: bool) -> Dict[str, Callable[[dict], str]]

Converts rules into the standard Iguanas lambda expression format.

Parameters
as_numpy : bool

If True, the conditions in the string format will use NumPy rather than Pandas. These rules are generally evaluated more quickly on larger datasets stored as Pandas DataFrames.

with_kwargs : bool

If True, the string in the lambda expression is created such that the inputs are keyword arguments. If False, the inputs are positional arguments.

Returns
Dict[str, Callable[[dict], str]]

Rules in the standard Iguanas lambda expression format.
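The difference between the two with_kwargs settings can be pictured with plain Python lambdas that fill threshold values into a rule string. The exact lambda layout Iguanas produces is not shown in this reference, so the rule text, the placeholder names and the feature names below are assumptions; only the keyword-versus-positional distinction is taken from the parameter description.

```python
# Hypothetical rule with two conditions; feature and placeholder names are made up.

# with_kwargs=True style: the values are supplied as keyword arguments.
rule_lambdas_kwargs = {
    "Rule_0": lambda **kwargs: "(X['amount']>{amount})&(X['num_orders']<={num_orders})".format(**kwargs),
}

# with_kwargs=False style: the values are supplied positionally, in condition order.
rule_lambdas_args = {
    "Rule_0": lambda *args: "(X['amount']>{})&(X['num_orders']<={})".format(*args),
}

from_kwargs = rule_lambdas_kwargs["Rule_0"](amount=100, num_orders=3)
from_args = rule_lambdas_args["Rule_0"](100, 3)
```

Both calls yield the same rule string; the keyword form is easier to read, while the positional form suits code that passes a flat list of values.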

as_rule_strings(as_numpy: bool) -> Dict[str, str]

Converts rules into the standard Iguanas string format.

Parameters
as_numpy : bool

If True, the conditions in the string format will use NumPy rather than Pandas. These rules are generally evaluated more quickly on larger datasets stored as Pandas DataFrames.

Returns
Dict[str, str]

Rules in the standard Iguanas string format.

filter_rules(include=None, exclude=None) -> None

Filters the rules by their names.

Parameters
include : List[str], optional

The list of rule names to keep. Defaults to None.

exclude : List[str], optional

The list of rule names to drop. Defaults to None.

Raises
Exception

Raised if include and exclude have any rule names in common.
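The filtering behaviour described above can be sketched on a plain dictionary of rules. The helper name filter_rules_sketch is hypothetical, and unlike the real method (which returns None and so presumably updates the stored rules in place) this version returns a filtered copy so that the result is easy to inspect.

```python
from typing import Dict, List, Optional


def filter_rules_sketch(rule_strings: Dict[str, str],
                        include: Optional[List[str]] = None,
                        exclude: Optional[List[str]] = None) -> Dict[str, str]:
    """Return a copy of rule_strings filtered by rule name."""
    if include is not None and exclude is not None:
        overlap = set(include) & set(exclude)
        if overlap:
            # A name cannot be both kept and dropped.
            raise Exception(f"include and exclude share rule names: {sorted(overlap)}")
    kept = dict(rule_strings)
    if include is not None:
        kept = {name: rule for name, rule in kept.items() if name in include}
    if exclude is not None:
        kept = {name: rule for name, rule in kept.items() if name not in exclude}
    return kept


# Hypothetical rule names and strings.
example_rules = {
    "Rule_0": "(X['amount']>100)",
    "Rule_1": "(X['amount']>500)",
    "Rule_2": "(X['num_orders']<=3)",
}
only_first_two = filter_rules_sketch(example_rules, include=["Rule_0", "Rule_1"])
without_last = filter_rules_sketch(example_rules, exclude=["Rule_2"])
```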

get_rule_features() -> Dict[str, set]

Returns the set of unique features present in each rule.

Returns
Dict[str, set]

Set of unique features (values) in each rule (keys).
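The shape of the returned dictionary can be illustrated by pulling feature names out of rule strings with a regular expression. This sketch assumes rules reference features as X['feature_name']; that layout, the example rules and the helper name are assumptions, and the library may derive the features differently.

```python
import re
from typing import Dict, Set

# Hypothetical rules; the X['feature'] pattern is an assumed string layout.
rule_strings = {
    "Rule_0": "(X['amount']>100)&(X['num_orders']<=3)",
    "Rule_1": "(X['amount']>500)|(X['amount']<=5)",
}


def rule_features_sketch(rules: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each rule name to the set of feature names it references."""
    pattern = re.compile(r"X\['([^']+)'\]")
    # A set removes duplicates when a feature appears in several conditions.
    return {name: set(pattern.findall(rule)) for name, rule in rules.items()}


features = rule_features_sketch(rule_strings)
```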

transform(X: Union[iguanas.utils.typing.pandas.core.frame.DataFrame, iguanas.utils.typing.databricks.koalas.frame.DataFrame], y=None, sample_weight=None) -> Union[iguanas.utils.typing.pandas.core.frame.DataFrame, iguanas.utils.typing.databricks.koalas.frame.DataFrame]

Applies the set of rules to a dataset, X. If y is provided, the performance metrics for each rule will also be calculated.

Parameters
X : Union[PandasDataFrameType, KoalasDataFrameType]

The feature set on which the rules should be applied.

y : Union[PandasSeriesType, KoalasSeriesType], optional

The target column. Defaults to None.

sample_weight : Union[PandasSeriesType, KoalasSeriesType], optional

Record-wise weights to apply. Defaults to None.

Returns
Union[PandasDataFrameType, KoalasDataFrameType]

The binary columns of the rules.
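To make "binary columns of the rules" concrete, the sketch below applies two hypothetical rules to a small dataset held as a list of dictionaries, producing one 0/1 column per rule. When a target is supplied it also computes precision and recall per rule; those two metrics are chosen only as examples, and the data layout, rule definitions and helper name are assumptions rather than the library's behaviour.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, float]

# Hypothetical rules expressed as plain predicates on a row.
rules: Dict[str, Callable[[Row], bool]] = {
    "Rule_0": lambda r: r["amount"] > 100 and r["num_orders"] <= 3,
    "Rule_1": lambda r: r["amount"] > 500,
}


def apply_rules(X: List[Row], y: Optional[List[int]] = None):
    """Return one 0/1 column per rule, plus precision/recall per rule if y is given."""
    columns = {name: [int(rule(row)) for row in X] for name, rule in rules.items()}
    metrics = {}
    if y is not None:
        for name, col in columns.items():
            tp = sum(1 for flag, target in zip(col, y) if flag and target)
            flagged, positives = sum(col), sum(y)
            metrics[name] = {
                "precision": tp / flagged if flagged else 0.0,
                "recall": tp / positives if positives else 0.0,
            }
    return columns, metrics


X = [
    {"amount": 150, "num_orders": 1},
    {"amount": 50, "num_orders": 2},
    {"amount": 900, "num_orders": 7},
]
columns, metrics = apply_rules(X, y=[1, 0, 0])
```

Here Rule_0 flags only the first record and Rule_1 flags only the third, so each column contains a single 1.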