CORELS (Certifiable Optimal RulE ListS) is a custom discrete optimization technique for building rule lists over a categorical feature space. Our algorithm provides the optimal solution, with a certificate of optimality. By leveraging algorithmic bounds, efficient data structures, and computational reuse, we achieve several orders of magnitude speedup in time and a massive reduction of memory consumption. Our approach produces optimal rule lists on practical problems in seconds. This framework is a novel alternative to CART and other decision tree methods.
If you're not sure what a rule list is, head over to our rule list introduction before continuing.
CORELS is a supervised learning algorithm, trained on a dataset of n rule antecedents, where each antecedent captures some subset of m samples. These antecedents must be pre-mined - that is, the input to CORELS is not a list of samples with their associated features, but a list of rule antecedents, each represented by a bitvector of length m. The i'th bit of a rule's bitvector indicates whether that antecedent captures the i'th sample (1) or not (0).
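As a rough illustration of this representation (not CORELS's actual input format - the feature names and array layout here are made up), an antecedent's capture bitvector can be computed from a binary feature matrix like this:

```python
import numpy as np

# Toy binary feature matrix: m = 5 samples, 3 binary features.
# Columns (hypothetical): [age<32, sex=male, priors>3]
X = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
])

# An antecedent is a conjunction of binary features, e.g. (age<32 AND priors>3).
# Its bitvector has a 1 in position i if sample i satisfies every condition.
antecedent_columns = [0, 2]
capture = np.all(X[:, antecedent_columns] == 1, axis=1).astype(np.uint8)

print(capture)  # [1 0 0 0 1] -> this antecedent captures samples 0 and 4
```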
All features are binary, as is the classification of all the rules.
Since all the sample features as well as the rule list classifications must be binary, continuous values or multi-class features must be broken down into categories. For instance, if a particular data set contained samples of people, and one of its features was age, instead of having one continuous 'age' feature you would have to break it down into discrete categories that could then become binary features. If the range of ages in your training set was 0 - 64, you could reasonably break the age feature into 4 binary features: age < 16, age < 32, age < 48, and age < 64.
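A minimal sketch of that binarization, assuming a numpy array of raw ages (the threshold values just mirror the example above):

```python
import numpy as np

ages = np.array([3, 17, 29, 41, 55, 63])   # raw, continuous ages in the range 0 - 64

# One binary feature per threshold: 1 if age < threshold, else 0
thresholds = [16, 32, 48, 64]
age_features = np.stack([(ages < t).astype(np.uint8) for t in thresholds], axis=1)

print(age_features)
# [[1 1 1 1]
#  [0 1 1 1]
#  [0 1 1 1]
#  [0 0 1 1]
#  [0 0 0 1]
#  [0 0 0 1]]
```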
Once trained, CORELS outputs a rule list. If allowed to run unconstrained, CORELS will output the certifiably most accurate rule list that can be generated from the given set of rule antecedents over the given training set. However, finding the optimal rule list may be computationally infeasible with large datasets, so we provide an option to stop searching after a certain user-specified number of nodes of the search trie have been evaluated. For suboptimal rule lists, an option is also provided to calculate the upper bound on the remaining search space size.
CORELS is a branch-and-bound algorithm, and leverages several algorithmic bounds to prune a trie that represents the search space of all possible rule lists to be examined.
The search space of all possible rule lists is represented by a prefix tree, or trie. Assuming n antecedents, the root node of this trie has n children, one for each antecedent. Each of those children has (n - 1) children, and so on, forming a trie with n levels (not counting the root node). Each node in this trie represents a rule list prefix (with each node's classification being 0 or 1, depending only on which gives the more accurate prediction over the training data set), and the bounds we calculate to remove nodes operate on these prefixes.
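A minimal sketch of how the rules in a prefix could be labeled this way, assuming the bitvector representation from above (the helper name is hypothetical):

```python
import numpy as np

def label_prefix(prefix_captures, y):
    """Assign each rule in the prefix the label (0 or 1) that is most accurate
    over the samples it captures, considering only samples not already captured
    by an earlier rule in the prefix."""
    not_yet_captured = np.ones(len(y), dtype=bool)
    labels = []
    for capture in prefix_captures:
        captured_here = capture.astype(bool) & not_yet_captured
        # Majority label among the newly captured samples (default to 0 if none)
        label = int(y[captured_here].mean() >= 0.5) if captured_here.any() else 0
        labels.append(label)
        not_yet_captured &= ~captured_here
    return labels
```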
As with other supervised learning algorithms, CORELS defines an objective function for possible models (rule lists), and then seeks to minimize it. The objective of a rule list is simply the fraction of training samples that the rule list misclassifies, plus a factor that is proportional to the length of the rule list. By keeping track of the lowest objective found, it is possible to eliminate many rule lists without evaluating them at all, greatly speeding up the search.
The objective function is defined as follows:
R(d, x, y) = misc(d, x, y) + c * length(d)
Where:
- d is the rule list
- x is the training data features
- y is the training data classifications
- misc is the fraction of misclassified samples
- c is a user-defined constant (called the regularization constant). A larger value penalizes longer rule lists - it can be thought of as an accuracy penalty for each rule added to the list (if it is set to 0.01, for instance, every added rule would incur a penalty equal to misclassifying 1% of the samples)
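To make this concrete, here is a small sketch (using the same bitvector representation as above; the function and argument names are illustrative, not the library's API) that evaluates R(d, x, y) for a rule list:

```python
import numpy as np

def objective(rule_captures, rule_labels, default_label, y, c):
    """R(d, x, y) = misc(d, x, y) + c * length(d) for a rule list given as
    capture bitvectors, per-rule labels, and a default prediction."""
    predictions = np.full(len(y), default_label)
    not_yet_captured = np.ones(len(y), dtype=bool)
    for capture, label in zip(rule_captures, rule_labels):
        captured_here = capture.astype(bool) & not_yet_captured
        predictions[captured_here] = label      # first matching rule decides
        not_yet_captured &= ~captured_here
    misc = np.mean(predictions != y)            # fraction misclassified
    return misc + c * len(rule_captures)        # accuracy term + length penalty
```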
For each prefix p in the trie, CORELS also computes a lower bound on the objective of any rule list that begins with p:

b(p, x, y) = misc(p, x, y) + c * length(p)

Here misc(p, x, y) counts only the samples that the prefix's rules capture and misclassify; errors made by the default rule are not included, since rules added later could still correct them. If this lower bound is greater than the lowest objective found so far, then this prefix, along with every node in its subtree, can be removed from the trie, since any rule list beginning with that prefix must necessarily have an objective greater than the best objective found so far.
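A minimal sketch of the pruning test itself (the helper name and arguments are illustrative):

```python
def should_prune(prefix_misc, prefix_length, c, best_objective):
    """b(p, x, y) = misc(p, x, y) + c * length(p): if the bound already exceeds
    the best objective found so far, no extension of this prefix can beat it,
    so the prefix's entire subtree can be discarded."""
    lower_bound = prefix_misc + c * prefix_length
    return lower_bound > best_objective
```

In the full branch-and-bound search, a check like this runs each time a prefix is extended by one rule, which is what keeps the trie from ever growing to its full n-level size.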