Mining small, useful, and high-quality sets of patterns has recently become an important topic in data mining. The standard approach is to first mine many candidates, and then to select a good subset. However, the pattern explosion yields so many candidates that, with post-processing, it is virtually impossible to analyse dense or large databases in any detail.
We introduce Slim, an any-time algorithm for mining high-quality sets of itemsets directly from data. We use the Minimum Description Length (MDL) principle to identify the best set of itemsets as the set that describes the data best. To approximate this optimum, we iteratively use the current solution to determine which itemset would provide the most gain, estimating quality using an accurate heuristic. Without requiring a pre-mined candidate collection, Slim is parameter-free in both theory and practice.
Experiments show we mine high-quality pattern sets; while evaluating orders-of-magnitude fewer candidates than our closest competitor, Krimp, we obtain much better compression ratios, closely approximating the locally-optimal strategy. Classification experiments independently verify that we characterise data very well.
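To make the iterative search described above concrete, the following is a minimal Python sketch of MDL-guided itemset selection. It is not the Slim implementation: the encoding is simplified, candidates are scored by computing the total encoded size exactly rather than with Slim's gain heuristic, and no pruning is done. The names `cover`, `total_size`, `order`, and `mine` are hypothetical and chosen for illustration only.

```python
from itertools import combinations
from math import log2


def cover(transaction, code_table):
    """Greedily cover a transaction with non-overlapping itemsets, in table order."""
    remaining = set(transaction)
    used = []
    for itemset in code_table:
        if itemset <= remaining:
            used.append(itemset)
            remaining -= itemset
    return used


def total_size(db, code_table, singleton_len):
    """Simplified two-part MDL score in bits: L(CT) + L(D | CT)."""
    usage = {s: 0 for s in code_table}
    for t in db:
        for s in cover(t, code_table):
            usage[s] += 1
    total_usage = sum(usage.values())
    size = 0.0
    for s, u in usage.items():
        if u == 0:
            continue  # unused itemsets get no code and cost nothing here
        code_len = -log2(u / total_usage)
        # Data cost: u uses of the code. Model cost: the code itself plus
        # the itemset spelled out in singleton codes.
        size += u * code_len + code_len + sum(singleton_len[i] for i in s)
    return size


def order(code_table, db):
    """Assumed cover order: longer itemsets first, then higher support."""
    def support(s):
        return sum(1 for t in db if s <= t)
    return sorted(code_table, key=lambda s: (-len(s), -support(s), sorted(s)))


def mine(db):
    """Start from singletons; repeatedly add the pairwise union that compresses best."""
    db = [frozenset(t) for t in db]
    items = sorted({i for t in db for i in t})
    counts = {i: sum(1 for t in db if i in t) for i in items}
    n = sum(counts.values())
    singleton_len = {i: -log2(counts[i] / n) for i in items}
    code_table = order([frozenset([i]) for i in items], db)
    best = total_size(db, code_table, singleton_len)
    improved = True
    while improved:
        improved = False
        best_trial = None
        # Candidates are built from the current solution, not pre-mined.
        for a, b in combinations(code_table, 2):
            cand = a | b
            if cand in code_table or not any(cand <= t for t in db):
                continue
            trial = order(code_table + [cand], db)
            size = total_size(db, trial, singleton_len)
            if size < best - 1e-9:
                best, best_trial = size, trial
        if best_trial is not None:
            code_table, improved = best_trial, True
    return code_table, best


if __name__ == "__main__":
    toy_db = [{"a", "b", "c"}] * 10 + [{"d"}] * 3
    table, bits = mine(toy_db)
    print([sorted(s) for s in table], round(bits, 2))
```

On the toy database in the `__main__` block, the sketch first merges two of the co-occurring items and then extends that pair to the full itemset `{a, b, c}`, because each step lowers the total encoded size. Scoring every candidate exactly, as done here, is what becomes expensive on real data; replacing that step with a cheap estimate is the kind of shortcut the abstract refers to.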
Koen Smets and Jilles Vreeken. Slim: Directly Mining Descriptive Patterns. In Proceedings of the SIAM International Conference on Data Mining (SDM), pages 236–247. SIAM, 2012.
Public release: source code and binaries.
Our implementation of Slim is freely available for research purposes; we provide both the source code and binaries for GNU/Linux (x64) and Windows (x86 and x64). Please refer to the documentation in the package for installation/compilation details and usage hints.
Download the most recent public release of Slim here.