A C++ Frequent Itemset Mining Template Library


Frequent Itemset Mining (FIM) is the most researched field of frequent pattern mining. Over one hundred FIM algorithms were proposed - the majority claiming to be the most efficient. To clarify this chaos and the contradictions, two FIMI competitions were organized. It not only resulted in some excellent implementations but the community also get better understanding (limits, drawbacks, advantages) of the algorithms. Although FIMI'03 and FIMI'04 raised higher the level of publication, it has not solved many problem and we still suffer from the lack of understanding. For more information about the curse of FIM click here.

The separate work of different research groups, however, cannot be so efficient than tightly cooperating groups that interchange their knowledge. The goal of this FIM environment is to support this tight cooperation by providing an environment that includes base library functions, like IO, coding, decoding, ... and some high level description of the most important algorithms, such as apriori, eclat, FP-growth together with some very efficient implementations of them. The environment also contains many test classes to test the solutions for specific problems.

Flexibility and plugability are achieved by a template-based solution, that does not prevent the compiler form code optimization (like virtual function-based solution), which results in codes that are competitive with the best solutions.

If you use this library in a research paper then a citation to either of the following papers is welcome.

  1. Balázs Rácz, Ferenc Bodon, Lars Schmidt-Thieme, Benchmarking Frequent Itemset Mining Algorithms: from Measurement to Analysis, ACM SIGKDD Workshop on Open Source Data Mining Workshop (OSDM'05), in Bart Goethals and Siegfried Nijssen and Mohammed J. Zaki editors, pages 36 - 45, Chicago, IL, USA. 2005.
    [PDF] [PS] [Slides] [BibTex]
  2. Ferenc Bodon, A Survey on Frequent Itemset Mining, Technical Report, Budapest University of Technology and Economic, 2006,
    [PDF] [PS],[BibTex]

Programs included

The library includes an some optimized input/output and coding/decoding classes, allocators, many well designed data structures (trie, Patricia-tree, ) database cachers, some very efficient Apriori, Eclat and FP-growth algorithms, an Apriori algorithm that finds frequent sequences of items and an association rule miner that uses an Apriori to find frequent itemsets. Also, there exist many test classes that aims to test the efficiency of differrent techniques, like transaction caching, dead-end pruning, equisupport pruning, candidate generation technique, etc.

Major programs included in the Library
Name Description arguments
compile command
main source file
fim This program includes the most important FIM algorithms, i.e. Apriori, Eclat, FP-growth There are 4 mandatory parameters:
  1. the type of algorithm to run, i.e. apriori, apriori-lowmem, fp-growth
  2. the file that contain the transactions in market-basket format,
  3. absolute support threshold.
  4. the output file the frequent itemsets are sent to,
There is one optional argument, which specifies the upper limit of the size of frequent itemsets to mine.
Example:
fim apriori ../data/kosarak.dat 1500 output.txt 8
make ../fim*
main.cpp
assoc_rule This is an association rule miner, that uses Apriori to find the frequent itemsets There are 5 mandatory parameters:
  1. The file that contain the transactions in market-basket format,
  2. absolute support threshold.
  3. confidence threshold
  4. lift threshold
  5. the output file the frequent itemsets are sent to,
Example:
assoc_rule ../data/kosarak.dat 1500 0.2 3 output.txt
make assoc_rule*
main-assoc-rule.cpp
apriori_simple A simplified Apriori implementation that omits many speed-up techniques. We intend this program for students or researchers, who would like to have an efficient but still easy-to-understand and easy-to-modify Apriori implementation. If you don't find the speed of this Apriori satisfactory, you can check the implementation in main.cpp. There are 3 mandatory parameters:
  1. the file that contain the transactions in market-basket format,
  2. absolute support threshold.
  3. the output file the frequent itemsets are sent to,
There is one optional argument, which specifies the upper limit of the size of frequent itemsets to mine.
Example:
./apriori-simple ../data/kosarak.dat 1500 output.txt 8
make apriori_simple*
main-apriori-simple.cpp
fsm An Apriori-based frequent item sequence miner that uses a trie to store the candidates. More information is found in the paper:

Ferenc Bodon, Trie-based APRIORI Implementation for Mining Frequent Itemsequences, ACM SIGKDD Workshop on Open Source Data Mining Workshop (OSDM'05), in Bart Goethals and Siegfried Nijssen and Mohammed J. Zaki editors, pages 56 - 65, Chicago, IL, USA. 2005.
[PDF] [PS][Slides][BibTex]
There are 4 mandatory parameters:
  1. the type of algorithm to run, i.e. apriori, apriori-noprune, apriori-intersectprune
  2. the file that contain the transactions in market-basket format,
  3. absolute support threshold.
  4. the output file the frequent itemsets are sent to,
There is one optional argument, which specifies the upper limit of the size of frequent itemsets to mine.
Example:
fsm apriori ../data/seq/kosarak2_10_4.dat 1500 output.txt 5
make ../fsm*
main-fsm.cpp
*Before compiling any program please run the cd src;make dep command.

Download

binaries:

For

source:

Before compiling please run the cd src;make dep command. The compile commands are given in the table above.

To get (update) your local copy of the documentation run make doc in the src directory. The documentation will be found under doc/html. Please note, that you need doxygen and graphviz installed to be able to generate the documentation.

Documentation

Test environment

Main developers

Useful links

Similar template library, called DMTL, was proposed by Hasan et al. The purpose of DMTL differs from the purpose of our library. The DTML is more pattern and application oriented, we concentrate on algorithms and data structures. DMTL is an excellent tool if we would like to mine a new type of patterns using the existing algorithms, and our library comes at hand if we would like to change a part of an existing algorithm. In DTML the algorithms are the building blocks, while in our library we disassemble the methods as much as it makes sense.

Todo list

Note, that the library is in developement phase. It will be extended by many new classes and functionalities, some interfaces will change, the documentation will be improved and a restructuring is also expected.

A more detailed todo list is found here.