APRIORI implementation of Ferenc Bodon

Attention!!!
As a result of some month of developement a new Apriori implementation is available. It is faster, requires less memory, but most importantly it is fully pluggable, i.e you can change any part of it. The implementation is a part of the FIM template library. The run-time and memory need comparison with my previous implementation (which can be downloaded from this page) and with the other famous available implementations, is found here. If I change the Apriori, then I change the version found in the FIM template Library. Therefore, the version that can be downloaded from this page, is not updated any more and this page is no longer maintained.

APRIORI algorithm was originally proposed by Agrawal in "Fast Algorithms for Mining Association Rules" in 1994 to find frequent itemsets and association rules in a transaction database. Here you can download a fast, trie-based, command-line implementation of the APRIORI algorithm for Linux and Windows platforms. The code is written in standard C++ in object-oriented manner. It uses STL possibilities, if that does not deteriorate the efficiency.

The code can be freely used for research purposes.

If you use this program in a research paper then a citation to either of the following papers is welcome.

Ferenc Bodon, A Survey on Frequent Itemset Mining, Technical Report, Budapest University of Technology and Economic, 2006,
[PDF] [PS],[BibTex]
Ferenc Bodon, Surprising results of trie-based FIM algorithms, IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'04), in Bart Goethals and Mohammed J. Zaki and Roberto Bayardo editors, CEUR Workshop Proceedings, volume 90, Brighton, UK, 1. November 2004.
[PDF] [PS] [Slides] [BibTex]
Ferenc Bodon, A fast APRIORI implementation, IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03), Melbourne, Florida, USA, 2003.
[PDF] [PS][BibTex]

Download

Binaries:

to Linux (ver. 2.4.9, i386, gcc 3.x): apriori_linux.tar.gz
to Windows (ver. 2.4.7): apriori_windows.zip

The zip file contains the program and a configuration file (.apriori_config). Copy the configuration file to the folder where the apriori program is located. Attention! Version 2.4.0 and above does not support association rule mining. If you need association rules not just frequent itemsets download version 2.3.2 (it can be found in the previous versions section).

source:

Latest version (2.4.9)

apriori.tar.gz: Source code and documentation (date of issue: 11. March 2005.)

To compile in Unix systems do the following:

cd source; make; cd ..

this will result in an apriori program.

To get the windows binary Dev-C++5 and its built-in compiler (Mingw) were used.

Previous versions

Version 2.3.2 (date of issue: 16. August 2004.) source, linux binary, windows binary
Version 2.1.3 (date of issue: 2. March 2004.) apriori.zip (only for Linux platforms)
Version 1.4.8 (date of issue: 2. March 2004.) apriori.zip (only for Linux platforms)

Version 2.2.x (and later versions) applies a more object oriented approach than 2.0.x and 2.1.x. This does not result in a faster execution, but it is more flexible and less error-prone. Test results that show running time differences between version 2.2.0 and version 2.1.3 can be found here.

Version 2. applies a different representation of the trie. It is not necessary faster, but it is more object oriented, thus it is more flexible and less error-prone. Test results that show running time differences between version 1.4.7 and version 2.0.0 can be found here.

Documentation

On-line documentation
bodon.pdf and bodon_trie.pdf: Some theoretical remarks on the program.

Usage

To run apriori 3 parameters are mandatory:

The file that contain the transactions in market-basket format,
the output file the frequent itemsets are sent to,
support threshold.

For example you can type:

./apriori T40I10D100K.dat output.txt 0.003

If you want to mine association rules as well, then you can provide a 4^th parameter that gives the confidence threshold (this is not available in version 2.4.0 and above).
If your input data file is not in market-basket model, look around in the recode_reformat folder.

Test environment

The main purpose of this work is to provide a nice, easy-to-read implementation of APRIORI, from which other APRIORI-based algorithms (for example DHP, DIC, Toivonen's sampling algorithm, DF-APRIORI, MSApriori, APRIORIALL, ...) can easily be developed without to much effort. Besides the easy-to-read, easy-to-modify goals, the efficiency concerning both running time and memory need was also an important aspect. Every improvement step that could have effect on the efficiency was tested on many databases with many support threshold values. Here you can download the two bash scripts that run the programs and collect the running times and memory needs.

Script to test running time: test_running_time
Script to test memory need: test_memory. This script requires the program memusage!