From conflue...@apache.org
Subject [CONF] Apache Lucene Mahout > ParallelFrequentPatternMining
Date Fri, 28 May 2010 19:48:00 GMT
Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: ParallelFrequentPatternMining (http://cwiki.apache.org/confluence/display/MAHOUT/ParallelFrequentPatternMining)


Edited by Grant Ingersoll:
---------------------------------------------------------------------
h1. Frequent Pattern Mining
Mahout has a Top K Parallel FPGrowth implementation. It is based on the paper http://infolab.stanford.edu/~echang/recsys08-69.pdf
with some optimisations in mining the data.

Given a huge transaction list, the algorithm finds all unique features (field values) and eliminates
those features whose frequency in the whole dataset is less than minSupport. For the N remaining
features, we find the top K closed patterns for each, generating a total of NxK
patterns. The FPGrowth algorithm is a generic implementation: any Object type can denote
a feature. The current implementation requires you to use a String as the object type. You may
implement a version for any object by creating Iterators, Convertors and TopKPatternWritable
for that particular object. For more information please refer to the package org.apache.mahout.fpm.pfpgrowth.convertors.string

For example:
{code}
// Assumes input (file path), encoding, pattern (the splitter regex), minSupport,
// maxHeapSize (K) and writer (a SequenceFile.Writer) are already defined.
FPGrowth<String> fp = new FPGrowth<String>();
Set<String> features = new HashSet<String>();
fp.generateTopKStringFrequentPatterns(
    new StringRecordIterator(new FileLineIterable(new File(input), encoding, false), pattern),
    fp.generateFList(
        new StringRecordIterator(new FileLineIterable(new File(input), encoding, false), pattern),
        minSupport),
    minSupport,
    maxHeapSize,
    features,
    new StringOutputConvertor(new SequenceFileOutputCollector<Text, TopKStringPatterns>(writer)));
{code}
* The first argument is an iterator over the transactions; in this case it is an Iterator<List<String>>
* The second argument is the output of the generateFList function, which returns the frequent
items and their frequencies from the given transaction database iterator
* The third argument is the minimum support a pattern must have to be generated
* The fourth argument is the maximum number of patterns to be mined for each feature
* The fifth argument is the set of features for which frequent patterns have to be mined
* The last argument is an output collector which takes [key, value] pairs of Feature and TopK Patterns
in the format [String, List<Pair<List<String>, Long>>] and writes them to
the appropriate writer class, which takes care of storing the object, in this case in a SequenceFile
output format. A sketch of reading the output back follows this list.
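
To read the results back, the Text => TopKStringPatterns pairs can be iterated with Hadoop's SequenceFile.Reader. This is a minimal sketch, not part of the original example; the output path and the getPatterns() accessor on TopKStringPatterns are assumptions you should verify against your version:
{code}
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.Pair;
import org.apache.mahout.fpm.pfpgrowth.convertors.string.TopKStringPatterns;

// ...
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// Assumed location of one output file; adjust to your -o directory.
SequenceFile.Reader reader =
    new SequenceFile.Reader(fs, new Path("patterns/frequentpatterns/part-r-00000"), conf);
Text feature = new Text();
TopKStringPatterns patterns = new TopKStringPatterns();
while (reader.next(feature, patterns)) {
  // Each value holds the top K patterns for this feature with their support counts.
  for (Pair<List<String>, Long> p : patterns.getPatterns()) {
    System.out.println(feature + " => " + p.getFirst() + " (" + p.getSecond() + ")");
  }
}
reader.close();
{code}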

h2. Running Frequent Pattern Growth via command line

The command line launcher for string transaction data, org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver,
has other features, including specifying the regex pattern for splitting a string line of a
transaction into its constituent features.

Input files have to be in the following format:

<optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE...

Instead of a tab you could use , or |, since the default tokenization is done using the Java regex
pattern [,\t]*[,|\t][ ,\t]*.
You can override this parameter to parse your log files or transaction files.
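
To see what the default pattern does, here is a tiny standalone sketch (the regex string is copied verbatim from above; everything else is illustrative):
{code}
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitterDemo {
  public static void main(String[] args) {
    // Default splitter regex as documented above.
    Pattern splitter = Pattern.compile("[,\t]*[,|\t][ ,\t]*");
    // <doc id>TAB<tokens>; the tab is the delimiter, spaces are left alone.
    System.out.println(Arrays.toString(splitter.split("doc1\t12 35 48")));
    // Prints: [doc1, 12 35 48]
    System.out.println(Arrays.toString(splitter.split("12,35,48")));
    // Prints: [12, 35, 48]
  }
}
{code}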

Each line is a transaction; the FPGrowth algorithm mines the top K frequently occurring sets of items
and their counts from the given input data.
To run on the example datasets from http://fimi.cs.helsinki.fi/data/, choose, say, accidents.dat.gz
and download it to your Mahout folder:
{code}
mvn -e  exec:java   -Dexec.mainClass=org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver \
     -Dexec.args="-i accidents.dat.gz \
     -o patterns \
     -k 50 \
     -method sequential \
     -regex [\ ] \
     -s 2"
{code}

The minimum support parameter -s is the minimum number of times a pattern or a feature needs
to occur in the dataset for it to be included in the generated patterns. You can speed up
the process by using a larger value of s. There are cases where you will get fewer than k
patterns for a particular feature because the rest don't meet the minimum support criterion.

Note that the input to the algorithm can be an uncompressed file, a compressed gz file, or even
a directory containing any number of such files.
We modified the regex to use a space to split the tokens; note that the input regex string is escaped.
The output will be dumped to a SequenceFile in the frequentpatterns directory in Text=>TopKStringPatterns
format. You can use the "bin/mahout seqdumper" command to inspect the output file.  TODO FILL
IN MORE HERE.
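
For example, something like the following (a sketch; the seqdumper option name and the part-file naming may differ across Mahout versions, so check bin/mahout seqdumper --help):
{code}
bin/mahout seqdumper --seqFile patterns/frequentpatterns/part-r-00000
{code}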

h2. Running Parallel FPGrowth 

Running parallel FPGrowth is as easy as changing the flag to -method mapreduce and adding
the number-of-groups parameter, e.g. -g 20 for 20 groups. Put the accidents.dat.gz on HDFS
in a folder named accidents, as shown below.
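
A minimal sketch of staging the input on HDFS (using the standard hadoop fs commands; the directory name just has to match the -i argument below):
{code}
<HADOOP-BIN>/hadoop fs -mkdir accidents
<HADOOP-BIN>/hadoop fs -put accidents.dat.gz accidents/
{code}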

To run on a Hadoop cluster:

{code}
<HADOOP-BIN>/hadoop jar mahout-examples-xx.job org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver \
     -i accidents \
     -o patterns \
     -k 50 \
     -method mapreduce \
     -g 10 \
     -regex [\ ] \
     -s 2
{code}

Or, to demo the algorithm on a single machine:

{code}
bin/mahout fpg \
     -i accidents \
     -o patterns \
     -k 50 \
     -method mapreduce \
     -g 10 \
     -regex [\ ] \
     -s 2
{code}

Note that the accidents dataset has 340 unique features, so we chose -g 10 to split the transactions
across 10 shards, with 34 features mined from each shard. The number of features need not be exactly
divisible by g; the algorithm takes care of calculating the split. For better performance on large
datasets, try not to mine more than 20-25 features per shard, and adjust the number of groups accordingly.

The numGroups parameter in FPGrowthJob specifies the number of groups into which the transactions
have to be decomposed.
The numTreeCacheEntries parameter specifies the number of generated conditional FP-Trees to
be kept in memory so that subsequent operations do not have to regenerate them. Increasing this
number increases memory consumption but may improve speed up to a point. This
depends entirely on the dataset in question. A value of 5-10 is recommended for mining up
to the top 100 patterns for each feature.
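
As a sketch of how the two knobs fit together on the accidents example (assuming numTreeCacheEntries is exposed as the -tc option on FPGrowthDriver, which you should confirm via the driver's help output):
{code}
bin/mahout fpg \
     -i accidents \
     -o patterns \
     -k 100 \
     -method mapreduce \
     -g 15 \
     -tc 10 \
     -regex [\ ] \
     -s 2
{code}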
 
