mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Neumayer <neuma...@idi.ntnu.no>
Subject Re: FPGrowth maybe not producing the right rules
Date Thu, 15 Apr 2010 06:49:37 GMT
> I have to clarify this as many have asked this before. Mahout's
> Implementation is  Top K FPGrowth that finds closed patterns

Ok, I didn't know that, thanks for the clarification.

There's a paper explaining exactly this (is it this one you used for
implementing maybe?):

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.1102

R

On Thu, 15 Apr 2010, Robin Anil wrote:

>
> if {1} - 3 is a pattern {1, 2}- 3 is another top pattern  of item 1
>
> {1, 2} - 3 is a closed pattern of {1} -3 as there is no information that the
> former gives that that the latter gives extra, this gives significant boost
> in speed for the Mahout FP-Growth algorithm. If the closed pattern is not
> there, then there should be a problem with implementation otherwise its an
> algorithm choice. I am forgetting the paper form which I implemented this. I
> will post that when I find it.
>
> Robin
>
>
> On Thu, Apr 15, 2010 at 4:03 AM, Robert Neumayer <neumayer@idi.ntnu.no>wrote:
>
>> Hi.
>>
>> I was trying to test the fpgrowth algorithm on a toy example and I kind of
>> fail
>> to get the correct results ... so it'd be nice if someone could take a look
>> at
>> my (admittedly very ad-hoc) code and tell me whether there's something
>> wrong with it. I'm running the mahout svn version.
>>
>> I tried to run the following example (input):
>>
>> 1 3 4
>> 2 3 5
>> 1 2 3 5
>> 2 5
>> 1 2 3 5
>>
>> With the following code:
>>  public static void main(String[] args) {
>>        int minSupport = 2;
>>        int maxHeapSize = 100;
>>
>>        String input = "inputfile";
>>        String output = "output";
>>
>>        FPGrowth<String> fp = new FPGrowth<String>();
>>        FileSystem fs = new RawLocalFileSystem();
>>        Configuration conf = new Configuration();
>>
>>        String pattern = "[\\ ]";
>>        try {
>>            fs = FileSystem.get(URI.create(output), conf);
>>
>>            SequenceFile.Writer writer = null;
>>            writer = new SequenceFile.Writer(fs, conf, new Path(output),
>> Text.class, TopKStringPatterns.class);
>>
>>            Charset encoding = Charset.forName("UTF-8");
>>
>>            List<Pair<String, Long>> generateFList = null;
>>            generateFList = fp.generateFList(new StringRecordIterator(new
>> FileLineIterable(new File(input), encoding, false),
>>                    pattern), minSupport);
>>
>>            StringRecordIterator transactions = null;
>>
>>            transactions = new StringRecordIterator(new FileLineIterable(new
>> File(input), encoding, false), pattern);
>>
>>
>>            List<Text> keyList = new LinkedList<Text>();
>>            List<TopKStringPatterns> valueList = new
>> LinkedList<TopKStringPatterns>();
>>
>>            StringOutputCollector<Text, TopKStringPatterns> collector = new
>> StringOutputCollector<Text, TopKStringPatterns>(
>>                    keyList, valueList);
>>            StringOutputConverter soc = new
>> StringOutputConverter(collector);
>>
>>            StatusUpdater updater = new StatusUpdater() {
>>                @Override
>>                public void update(String status) {
>>                    System.out.println(status);
>>                }
>>            };
>>
>>            fp.generateTopKFrequentPatterns(transactions, generateFList,
>> minSupport, maxHeapSize, null, soc, updater);
>>            writer.close();
>>            fs.close();
>>
>>            System.out.println("list.size: " + valueList.size());
>>            HashSet<List<String>> unique = new HashSet<List<String>>();
>>            for (int i = 0; i < valueList.size(); i++) {
>>
>>                System.out.println(keyList.get(i) + " / " +
>> valueList.get(i));
>>                Iterator<Pair<List<String>, Long>> iterator =
>> valueList.get(i).iterator();
>>                while (iterator.hasNext()) {
>>
>>                    unique.add(iterator.next().getFirst());
>>                }
>>            }
>>            // print a bit more clearly
>>            Iterator<List<String>> iterator = unique.iterator();
>>            while (iterator.hasNext()) {
>>                System.out.println(iterator.next());
>>            }
>>
>>        } catch (IOException e) {
>>            e.printStackTrace();
>>        }
>>    }
>>
>> I wrote another collector class (I know, that was probably not necessary
>> but I
>> wanted to use the API a bit):
>>
>> public class StringOutputCollector<K extends Writable, V extends Writable>
>> implements OutputCollector<K, V> {
>>
>>    private final List<K> keyList;
>>
>>
>>    private final List<V> valueList;
>>
>>    public StringOutputCollector(List<K> keyList, List<V> valueList) {
>>        this.keyList = keyList;
>>        this.valueList = valueList;
>>    }
>>
>>    @Override
>>    public final void collect(K key, V value) throws IOException {
>>        keyList.add(key);
>>        valueList.add(value);
>>    }
>>
>> }
>>
>> The output I get is the following:
>> 1 / ([1, 3],3), ([1, 2, 3, 5],2)
>> 5 / ([2, 5],4), ([2, 3, 5],3), ([1, 2, 3, 5],2)
>> 3 / ([3],4), ([2, 3, 5],3), ([1, 3],3), ([1, 2, 3, 5],2)
>> 2 / ([2, 5],4), ([2, 3, 5],3), ([1, 2, 3, 5],2)
>>
>> But I think there's more frequent itemsets to find in these transactions
>> (with
>> min support 2):
>>
>> {1}     3
>> {2}     4
>> {3}     4
>> {5}     4
>> {1, 2}  2
>> {1, 3}  3
>> {1, 5}  2
>> {2, 3}  3
>> {2, 5}  4
>> {3, 5}  3
>> {1, 2, 3}       2
>> {1, 2, 5}       3
>> {1, 3, 5}       2
>> {2, 3, 5}       3
>> {1, 2, 3, 5}    2
>>
>> Also, some of them are duplicates with the mahout code. I'm not sure
>> whether
>> it's supposed to be like that.
>>
>> I also tried another textbook example and that didn't produce the same
>> results
>> either.
>>
>> So am I doing something wrong with the topk selection (that's what I maybe
>> could think of as a stupid mistake of mine)?
>>
>> Any hints would be appreciated.
>>
>> R
>>
>


--
Robert Neumayer
Norwegian University of Science and Technology (NTNU)
Department of Computer and Information Science
Sem Sælands vei 7-9
NO-7491 Trondheim
http://www.idi.ntnu.no/~neumayer

Mime
  • Unnamed multipart/mixed (inline, None, 0 bytes)
View raw message