Here is the command I used to run PFPGrowth. I am still using only a single machine; I will be
setting up a Hadoop cluster soon.
$ hadoop jar mahout-examples-0.4-job.jar org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver
-i downloads-input -o reco-patterns-output -k 50 -method mapreduce
-g 10 -regex '[\ ]' -s 500
-----Original Message-----
From: ext Robin Anil [mailto:robin.anil@gmail.com]
Sent: Tuesday, November 09, 2010 1:01 PM
To: user@mahout.apache.org
Subject: Re: Deriving associations from frequent patterns
On Tue, Nov 9, 2010 at 11:20 PM, <praveen.peddi@nokia.com> wrote:
> Hi Anil,
> 1. I am not sure if I understand your answer to #1 (or were you asking
> me a question?). Could you please clarify? The sample patterns I gave are
> only a small subset of the output. I included only those two
> features for simplicity.
>
Oh. Never mind. Let me see
> 2. I am sending the gzipped sample transaction file (1M downloads) to
> your private email since I am not sure if I can attach files to the mailing list.
> Please check your email for the sample file.
>
> Praveen
>
> -----Original Message-----
> From: ext Robin Anil [mailto:robin.anil@gmail.com]
> Sent: Tuesday, November 09, 2010 12:40 PM
> To: user@mahout.apache.org
> Subject: Re: Deriving associations from frequent patterns
>
> On Tue, Nov 9, 2010 at 9:50 PM, <praveen.peddi@nokia.com> wrote:
>
> > Hello all,
> > I am new to Mahout. I have just started looking into Mahout to
> > replace our current FP-growth implementation with the parallel
> > FP-growth that Mahout provides, since we started having scalability
> > issues. I looked at the PFPGrowth documentation and noticed that it
> > only produces the top K frequent patterns but not the associations,
> > and what we need is associations. So I was thinking of implementing
> > a simple AssociationGenerator that takes the frequent patterns
> > output. However, I am not sure what the best way is to generate
> > associations from the frequent patterns produced by Mahout.
> >
> > I have the following sample output from mahout.
> >
> > Key: 46485: Value: ([46485],936), ([46705, 46485],355)
> > Key: 46705: Value: ([46705],2526)
> >
> > We are interested only in item sets of size 2, since we need only
> > associations with 1 antecedent and 1 consequent.
> >
> > I was planning to calculate associations with confidence as follows:
> > For each key above as A {
> >   for each two-item set as [A,C] {
> >     confidence(A->C) = support([A,C]) / support(A);
> >     add association (A, C, confidence(A->C)) to the list;
> >   }
> > }
> >
> > Keeping the above requirement and pseudocode in mind, my questions
> > are as follows:
> > 1. Is the above algorithm efficient?
> >
> You are running it over a set of top-K patterns. It's small, so it
> doesn't matter whether it's efficient or not.
>
> > 2. Under the first key, [46705, 46485] occurred 355 times, but why
> > is the same pattern not repeated under the second key? Because of
> > this, calculating confidence(46705 -> 46485) becomes difficult. As
> > you can see from the above code, I was planning to read the patterns
> > for each feature and calculate the confidence of every association
> > that has the feature as its antecedent. But when I read feature
> > 46705, I cannot calculate confidence(46705 -> 46485), since the item
> > set is not included with the feature.
> >
> Good question. I guess the partitioning is screwing this up, as there
> are other K-1 patterns in the list > 355. Can you give a sample to test?
>
> > 3. Has anyone implemented associations from the generated frequent
> > patterns?
> >
> Nope
>
> >
> >
> > Thanks
> > Praveen
> >
> >
>
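
For reference, the confidence calculation discussed in the thread above can be sketched in Python as follows. This is only a minimal sketch and not Mahout code: the function names parse_patterns and associations are made up for illustration, and the parser assumes the output lines look exactly like the two sample lines quoted above. It uses the standard definition confidence(A -> C) = support({A, C}) / support(A). Because the pair supports are collected across all keys before any confidence is computed, a pair that is listed under only one of its two items (as with [46705, 46485]) can still be used in both directions.

```python
import re

# Matches one "([items],support)" entry as it appears in the sample output.
# Assumption: the real output follows the same layout as the sample lines.
ENTRY_RE = re.compile(r"\(\[([^\]]*)\],\s*(\d+)\)")


def parse_patterns(lines):
    """Collect the support of every 1-item and 2-item set across all keys."""
    support = {}
    for line in lines:
        for items, count in ENTRY_RE.findall(line):
            itemset = frozenset(item.strip() for item in items.split(","))
            if len(itemset) <= 2:
                # The same item set may be listed under several keys;
                # its count is assumed to agree wherever it appears.
                support[itemset] = int(count)
    return support


def associations(support):
    """Return (antecedent, consequent, confidence) for every 2-item set."""
    rules = []
    for itemset, pair_support in support.items():
        if len(itemset) != 2:
            continue
        a, b = sorted(itemset)
        for antecedent, consequent in ((a, b), (b, a)):
            antecedent_support = support.get(frozenset([antecedent]))
            if antecedent_support:
                # confidence(A -> C) = support({A, C}) / support(A)
                confidence = pair_support / antecedent_support
                rules.append((antecedent, consequent, confidence))
    return rules


sample = [
    "Key: 46485: Value: ([46485],936), ([46705, 46485],355)",
    "Key: 46705: Value: ([46705],2526)",
]

for antecedent, consequent, confidence in associations(parse_patterns(sample)):
    print(f"{antecedent} -> {consequent}: {confidence:.4f}")
```

With the two sample lines, this yields one rule in each direction, 355/936 for 46485 -> 46705 and 355/2526 for 46705 -> 46485. A pair that was cut from every key's top-K list cannot be recovered this way, so the result is only as complete as the top-K output itself.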