mahout-user mailing list archives

From Robin Anil <robin.a...@gmail.com>
Subject Re: FP Growth Understanding
Date Mon, 15 Feb 2010 16:14:30 GMT
Hi Neal,
             I know there is repetition. I tried to stay true to the
original algorithm, which finds closed patterns and uses the longest
one.

Say 68 and 12 occur together 1000 times, and 68 12 17 also occurs 1000
times. Then there is no information that the former pattern gives you, so
you can remove it. Therefore you say that 68 12 17 is a closed pattern, and
all the patterns it encloses are removed.

Had 68 alone occurred 2000 times, it would no longer be enclosed by
68 12 17, so it would be kept as a pattern in its own right.
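
As a rough illustration of the rule above, here is a minimal Python sketch.
It is not Mahout's implementation; the function name closed_patterns and the
sample counts (taken from the hypothetical 1000/2000 example) are assumptions
made for illustration. A pattern is dropped when some longer pattern that
contains it has exactly the same support.

```python
def closed_patterns(patterns):
    """Keep only patterns that have no proper superset with equal support.

    `patterns` is a list of (itemset, support) pairs, mirroring the
    Pair<Itemset, support> shape described in this thread.
    """
    items = [(frozenset(itemset), support) for itemset, support in patterns]
    return [
        (sorted(itemset), support)
        for itemset, support in items
        # `itemset < other` is the proper-subset test on frozensets.
        if not any(itemset < other and support == other_support
                   for other, other_support in items)
    ]


# Hypothetical counts from the example above.
example = [([68], 2000), ([12, 68], 1000), ([12, 17, 68], 1000)]
result = closed_patterns(example)
print(result)
```

Here [12, 68] is removed because [12, 17, 68] has the same support, while
[68] survives because no longer pattern matches its count of 2000.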

Things could be made configurable by having a flag to remove enclosed
patterns whose support is within a percentage of the enclosing pattern's
support, or to mine only patterns > 3 items in length. These are tricky but
could be done.
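
A sketch of what such options might look like, in Python. Nothing here
exists in Mahout: the function filter_patterns and its parameters tolerance
and min_length are hypothetical names, and the sample data is the first five
patterns for key 68 quoted further down in this thread.

```python
def filter_patterns(patterns, tolerance=0.0, min_length=1):
    """Drop short patterns and patterns nearly duplicated by a superset.

    tolerance:  a pattern is dropped when some proper superset has a support
                within this fraction of the pattern's own support
                (0.0 means only exact ties are dropped).
    min_length: patterns with fewer items than this are dropped.
    Both parameters are illustrative assumptions.
    """
    items = [(frozenset(itemset), support) for itemset, support in patterns]
    kept = []
    for itemset, support in items:
        if len(itemset) < min_length:
            continue
        # A superset can never have higher support, so the gap is >= 0.
        absorbed = any(
            itemset < other and (support - other_support) <= tolerance * support
            for other, other_support in items
        )
        if not absorbed:
            kept.append((sorted(itemset), support))
    return kept


thread_patterns = [
    ([68], 90692),
    ([17, 68], 90683),
    ([12, 68], 90490),
    ([17, 12, 68], 90481),
    ([18, 68], 90291),
]

# With a 0.1% tolerance, [68] and [12, 68] are absorbed by longer patterns.
relaxed = filter_patterns(thread_patterns, tolerance=0.001)
print(relaxed)
```

With tolerance=0.0 all five patterns are kept, since no two supports tie
exactly, which is why the unfiltered output looks so repetitive.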

Robin


On Mon, Feb 15, 2010 at 9:34 PM, Neal Richter <nrichter@gmail.com> wrote:

> Grant:  Chapter 5 of Han and Kamber (Data Mining: Concepts and
> Techniques) details itemset mining and the FP-Growth algorithm.  Han is a
> co-inventor of it.
>
> There is a bit of repetition in the output compared to other itemset
> mining packages, though this structure is convenient for relational
> indexing by key.
>
> - Neal
>
> On Mon, Feb 15, 2010 at 6:49 AM, Robin Anil <robin.anil@gmail.com> wrote:
> > Ok.. A bit more background..
> >
> > An itemset is a subset of the items I1, I2, I3, ..., In,
> >
> > so [I2, I4, I7] is an itemset, and its support (the number of times it
> > appears in the dataset) is, say, Y.
> >
> > A Pattern is Pair<Itemset, support>
> >
> > Take a look at it in this format:
> >
> > 68:
> >     ([68],90692),
> >     ([17, 68],90683),
> >     ([12, 68],90490),
> >     ([17, 12, 68],90481),
> >     ([18, 68],90291)
> >
> > These are the top patterns containing 68, with their support in
> > descending order. For example, 68 and 12 occur together 90490 times.
> >
> > Robin
> >
> >
> > On Mon, Feb 15, 2010 at 6:27 PM, Grant Ingersoll <gsingers@apache.org> wrote:
> >
> >>
> >> On Feb 14, 2010, at 11:37 PM, Robin Anil wrote:
> >>
> >> > Each key is a feature, and each value is the top-K frequent patterns
> >> > in which the feature exists
> >>
> >> Still a bit confused.
> >> Given:
> >> Key: 68: Value: ([68],90692), ([17, 68],90683), ([12, 68],90490),
> >> ([17, 12, 68],90481), ([18, 68],90291), ([17, 18, 68],90282),
> >> ([12, 18, 68],90229), ([17, 12, 18, 68],90220), ([31, 68],89071),
> >> ([17, 31, 68],89062), ([12, 31, 68],88874), ([17, 12, 31, 68],88865),
> >> ([18, 31, 68],88681), ([17, 18, 31, 68],88672), ([12, 18, 31, 68],88619),
> >> ([17, 12, 18, 31, 68],88610), ([16, 68],87933),
> >>
> >> So, 68 is the feature in question.  That makes sense.  Then, what is the
> >> significance of the [] areas, as in [68],90692 or [17,12,68], 90481?
> >> Why all the repetition?
> >>
> >> -Grant
> >
>
