mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Lukáš Vlček <lukas.vl...@gmail.com>
Subject Re: Mahout GSoC 2010 proposal: Association Mining
Date Fri, 09 Apr 2010 11:26:46 GMT
Robin,

I think it does not make sense for me to catch with GSoC timeline now as I
am quite busy with other stuff. However, I will develop the proposal for
Association Mining (or GUHA if you like) and keep this discussion going on.
I am really interested in contributing some implementation to Mahout but as
of now the GCoS timeline is not of any help to me.

Let me look at this in detail and I will get back to mahout community with
more details.

As for the use cases for association mining there can be find a lot examples
in literature. When it comes to missing or negative attributes of the data
(of the transaction) I think there can be a lot of examples as well. One
example would be analysis of click stream, where you can learn that those
people visiting some negative comment on product blog never enter order
form. Not saying this is best example but in general this is the essence of
it. You simply need to take all possible values from the transaction into
account even if it is missing in the market basket. I also remember that
Simpson's paradox is often referred in connection with this issue. As for
GUHA the power of it is that it has well developed theory background. This
for example means that it stated formalized framework for various analysis
that have probably origin in psychology and similar kind of soft-science and
"association-like-functions" between data attributes can be expressed using
4ft table members and user thresholds.

The biggest challenge in implementing this would be the fact that the
analysis have to deal with all the data (not just the most frequent
patterns) and combinations. It is very resource expensive.

I am thinking of starting a wiki page for Mahout about association mining
... this could help, what do you think?

Regards,
Lukas

On Tue, Apr 6, 2010 at 12:01 AM, Robin Anil <robin.anil@gmail.com> wrote:

> PS: Current TopK FPGrowth is pretty tightly coupled. But it can be easily
> refactored out or even a vanilla implementation of FPGrowth is not so
> difficult to re-create by re-using the existing methods.
>
> Robin
>
>
> On Tue, Apr 6, 2010 at 3:29 AM, Robin Anil <robin.anil@gmail.com> wrote:
>
> > Hi Lukas,
> >                Sorry for being late to getting back to you on this.
> > Association rule mining is a great addition to FPGrowth. I am not sure I
> > understand GUHA method well but then again I understood Ted's LLR after
> some
> > deep reading. Could you put up an interesting example to help us
> understand
> > this method. Maybe starting from a transaction of shopping cart item ? A
> > great demo is big plus for a GSOC project.
> >
> > Robin
> >
> >
> > On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček <lukas.vlcek@gmail.com
> >wrote:
> >
> >> Hello,
> >>
> >> I would like to apply for Mahout GSoC 2010. My proposal is to implement
> >> Association Mining algorithm utilizing existing PFPGrowth implementation
> (
> >> http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html).
> >>
> >> As for the Assoiciation Mining I would like to implement a very general
> >> algorithm based on old and established method called GUHA [1]. Short and
> >> informative description of GUHA method can be found here [2]. Very
> >> exhaustive theoretical foundation can be found here [3]. Since the GUHA
> >> method has been developing from 60's and is very rich now and given the
> >> limited time during GSoC program it would be wise to reduce scope of
> >> initial
> >> implementation.
> >>
> >> There are two main topics that I would like to focus on during GSoC:
> >>
> >> 1) Enhancing and generalization of PFPGworth:
> >> Frequent pattern mining is usually part of each association maining
> task.
> >> In
> >> Mahout there is existing implementation of PFPGrowth algorithm. I would
> >> like
> >> to utilize this implementaion but it would be necessary to enhance or
> >> generalize it first. The goal here would be to keep current
> functionality
> >> and performance of PFPGrowth but allow more generalized input/output
> data
> >> and conditions. We will need to get frequencies of very rare patterns
> thus
> >> if it will show up that mining only top K is limiting factor then we
> would
> >> need to allow relaxation of this approach. Also we will need to take
> >> account
> >> on negative features (those missing in individual transaction). Negative
> >> features can be either directly supported by implementation of FP
> >> algorithm
> >> or it can be coded at the feature level (decision about what is better
> >> approach will be part of the GSoC program). It should be noted that for
> >> the
> >> GSoC program we will narrow scope of association mining antecedents and
> >> succedents to conjunctions of data features only.
> >>
> >> 2) API for custom association mining functions based on 4ft table:
> >> Association mining in GUHA sense means testing hypothesis on four-fold
> >> table
> >> (4ft) [see 2. item 5]. There has been proposed a lot of association
> >> functions for GUHA, some of them are based on statistical tests (for
> >> example
> >> Fischer test, Chi-square test), some are based on frequent analysis
> while
> >> not explicitly refering to statistical tests but in both cases all
> >> frequencies from four-fold table are needed. Some tests/functions do not
> >> require all frequencies, however; keeping this association mining
> >> implementation general means that we are targeting for all frequencies
> >> from
> >> 4ft. The goal here would be to provide implementation of few GUHA
> >> functions
> >> and design API for custom function based on 4ft table (if the custom
> >> function can be expressed using 4ft table frequencies then it should be
> >> very
> >> easy to implement it for association mining job).
> >>
> >> General use case scenario:
> >> Before the association mining task is executed one would have to decide
> >> which features can be on the left hand side of the rule (antecedents)
> and
> >> which on the right hand side of the rule (succedents). It may be
> practical
> >> to limit number of features on both sides (rules with many features may
> >> not
> >> be very useful). Then a specific test or function for the 4ft table will
> >> be
> >> specified with additional custom treasholds.
> >>
> >> Note: The terminology used in GUHA theory is not be unified. Various
> >> researches used different terminology. This may be a source of
> confusion.
> >>
> >> My background:
> >> I have been working as a full-time Java developer for 9 years.
> Currently,
> >> I
> >> am working for JBoss (thus being paid for working on open source). A few
> >> years ago I also started Ph.D. studies in Ostrava, Czech Republic and
> >> association mining is the topic I am focusing on right now. Implementing
> >> association mining in context of Mahout makes sense because it is one of
> >> the
> >> areas of data mining which is not yet covered in Mahout at all.
> MapReduce
> >> implementation can be one possible way how to tackle the challenge of
> >> mining
> >> association rules from large data.
> >>
> >> Regards,
> >> Lukas Vlcek
> >>
> >> [1]
> >>
> >>
> http://en.wikipedia.org/wiki/Association_rule_learning#GUHA_procedure_ASSOC
> >> [2] http://www.cs.cas.cz/coufal/guha/index.htm
> >> [3] http://www.uivt.cas.cz/~hajek/guhabook/index.html<http://www.uivt.cas.cz/%7Ehajek/guhabook/index.html>
> <
> >> http://www.uivt.cas.cz/%7Ehajek/guhabook/index.html>
> >>
> >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message