mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: n-grams for terms?
Date Thu, 07 Jan 2010 02:24:59 GMT
Hi Bogdan,

  Sorry about that, I kinda jumped into implementation details related to
what we've been discussing on the mahout-dev list - much of what I described
we don't yet have in mahout, and I was talking about some of the ideas on
how we can incorporate it in!

On Wed, Jan 6, 2010 at 5:53 PM, Bogdan Vatkov <bogdan.vatkov@gmail.com> wrote:

> Ok, to be honest I do not get all of this - still a newbie in this matter.
> I have some clarification questions:
>
> > The way I've done this is to take whatever unigram analyzer for
> > tokenization that fits what you want to do, wrap it in Lucene's
> > ShingleAnalyzer, and use that as the "tokenizer" (which now
> > produces ngram tokens as single tokens each),
>
> you mean to use the unigram analyzer only to feed the ShingleAnalyzer (as
> part of lucene?), so the first one is not natively connected to lucene
> directly but only produces input for lucene, so to say?
>

Mahout has lucene jars pulled in, and we use them for things like analyzing /
tokenizing text.  For example, included in the lucene jars we link to is the
StandardAnalyzer, which does tokenization, lower-casing, stop word removal,
and some cleanup.  It takes in your text and spits out a stream of tokens,
which for most Analyzer implementations are unigrams.  Also available in
lucene (in the contrib analyzers) is the ShingleAnalyzerWrapper, which wraps
another Analyzer, so you can do things like

  Analyzer myAnalyzer = new ShingleAnalyzerWrapper(new StandardAnalyzer(), 3);

(where "3" is just saying you only want up to 3-grams)

What you have at this point is an Analyzer instance which takes in raw
Reader instances (like StringReader), and spits out a stream of ngrams.
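
To make that concrete, here's a minimal sketch of what using that chained
analyzer looks like, assuming the lucene contrib analyzers jar is on the
classpath (TermAttribute and the no-arg StandardAnalyzer constructor are
version-dependent, so adjust for your Lucene):

  import java.io.StringReader;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;

  public class NGramDump {
    public static void main(String[] args) throws Exception {
      // StandardAnalyzer produces the unigrams, ShingleAnalyzerWrapper glues
      // adjacent ones together into shingles of up to 3 tokens.
      Analyzer analyzer = new ShingleAnalyzerWrapper(new StandardAnalyzer(), 3);

      TokenStream stream =
          analyzer.tokenStream("text", new StringReader("big data clustering"));
      TermAttribute term = stream.addAttribute(TermAttribute.class);

      // Prints the unigrams plus the 2- and 3-grams, e.g. "big", "big data",
      // "big data clustering", "data", "data clustering", "clustering".
      while (stream.incrementToken()) {
        System.out.println(term.term());
      }
      stream.close();
    }
  }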


>
>
> > and run that through the LLR ngram M/R job (which ends by sorting
> > descending by LLR score), and shove the top-K ngrams (and sometimes the
> > unigrams which fit some "good" IDF range) into a big bloom filter, which
> > is serialized and saved.
> >
>
> what are IDF and LLR?


IDF is "Inverse Document Frequency", a weighting scheme used in lucene and
elsewhere in IR and NLP which helps boost the effective weight of terms
which
are more rare (numerically, the IDF of a term which occurs in k out of N
documents is usually taken to be something logarithmic like log( N / k ) ).

LLR is the "Log-Likelihood Ratio", which for token n-grams, is a measure of
how statistically significant each n-gram is (ie. if tokens "A" and "B"
occur
with frequencies f(A) and f(B), the LLR of the bigram "A B" is a measure of
how much more frequently "A B" occurs relative to how much they
frequently they would occur if they were uncorrelated).  In some ways you
can look at IDF as LLR for unigrams.
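
If it helps to see the actual arithmetic, here's a toy sketch of Dunning's
log-likelihood ratio for a bigram, computed from a 2x2 table of counts (the
counts below are made up, and the little helper class is just for
illustration - it isn't the Mahout job itself):

  public class BigramLLR {

    // x * ln(x), with the convention that 0 * ln(0) = 0.
    private static double xLogX(long x) {
      return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Dunning's LLR (the G^2 statistic) for the bigram "A B", where
    //   k11 = # of "A B" bigrams,
    //   k12 = # of bigrams starting with A but not ending with B,
    //   k21 = # of bigrams ending with B but not starting with A,
    //   k22 = # of all other bigrams.
    public static double llr(long k11, long k12, long k21, long k22) {
      long n = k11 + k12 + k21 + k22;
      double rowEntropy = xLogX(n) - xLogX(k11 + k12) - xLogX(k21 + k22);
      double colEntropy = xLogX(n) - xLogX(k11 + k21) - xLogX(k12 + k22);
      double matEntropy =
          xLogX(n) - xLogX(k11) - xLogX(k12) - xLogX(k21) - xLogX(k22);
      return 2.0 * (rowEntropy + colEntropy - matEntropy);
    }

    public static void main(String[] args) {
      // Hypothetical counts: "new york" appears 100 times, "new" without
      // "york" 900 times, "york" without "new" 50 times, in 100,000 bigrams.
      System.out.println(llr(100, 900, 50, 98950));  // large => significant
      // IDF, for comparison: a term in 10 out of 1,000,000 documents gets
      // roughly log(1000000 / 10) ~= 11.5.
      System.out.println(Math.log(1000000.0 / 10));
    }
  }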


> do we have that in Mahout? I am only using the k-means algorithm from
> Mahout. How would you relate these? I lack some terminology though.
>

What I described is a pipeline taking text, tokenizing it and finding the
top statistically significant ngrams (you don't want all of them - most are
either fairly rare, or are just the combination of a common unigram and
another unigram), and then making Mahout SparseVector instances out of the
documents, which you can then feed to k-means.
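
As a rough sketch of that last step, the vectorizing might look something
like this (I'm using RandomAccessSparseVector here - substitute whichever
sparse Vector implementation your Mahout version has - and the dictionary
parameter is just a stand-in for however you map kept ngrams to indices):

  import java.io.StringReader;
  import java.util.Map;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;

  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;

  public class NGramVectorizer {

    // Turns one document into a sparse term-frequency vector over ngrams.
    // "dictionary" maps each ngram you decided to keep to a vector index.
    public static Vector vectorize(String text,
                                   Map<String, Integer> dictionary,
                                   Analyzer analyzer) throws Exception {
      Vector vector = new RandomAccessSparseVector(dictionary.size());
      TokenStream stream =
          analyzer.tokenStream("text", new StringReader(text));
      TermAttribute term = stream.addAttribute(TermAttribute.class);
      while (stream.incrementToken()) {
        Integer index = dictionary.get(term.term());
        if (index != null) {                 // skip ngrams we didn't keep
          vector.set(index, vector.get(index) + 1.0);
        }
      }
      stream.close();
      return vector;
    }
  }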

> and I could possibly do that one - write some small algorithm which finds
> n-grams and feeds the initial term vectors before the k-means task.
> Is what you explained close to what Ted suggested or are they different
> approaches?
>

I was basically describing what Ted was talking about, just in a bit more
intricate detail.  Wrap a ShingleAnalyzerWrapper around a StandardAnalyzer
from lucene and you've got your "little algorithm" to pull out ngrams which
can be fed to k-means.  You will just get a lot of noise if you don't do a
little bit of work to make sure you only have the "good" ngrams.

I would really like to have this whole pipeline in Mahout, in a totally
user-friendly way, but while the pieces exist, the whole pipeline hasn't
been put together quite yet (although you could certainly write it
yourself - we love patches! :) )

  -jake
