mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: AdaptiveLogisticRegression.close() ArrayIndexOutOfBoundsException
Date Mon, 02 May 2011 18:01:08 GMT
On Mon, May 2, 2011 at 12:39 AM, Tim Snyder <tim@proinnovations.com> wrote:

> ...
> In your options a through c - I am not sure I understand the difference
> between (a) and (c).  Is (a) the current state (let it fail), and (c) a
> small fix to let things complete, but understand that it is probably not
> valuable?
>

The current state is that it fails silently, which is not acceptable.

I will think about how to make small training data work well.  It shouldn't
be too hard.

My guess is that this will look a bit more like a batch training interface,
but I am not sure yet.


> Assuming that I could do a previous processing step on the messages,
> similar to spam exclusion, to get to a 1 in 50 or 1 in 20 potential
> interesting msg content, I could develop a larger training dataset.
>

Good.  It is possible to build moderately good models with a few dozen
examples, but having so little training data commonly limits the
sophistication of the models you can build.


> With only 1 in 10,000 msgs of interest, I don't think I can get to a 10,000
> training set. Any recommendations on how to do this?
>

The key problem is that with a very low hit rate, you have to do a lot of
work to find positive examples.

The general technique you need is called active learning.  This is where the
model helps you find training data to hand tag.  There are two sub-problems.
One is finding the training data, and the other is dealing with the fact
that you now have a very strangely selected training set that isn't like the
real data.  The first problem is the key one in most practical situations.
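For concreteness, the finding-candidates step might look something like the
sketch below.  It uses uncertainty sampling (one common active learning
strategy, not the only one) against a Mahout SGD model; how you encode
messages into Vectors is assumed to happen elsewhere in your pipeline, and
the class name is just for illustration.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    // Sketch only: pick the unlabeled messages the current model is
    // least sure about and queue them for hand tagging.
    public class UncertaintySampler {
      // a model already trained on whatever labeled data you have so far
      private final OnlineLogisticRegression model;

      public UncertaintySampler(OnlineLogisticRegression model) {
        this.model = model;
      }

      // unlabeled: encoded messages; n: how many to hand tag this round
      public List<Vector> pickForTagging(List<Vector> unlabeled, int n) {
        List<Vector> candidates = new ArrayList<Vector>(unlabeled);
        // classifyScalar returns p(interesting); scores near 0.5 are
        // the ones the model is least certain about
        candidates.sort(Comparator.comparingDouble(
            (Vector v) -> Math.abs(model.classifyScalar(v) - 0.5)));
        return candidates.subList(0, Math.min(n, candidates.size()));
      }
    }

Each round, you hand tag what this returns, fold the new labels into the
training set, retrain, and repeat.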


> I am looking at
> Chapter 12 of MIA on clustering of Twitter msgs as a possible way of
> implementing an unsupervised learning for clustering. I would need to
> take this output and be able to discard those clusters (and resultant
> msgs) which are not of interest.
>

This is an excellent way to stratify your search.
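Once the clustering has run, stratifying is mostly bookkeeping.  A minimal
sketch, assuming you already have a cluster id per message from the Chapter
12 style run (the lookup structure here is hypothetical):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch: spread the hand-tagging budget across clusters so that
    // small but coherent groups of messages still get looked at.
    public class ClusterStratifiedSampler {
      // clusterIds.get(i) is the cluster assigned to messages.get(i)
      public static List<String> sample(List<String> messages,
                                        List<Integer> clusterIds,
                                        int perCluster) {
        Map<Integer, List<String>> byCluster = new HashMap<>();
        for (int i = 0; i < messages.size(); i++) {
          byCluster.computeIfAbsent(clusterIds.get(i),
                                    k -> new ArrayList<>())
                   .add(messages.get(i));
        }
        List<String> picked = new ArrayList<>();
        for (List<String> members : byCluster.values()) {
          // take the first few per cluster; shuffle first if your
          // input order is not already arbitrary
          picked.addAll(members.subList(
              0, Math.min(perCluster, members.size())));
        }
        return picked;
      }
    }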

You can also use the output from any early models that you build to guide
you.  Sort by score and judge
examples from many different score ranges.  Then re-run the training (but
keep all old training data, of
course).
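The sort-and-judge step could be as simple as the sketch below: bucket
unlabeled messages by model score and pull a few from every bucket, so the
new labels cover the whole score range rather than just the top.  Scores
are assumed to come from the same model as above.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: sample across the whole score range so re-training sees
    // confident hits, confident misses, and borderline cases alike.
    public class ScoreRangeSampler {
      // scores[i] is the model score (0..1) for unlabeled message i;
      // returns indices of messages to hand judge
      public static List<Integer> pickIndices(double[] scores,
                                              int perBucket) {
        int buckets = 10;  // deciles of the score range
        List<List<Integer>> byBucket = new ArrayList<>();
        for (int b = 0; b < buckets; b++) {
          byBucket.add(new ArrayList<Integer>());
        }
        for (int i = 0; i < scores.length; i++) {
          int b = Math.min((int) (scores[i] * buckets), buckets - 1);
          byBucket.get(b).add(i);
        }
        List<Integer> picked = new ArrayList<>();
        for (List<Integer> bucket : byBucket) {
          picked.addAll(bucket.subList(
              0, Math.min(perBucket, bucket.size())));
        }
        return picked;
      }
    }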
