mahout-user mailing list archives

From Ken Krugler
Subject Re: Multiclass classifier - hunting for a small one
Date Fri, 14 Jan 2011 16:56:21 GMT
Hi Lance,

On Jan 14, 2011, at 1:04am, Lance Norskog wrote:

> Here's the use case: deciding the language of a mid-size document like
> a newspaper article or a technical report. The problem has been
> tackled fairly successfully by pulling 2- and 3-letter sequences from
> bodies of text in various languages, and comparing the set of 2- and
> 3-letter sequences from the document.
> This would be for text indexing in Lucene, so it should be
> memory-resident. The implementation should have a small dataset. It is
> better if the computation is front-loaded, like video compression: the
> heavy lifting happens in a model preparation phase, and then working
> from the model is fast. A confidence rating for the classification
> would be nice.
> Open license (Apache-compatible) code would be great, as are
> non-patented algorithms.
> Any suggestions?
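
For reference, the approach described in the quoted message (character 2- and 3-grams, a model built up front, a fast lookup at classification time, and a confidence value) can be sketched roughly as follows. This is a minimal illustration only: the function names, the add-one smoothing, and the softmax-style confidence are assumptions made for the example, not something taken from Tika, Mahout, or Lucene.

```python
import math
from collections import Counter


def ngrams(text, sizes=(2, 3)):
    """Yield overlapping character n-grams of the given sizes."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    for n in sizes:
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def train(samples):
    """Front-loaded step: `samples` maps language code -> training text.

    Returns (profiles, vocab_size), where profiles maps
    language -> (n-gram Counter, total n-gram count).
    """
    profiles = {}
    vocab = set()
    for lang, text in samples.items():
        counts = Counter(ngrams(text))
        profiles[lang] = (counts, sum(counts.values()))
        vocab.update(counts)
    # One extra slot stands in for n-grams never seen in training.
    return profiles, len(vocab) + 1


def classify(model, text):
    """Return (best_language, confidence in (0, 1])."""
    profiles, vocab_size = model
    grams = list(ngrams(text))
    scores = {}
    for lang, (counts, total) in profiles.items():
        # Sum of add-one smoothed log-probabilities (assumed scoring rule).
        scores[lang] = sum(
            math.log((counts[g] + 1) / (total + vocab_size)) for g in grams
        )
    # Turn log scores into a normalized confidence (softmax over languages).
    top = max(scores.values())
    weights = {lang: math.exp(s - top) for lang, s in scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


# Toy training data; a real model would need far more text per language.
model = train({
    "en": "the quick brown fox jumps over the lazy dog and then the other thing with that",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann das andere ding",
})
language, confidence = classify(model, "the thing that")
```

The expensive part (counting n-grams over the training text) happens once in `train`; `classify` only does dictionary lookups, so the model can stay memory-resident.
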

I can't currently recommend the language detector in Tika - see

  for details.

That issue has a link to a review of other options, though it's
slightly dated.

Want to code up the LLR-based approach that Ted described in the PDF
attached to the issue? :)

That would be a killer contribution...
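
The PDF is not reproduced in this message, so the following is only a sketch of the standard log-likelihood ratio (G²) statistic for a 2x2 table of counts, applied here to character n-grams. It is not necessarily the method described in that PDF. The helper `ngram_llr` and its parameter names are hypothetical, chosen for this example.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table of counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def ngram_llr(count_in_lang, total_in_lang, count_elsewhere, total_elsewhere):
    """Hypothetical helper: how strongly one n-gram's frequency in a language
    differs from its frequency in all other languages combined."""
    return llr(
        count_in_lang, total_in_lang - count_in_lang,
        count_elsewhere, total_elsewhere - count_elsewhere,
    )


# An n-gram seen 100 times per 1000 in one language but only 10 times per
# 1000 elsewhere scores much higher than one with nearly equal rates.
distinctive = ngram_llr(100, 1000, 10, 1000)
unremarkable = ngram_llr(100, 1000, 90, 1000)
```

One possible use, again an assumption, is to keep only the highest-scoring n-grams per language, which keeps the in-memory model small.
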

-- Ken

Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g
