mahout-user mailing list archives

From Suneel Marthi <suneel_mar...@yahoo.com>
Subject Re: Naive bayes and character n-grams
Date Thu, 10 Oct 2013 13:19:38 GMT
Dean,

Just a thought.

You should be able to create new language models (with LangDetect) if there's Wikipedia content
for the specific language. I had to do that in the past for Pashto and Malaysian.
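
To make the idea concrete, here is a rough sketch of what building a per-language model from sample text amounts to, in the spirit of this thread's subject (naive Bayes over character n-grams). This is not LangDetect's API or its profile format. The names char_ngrams and NgramLanguageModel are made up for illustration, the uniform prior and add-one smoothing are my assumptions, and the two training sentences are toy stand-ins for a real Wikipedia dump.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of `text`, space-padded at both ends."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageModel:
    """Multinomial naive Bayes over character n-grams (hypothetical helper)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # language -> Counter of n-gram occurrences
        self.totals = {}    # language -> total number of n-grams seen
        self.vocab = set()  # every n-gram seen in any language

    def train(self, language, text):
        """Add the n-gram counts of `text` to the model for `language`."""
        grams = char_ngrams(text, self.n)
        self.counts.setdefault(language, Counter()).update(grams)
        self.totals[language] = self.totals.get(language, 0) + len(grams)
        self.vocab.update(grams)

    def log_scores(self, text):
        """Log-likelihood of `text` per language (uniform prior assumed)."""
        vocab_size = len(self.vocab)
        grams = char_ngrams(text, self.n)
        scores = {}
        for language, counter in self.counts.items():
            denom = self.totals[language] + vocab_size
            # Add-one smoothing so unseen n-grams do not zero out a language.
            scores[language] = sum(
                math.log((counter[g] + 1) / denom) for g in grams
            )
        return scores

    def detect(self, text):
        """Return the language with the highest log-likelihood."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data; a real model would be trained on far more text.
model = NgramLanguageModel(n=3)
model.train("ca", "el gat menja peix i la noia llegeix un llibre a la platja")
model.train("cy", "mae'r gath yn bwyta pysgod ac mae'r ferch yn darllen llyfr ar y traeth")

print(model.detect("la noia llegeix un llibre"))
print(model.detect("mae'r ferch yn darllen"))
```

With a real corpus you would train on many thousands of sentences per language and probably mix several n-gram lengths, but the counting and scoring steps stay the same.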

On Thursday, October 10, 2013 8:16 AM, Dean Jones <dean.m.jones@gmail.com> wrote:
 
On 10 October 2013 12:46, Ted Dunning <ted.dunning@gmail.com> wrote:
> For language detection, you are going to have a hard time doing better than
> one of the standard packages for the purpose.  See here:
>
> http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
>

Thanks for the pointer, Ted. I'm a big fan of the Tika project; we use
it for content extraction already. For various reasons, though, we have
rolled our own language detector (mainly because neither of these
packages covers all of the languages we need to identify:
language-detection doesn't do Catalan, and Tika doesn't do Welsh).


Dean.