mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Naive bayes and character n-grams
Date Thu, 10 Oct 2013 11:46:06 GMT
For language detection, you are going to have a hard time doing better than
one of the standard packages for the purpose.  See here:

http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html


On Thu, Oct 10, 2013 at 1:01 AM, Dean Jones <dean.m.jones@gmail.com> wrote:

> Hi Si,
>
> On 10 October 2013 07:59, <simon.2.thompson@bt.com> wrote:
> >
> > What do you mean by character n-grams? If you mean things like "&ab" or
> "ui2", then given that there are so few characters compared to words, is
> there a problem that can't be solved without a look-up table for n < y
> (where y < 4 or so)?
> >
> > Or are you looking at y > 4 or so? If so, do you run into the issue of a
> sudden explosion of the space?
> >
>
> Yes, just the tokens in a text broken up into sequences of their constituent
> characters. In my initial tests, language detection works well with n=3,
> particularly when the head and tail bigrams are included. So I need something
> to generate the required sequence files from my training data.
>
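For reference, the tokenization Dean describes above (character trigrams of each token, plus the head and tail bigrams) can be sketched roughly as follows. This is only an illustration of the n-gram extraction step, not Mahout's own tokenizer, and the function name is made up for the example:

```python
def char_ngrams(token, n=3):
    """Character n-grams of a token, plus its head and tail bigrams.

    A minimal sketch of the n=3 scheme described in the thread; the
    output of a function like this would feed the feature vectors
    written to the training sequence files.
    """
    # All contiguous n-character substrings of the token.
    grams = [token[i:i + n] for i in range(len(token) - n + 1)]
    # Head and tail bigrams, which the thread reports help accuracy.
    if len(token) >= 2:
        grams.append(token[:2])
        grams.append(token[-2:])
    return grams

# Example: char_ngrams("hello") yields
# ['hel', 'ell', 'llo', 'he', 'lo']
```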
