lucene-java-user mailing list archives

From Luca Rondanini <luca.rondan...@gmail.com>
Subject Re: Language Identifier with Lucene?
Date Mon, 24 Oct 2011 16:38:51 GMT
Mead,

It just depends on how many languages you feed into the system!

Collecting the data is not a huge problem: I'm using news websites in 19
languages! The quality of the content is usually high and they "talk" a
lot!

Watch out: the real problem is the encoding. You want to be sure that
every source is converted to the same one!
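
To illustrate the encoding point, here is a minimal sketch (not the code I
actually run) of converting each downloaded page to UTF-8 before it is fed to
the identifier. It assumes you already know the charset each site declares;
the class and method names (EncodingNormalizer, decodeStrict, toUtf8) are made
up for this example. Strict decoding is mostly useful for multi-byte charsets
such as UTF-8, since ISO-8859-1 accepts any byte sequence.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: the names here are illustrative, not from Lucene or Nutch.
public class EncodingNormalizer {

    /** Decodes raw bytes with the declared charset and fails on malformed input. */
    public static String decodeStrict(byte[] raw, Charset declared)
            throws CharacterCodingException {
        return declared.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw))
                .toString();
    }

    /** Re-encodes the page as UTF-8 so every training sample shares one encoding. */
    public static byte[] toUtf8(byte[] raw, Charset declared)
            throws CharacterCodingException {
        return decodeStrict(raw, declared).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Assumption: this page was served as ISO-8859-1.
        byte[] latin1 = "caff\u00e8".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

A page whose bytes do not match its declared charset throws a
CharacterCodingException instead of silently producing garbage, so it can be
skipped rather than polluting the language profiles.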

Hope this helps,
Luca

On Mon, Oct 24, 2011 at 3:29 AM, Mead Lai <laiqinyi@gmail.com> wrote:

> Luca,
>
> I would like to know: how many languages can your system identify?
> In my view, the difficult part of your system is how to collect samples of
> so many of the world's languages/characters as *one person*...
>
> Regards,
> Mead
>
>
> On Sun, Oct 23, 2011 at 1:27 AM, Petite Abeille <petite_abeille@me.com> wrote:
>
> >
> > On Oct 22, 2011, at 2:49 AM, Luca Rondanini wrote:
> >
> > > I usually use Nutch for this but, just for fun, I tried to create a
> > > language identifier based on Lucene only.
> >
> > Talking of which:
> >
> > Google's Compact Language Detector
> >
> >
> > http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html
> >
>
