lucene-java-user mailing list archives

From Kevin Burton
Subject Re: NGram Language Categorization Source
Date Sun, 21 Aug 2005 20:45:15 GMT
> Erhm... Not to rain on your parade, but Googling for "ngram java" gives
> a lot of hits. and also
> "languageidentifier" in Nutch are two examples of Open Source Java
> implementations. Each can be used with Lucene.

I think I've played with ngramj and found it very lacking. 
I haven't played with 'languageidentifier' in Nutch ... 

> A lot depends on the reference profiles (which in turn depend on the
> quality of your training corpus - in this case, your corpus is not the
> best choice, because each text contains a lot of foreign words).

I realize that my corpus isn't the best.  That's one of the reasons
I've open-sourced it.  The main improvement in ngramcat (my code) is
that if the result isn't obvious we throw an Exception, so
theoretically we won't see any false positives unless the language
categorization is WAY off.
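
To make the "throw if it isn't obvious" idea concrete, here is a minimal sketch. It is not ngramcat's actual code: the class names, the `pick` method, and the margin rule are all hypothetical, and it assumes each language already has a distance to the input where lower means a closer match.

```java
import java.util.Map;

// Hypothetical exception type, used for illustration only (not ngramcat's API).
class AmbiguousLanguageException extends RuntimeException {
    AmbiguousLanguageException(String message) {
        super(message);
    }
}

class ConfidentPicker {
    /**
     * Returns the language with the smallest distance, but only when it beats
     * the runner-up by at least minMargin. Otherwise it throws, so callers
     * never receive a guess that was too close to call.
     */
    static String pick(Map<String, Double> distances, double minMargin) {
        String best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        double secondDist = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, Double> e : distances.entrySet()) {
            double d = e.getValue();
            if (d < bestDist) {
                // The previous best becomes the runner-up.
                secondDist = bestDist;
                bestDist = d;
                best = e.getKey();
            } else if (d < secondDist) {
                secondDist = d;
            }
        }
        if (best == null) {
            throw new AmbiguousLanguageException("no candidate languages");
        }
        // With a single candidate, secondDist stays infinite and the check passes.
        if (secondDist - bestDist < minMargin) {
            throw new AmbiguousLanguageException(
                "best match '" + best + "' is not clearly ahead of the runner-up");
        }
        return best;
    }
}
```

The trade-off is that a larger margin means fewer false positives but more documents rejected as undecidable.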

> It was
> also found that the way you create ngram profiles (e.g. with or without
> surrounding spaces, single length or mixed length) affects the LI
> performance. 


I haven't benchmarked it but I'd be interested in any suggestions you have.
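
For reference, the profile-creation choices mentioned in the quote (with or without surrounding spaces, single length or mixed length) could be sketched roughly like this. The class and method names are hypothetical, and using "_" as the padding character is an assumption made for readability; this is not how ngramcat or ngramj necessarily does it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class NGramExtractor {
    /**
     * Extracts character n-grams of every length from minLen to maxLen
     * (set minLen == maxLen for single-length profiles). When padWords is
     * true, each word is wrapped in '_' so that word-boundary n-grams
     * such as "_t" and "e_" are produced as well.
     */
    static List<String> ngrams(String text, int minLen, int maxLen, boolean padWords) {
        List<String> out = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty()) {
                continue; // leading whitespace yields one empty token
            }
            String token = padWords ? "_" + word + "_" : word;
            for (int n = minLen; n <= maxLen; n++) {
                for (int i = 0; i + n <= token.length(); i++) {
                    out.add(token.substring(i, i + n));
                }
            }
        }
        return out;
    }
}
```

For example, `ngrams("the", 2, 2, false)` yields `th, he`, while `ngrams("the", 2, 2, true)` yields `_t, th, he, e_`. Benchmarking would mean building profiles under each setting and comparing accuracy on the same test set.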

> For documents with mixed languages it was also found that
> methods, which combine ngrams with stopwords, work better.

Hm... interesting. Where?  Is there a URL I can read?

> Additionally, simple methods based on cosine similarity (or delta
> ranking) don't give correct results for documents with mixed languages.
> In such cases input texts are chunked, and each chunk is analyzed
> separately, and then the scores are combined... etc, etc... millions of
> ways you can do this - and of course no method is perfect. :-)

Yes.  We don't handle the mixed language case very well.  The chunking
method is something I wanted to approach.
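
A rough sketch of the chunking idea, under some loud assumptions: chunks are fixed-size runs of whitespace-separated words, the per-chunk classifier is passed in as a function, and the "combined score" is just a vote count per language. All names here are hypothetical; this is one of the many possible ways to do it, not an existing implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class ChunkedIdentifier {
    /** Splits text into chunks of at most chunkWords whitespace-separated words. */
    static List<String> chunk(String text, int chunkWords) {
        if (chunkWords <= 0) {
            throw new IllegalArgumentException("chunkWords must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks; // nothing to classify
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start += chunkWords) {
            int end = Math.min(start + chunkWords, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
        }
        return chunks;
    }

    /**
     * Classifies every chunk separately and counts how many chunks were
     * assigned to each language. A mixed-language document then shows up
     * as several languages with non-trivial counts.
     */
    static Map<String, Integer> votes(String text, int chunkWords,
                                      Function<String, String> classifier) {
        Map<String, Integer> counts = new HashMap<>();
        for (String c : chunk(text, chunkWords)) {
            counts.merge(classifier.apply(c), 1, Integer::sum);
        }
        return counts;
    }
}
```

Any single-language classifier can be plugged in as the `classifier` argument. Weighting chunks by length or by classifier confidence would be an obvious refinement over plain vote counting.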

> So, there is still a lot to do in this area, if you come up with some
> unique way of improving LI performance...

Maybe I'm being dense, but what is LI performance?



 Kevin A. Burton, Location - San Francisco, CA
      AIM/YIM - sfburtonator,  Web -
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
