lucene-java-user mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: NGram Language Categorization Source
Date Mon, 22 Aug 2005 07:51:43 GMT
Kevin Burton wrote:

>>A lot depends on the reference profiles (which in turn depend on the
>>quality of your training corpus - in this case, your corpus is not the
>>best choice, because each text contains a lot of foreign words).
> 
> 
> I realize that my corpus isn't the best.  That's one of the reasons
> I've open sourced it.  The main improvement in ngramcat (my code) is
> that if the result isn't obvious we throw an Exception so
> theoretically we won't see any false positives unless the language
> categorization is WAY off.

That's also how other implementations do it: you need to set an
arbitrary threshold, and if none of the profiles scores above that
threshold, an "unknown" value is returned (or null, or an Exception
is thrown).
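
As a rough illustration of that threshold idea, here is a minimal sketch.
It is not the code of ngramcat or of any other existing implementation.
The class name ThresholdIdentifier, the use of trigram counts with cosine
similarity as the score, and the threshold value passed in are all
assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: score the input against each reference profile and
// return "unknown" (an empty Optional) when the best score is too low.
final class ThresholdIdentifier {
    private final Map<String, Map<String, Integer>> profiles = new HashMap<>();
    private final double threshold; // arbitrary cut-off, tuned by hand

    ThresholdIdentifier(double threshold) {
        this.threshold = threshold;
    }

    // Count character trigrams after lowercasing and collapsing
    // every run of non-letters into a single space.
    static Map<String, Integer> trigrams(String text) {
        String s = text.toLowerCase(Locale.ROOT)
                       .replaceAll("[^\\p{L}]+", " ")
                       .trim();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Build the reference profile for one language from a training sample.
    void train(String language, String sample) {
        profiles.put(language, trigrams(sample));
    }

    // Cosine similarity of two count vectors: 0 = nothing shared, 1 = identical.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    // Best-scoring language, or empty if no profile reaches the threshold.
    Optional<String> identify(String text) {
        Map<String, Integer> input = trigrams(text);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Integer>> p : profiles.entrySet()) {
            double score = cosine(input, p.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = p.getKey();
            }
        }
        return bestScore >= threshold ? Optional.ofNullable(best) : Optional.empty();
    }
}
```

Returning an empty Optional plays the role of the "unknown" value here;
throwing an Exception at the same point would give the behaviour Kevin
describes.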

> 
> 
>>It was
>>also found that the way you create ngram profiles (e.g. with or without
>>surrounding spaces, single length or mixed length) affects the LI
>>performance. 
> 
> 
> LI???
> 

LI = Language Identification. Sorry for the confusion.
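
To make the profile-building choices mentioned in the quoted text concrete
(with or without surrounding spaces, single length or mixed length), here
is a small sketch. The class NGramExtractor and its parameters are
hypothetical, and using "_" as the marker for a surrounding space is an
assumption made for readability.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing how padding and n-gram length change
// the set of n-grams generated from one word.
final class NGramExtractor {
    // minN == maxN gives single-length n-grams; minN < maxN gives mixed length.
    // pad == true adds a boundary marker ("_", assumed) on both sides.
    static List<String> ngrams(String word, int minN, int maxN, boolean pad) {
        String s = pad ? "_" + word + "_" : word;
        List<String> out = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                out.add(s.substring(i, i + n));
            }
        }
        return out;
    }
}
```

For the word "text", ngrams("text", 3, 3, false) yields tex, ext, while
ngrams("text", 3, 3, true) yields _te, tex, ext, xt_. The padded variant
keeps word-initial and word-final information that the unpadded one loses.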

> I haven't benchmarked it but I'd be interested in any suggestions you have.
> 
> 
>>For documents with mixed languages it was also found that
>>methods, which combine ngrams with stopwords, work better.
> 
> 
> Hm.. interesting.. where?  URL I can read?

Someone mentioned the Linguini paper, where they found that using "short
words" features performs about as well as using ngrams.

See also the following papers:

http://www.xrce.xerox.com/Publications/Attachments/1995-012/Gref---Comparing-two-language-identification-schemes.pdf
http://citeseer.ist.psu.edu/40861.html
http://www.xs4all.nl/~ajwp/langident.pdf

In general, using stop words works only for texts above a certain minimum
length (greater than what n-gram methods need), but above that length it
gives nearly 100% accuracy.
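
A minimal sketch of the stop-word approach, including the minimum-length
caveat, might look like the following. It is not taken from any of the
papers above. The class StopWordIdentifier, the tiny word lists used with
it, and the minTokens cut-off are all assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: pick the language whose stop words occur most often,
// and refuse to answer for inputs that are too short to judge.
final class StopWordIdentifier {
    private final Map<String, Set<String>> stopWords = new HashMap<>();
    private final int minTokens; // assumed minimum input length, in words

    StopWordIdentifier(int minTokens) {
        this.minTokens = minTokens;
    }

    void addLanguage(String language, String... words) {
        stopWords.put(language, new HashSet<>(Arrays.asList(words)));
    }

    Optional<String> identify(String text) {
        // Split on anything that is not a letter; drop empty tokens.
        String[] raw = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String[] tokens = Arrays.stream(raw).filter(t -> !t.isEmpty()).toArray(String[]::new);
        if (tokens.length < minTokens) {
            return Optional.empty(); // too short for stop words to be reliable
        }
        String bestLang = null;
        int best = 0, second = 0;
        for (Map.Entry<String, Set<String>> e : stopWords.entrySet()) {
            int hits = 0;
            for (String t : tokens) {
                if (e.getValue().contains(t)) {
                    hits++;
                }
            }
            if (hits > best) {
                second = best;
                best = hits;
                bestLang = e.getKey();
            } else if (hits > second) {
                second = hits;
            }
        }
        // Require a clear winner; a tie (including 0 vs 0) means "unknown".
        return best > second ? Optional.of(bestLang) : Optional.empty();
    }
}
```

With real stop-word lists the minimum length would need tuning on test
data; the point of the sketch is only that short inputs may contain no
stop words at all, which is where n-gram methods still have something to
work with.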


>>So, there is still a lot to do in this area, if you come up with some
>>unique way of improving LI performance...
> 
> 
> Maybe I'm being dense but what is LI performance?

Language Identification performance - in the sense that a given 
identifier "performs" better if it can correctly identify more 
languages, using shorter input text.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

