lucene-java-user mailing list archives

From karl wettin <karl.wet...@gmail.com>
Subject Re: Language detection library
Date Fri, 04 May 2007 05:54:55 GMT

On 4 May 2007 at 02:20, Chris Lu wrote:

> I suppose if a document is indexed as English or French,
> when users search the document,
> we need to parse the query as English or French also?

If you do any language-specific token analysis such as stemming, yes.
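
A minimal sketch of why the query has to go through the same
language-specific analysis as the indexed text. The suffix rules below
are a deliberately crude, hypothetical stand-in for a real stemmer, and
the class and method names are made up for illustration; in Lucene the
equivalent would be using the same language-specific Analyzer at index
time and at query time.

```java
import java.util.Locale;

// Toy illustration only: these suffix rules are NOT real stemmers.
class QueryLanguageDemo {

    // Hypothetical crude "stemmer": strips at most one suffix,
    // chosen from a small per-language list.
    static String stem(String token, String lang) {
        String t = token.toLowerCase(Locale.ROOT);
        String[] suffixes;
        if (lang.equals("en")) {
            suffixes = new String[] {"ing", "es", "s"};
        } else if (lang.equals("fr")) {
            suffixes = new String[] {"ements", "ement", "es", "s"};
        } else {
            suffixes = new String[0]; // unknown language: no stemming
        }
        for (String s : suffixes) {
            // Keep at least three characters of the word.
            if (t.length() > s.length() + 2 && t.endsWith(s)) {
                return t.substring(0, t.length() - s.length());
            }
        }
        return t;
    }

    // A term matches only if both sides reduce to the same stem.
    static boolean matches(String indexedWord, String indexLang,
                           String queryWord, String queryLang) {
        return stem(indexedWord, indexLang).equals(stem(queryWord, queryLang));
    }

    public static void main(String[] args) {
        // Indexed as English: "searches" is stored as "search".
        System.out.println(matches("searches", "en", "searching", "en")); // true
        // Same query word parsed as French is left as "searching".
        System.out.println(matches("searches", "en", "searching", "fr")); // false
    }
}
```

The second call misses only because the query was analysed with the
wrong language, not because the document lacks the word.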

Detecting the language of such short texts is tricky, though.
You might want to introduce more dimensions in the classifier: user
location, user locale, etc. Perhaps you want to store stemmed data
in language-specific fields. It might also be a good idea to place an
initial query, re-classify to one of the top n scoring languages,
and then replace the query.
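
A rough sketch of the "more dimensions" idea, assuming the text
classifier returns one score per language and that the user locale can
be turned into a prior weight per language. Everything here is
hypothetical: the class and method names, the numbers, and the
body_<lang> field naming are made up for illustration and do not come
from LUCENE-826 or the Lucene API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class QueryLanguageGuess {

    // Multiply each text-based score by the prior for that language
    // and return the language with the highest product.
    // Returns null if textScores is empty.
    static String pick(Map<String, Double> textScores,
                       Map<String, Double> prior,
                       double defaultPrior) {
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Double> e : textScores.entrySet()) {
            double p = prior.getOrDefault(e.getKey(), defaultPrior);
            double combined = e.getValue() * p;
            if (combined > bestScore) {
                bestScore = combined;
                best = e.getKey();
            }
        }
        return best;
    }

    // Assumed naming convention for per-language stemmed fields.
    static String fieldFor(String lang) {
        return "body_" + lang;
    }

    public static void main(String[] args) {
        // A short query gives the classifier little to go on.
        Map<String, Double> textScores = new LinkedHashMap<>();
        textScores.put("en", 0.40);
        textScores.put("fr", 0.35);
        textScores.put("sv", 0.25);

        // Assumed prior for a user whose locale is French.
        Map<String, Double> localePrior = new LinkedHashMap<>();
        localePrior.put("fr", 0.7);
        localePrior.put("en", 0.2);

        String lang = pick(textScores, localePrior, 0.1);
        System.out.println(lang + " -> search field " + fieldFor(lang));
    }
}
```

On text alone English would win narrowly; the locale prior tips the
decision to French, and the query is then run against the French
stemmed field.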

The easiest way out is to simply ask the user what language they want
to search in. That also seems to be the most common approach.


> On 5/3/07, karl wettin <karl.wettin@gmail.com> wrote:
>>
>> On 3 May 2007 at 22:06, Mordo, Aviran (EXP N-NANNATEK) wrote:
>>
>> > Does anyone know of a good language detection library that can
>> > detect what language a document (text) is in?
>>
>> I posted this some time back:
>>
>> https://issues.apache.org/jira/browse/LUCENE-826
>>
>> It is a bit proof-of-concept-ish, but it does the job well if you ask
>> me. It uses Weka (GPL) and requires at least 150 characters of text
>> for the result to be trusted.
>>
>>
>> --
>> karl
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org

