lucene-java-user mailing list archives

From Ivan Vasilev <ivasi...@sirma.bg>
Subject Re: Multy Language documents indexing
Date Fri, 23 Feb 2007 19:00:10 GMT
Thanks Erik,
Here is what I found while researching this problem. It might be helpful 
for someone :)
I will divide the problem of multi-language documents into several subproblems:
*1. Determining the language of text documents.
1.1. Determining the language of a document when the whole text is in a 
single language.*
On the Lucene mailing list I found the following links to sites which 
provide tools for this task:
http://odur.let.rug.nl/~vannoord/TextCat/
http://frank.spieleck.de/ngram/
I ran some tests with the first one: English and German texts were 
guessed correctly 100% of the time, but Russian texts 0% of the time (I 
used the windows-1251 encoding, which is claimed to be supported for 
Russian text recognition).
Link to the demo:
http://odur.let.rug.nl/~vannoord/TextCat/Demo/
Other similar tools are listed at:
http://odur.let.rug.nl/~vannoord/TextCat/competitors.html
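
As far as I know, TextCat-style tools guess the language by comparing 
ranked character n-gram profiles. The following is only a minimal sketch 
of that idea, not the code of either tool: the class name, the profile 
size and the tiny training samples are all assumptions made for 
illustration, and real profiles are built from much larger texts. Note 
that the input must already be decoded with the right charset before the 
n-grams are counted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene or TextCat.
final class NGramLanguageGuesser {
    static final int MAX_N = 3;          // longest n-gram considered (assumption)
    static final int PROFILE_SIZE = 300; // n-grams kept per profile (assumption)

    private final Map<String, List<String>> profiles = new LinkedHashMap<>();

    // Ranks the character n-grams (length 1..MAX_N) of a text by frequency.
    static List<String> profile(String text) {
        String s = "_" + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", "_") + "_";
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= MAX_N; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        // Most frequent first; ties are broken alphabetically to stay deterministic.
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ranked.size() > PROFILE_SIZE
                ? new ArrayList<>(ranked.subList(0, PROFILE_SIZE))
                : ranked;
    }

    // "Out of place" distance: sum of rank differences, with a fixed
    // penalty for n-grams that are missing from the language profile.
    static int distance(List<String> doc, List<String> lang) {
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < lang.size(); i++) {
            rank.put(lang.get(i), i);
        }
        int total = 0;
        for (int i = 0; i < doc.size(); i++) {
            Integer r = rank.get(doc.get(i));
            total += (r == null) ? PROFILE_SIZE : Math.abs(r - i);
        }
        return total;
    }

    void train(String language, String sample) {
        profiles.put(language, profile(sample));
    }

    // Returns the trained language with the smallest distance, or null if none is trained.
    String guess(String text) {
        List<String> doc = profile(text);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> e : profiles.entrySet()) {
            int d = distance(doc, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NGramLanguageGuesser guesser = new NGramLanguageGuesser();
        guesser.train("en", "the quick brown fox jumps over the lazy dog and the cat sleeps in the house");
        guesser.train("de", "der schnelle braune fuchs springt ueber den faulen hund und die katze schlaeft im haus");
        System.out.println(guesser.guess("the dog sleeps in the house"));
    }
}
```

With samples this small the guess is unreliable for short or unusual 
input; the sketch only shows the mechanics.
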
Note that it is important to choose the proper Analyzer when creating the 
index and to use the same one when searching: if the searcher uses a 
different analyzer, the results might be incorrect. Example: if we index 
a German document using the Lucene GermanAnalyzer and then search for a 
German word in the doc using StandardAnalyzer, it is possible that the 
word is not found.
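
To show why the mismatch matters, here is a small self-contained sketch. 
It does not use Lucene: the two "analyzers" below are hypothetical toy 
functions (one only lower-cases, the other also strips a few German-like 
suffixes), standing in for StandardAnalyzer and GermanAnalyzer. The point 
is only that a query term must be normalized the same way as the indexed 
terms.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Toy stand-ins for analyzers; these are NOT Lucene classes.
final class AnalyzerMismatchSketch {

    // Hypothetical "plain" analyzer: lower-cases the token and nothing else.
    static String plain(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    // Hypothetical "German-like" analyzer: lower-cases and strips a few
    // common suffixes. A real stemmer is far more elaborate.
    static String germanLike(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        for (String suffix : new String[] {"en", "er", "e"}) {
            if (t.length() > suffix.length() + 2 && t.endsWith(suffix)) {
                return t.substring(0, t.length() - suffix.length());
            }
        }
        return t;
    }

    // Builds the set of indexed terms for a text with the given analyzer.
    static Set<String> index(String text, Function<String, String> analyzer) {
        Set<String> terms = new HashSet<>();
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(analyzer.apply(token));
            }
        }
        return terms;
    }

    // A query matches only if its analyzed form equals an indexed term.
    static boolean matches(Set<String> terms, String query, Function<String, String> analyzer) {
        return terms.contains(analyzer.apply(query));
    }

    public static void main(String[] args) {
        // Indexed with the German-like analyzer: "Kinder" is stored as "kind".
        Set<String> terms = index("Die Kinder spielen", AnalyzerMismatchSketch::germanLike);
        // Same analyzer at search time: the query also becomes "kind".
        System.out.println(matches(terms, "Kinder", AnalyzerMismatchSketch::germanLike)); // true
        // Different analyzer at search time: the query stays "kinder".
        System.out.println(matches(terms, "Kinder", AnalyzerMismatchSketch::plain));      // false
    }
}
```
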

*1.2. Determining the language of a document when one part of the text 
is in one language and another part is in a different language.*
I did not find any tools for this, neither free nor commercial.

*2. How to keep the terms for documents when each document is in a 
different language.*
There was a discussion about this on the mailing list, and the approach 
that I like best is the one suggested in this mail:
http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200211.mbox/%3c20021118032236.35558.qmail@web12704.mail.yahoo.com%3e
It suggests indexing all the docs in a single index, no matter which 
analyzer is used for each of them.
I have done some tests with indexing this way (see the attached source 
files and sample text files). The names of the sample text files are 
important: they are hard-coded in the sources to avoid using a language 
recognizer, which I did not have enough time to integrate.
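
The attached sources are not reproduced here. As a rough illustration of 
the idea only, the sketch below keeps one shared term index and picks a 
toy analyzer per document from its file name. Everything in it is an 
assumption: the class, the "name_xx.txt" naming convention, the two 
suffix-stripping analyzers, and the choice to run the query word through 
every analyzer at search time.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical sketch; it does not use Lucene.
final class MixedLanguageIndexSketch {
    // Toy per-language analyzers keyed by language code (assumption).
    private final Map<String, Function<String, String>> analyzers = new LinkedHashMap<>();
    // One shared inverted index: analyzed term -> names of documents containing it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    MixedLanguageIndexSketch() {
        analyzers.put("en", t -> t.toLowerCase(Locale.ROOT).replaceAll("s$", ""));
        analyzers.put("de", t -> t.toLowerCase(Locale.ROOT).replaceAll("(en|er|e)$", ""));
    }

    // Assumed naming convention: "name_en.txt", "name_de.txt".
    static String languageOf(String fileName) {
        int underscore = fileName.lastIndexOf('_');
        int dot = fileName.lastIndexOf('.');
        if (underscore < 0 || dot <= underscore) {
            throw new IllegalArgumentException("cannot read language from " + fileName);
        }
        return fileName.substring(underscore + 1, dot);
    }

    // Analyzes the text with the analyzer for the file's language and
    // adds the resulting terms to the shared index.
    void add(String fileName, String text) {
        Function<String, String> analyzer = analyzers.get(languageOf(fileName));
        if (analyzer == null) {
            throw new IllegalArgumentException("no analyzer for " + fileName);
        }
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(analyzer.apply(token), k -> new TreeSet<>()).add(fileName);
            }
        }
    }

    // Tries the query word with every analyzer and unions the hits.
    Set<String> search(String word) {
        Set<String> hits = new TreeSet<>();
        for (Function<String, String> analyzer : analyzers.values()) {
            hits.addAll(postings.getOrDefault(analyzer.apply(word), Collections.emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        MixedLanguageIndexSketch index = new MixedLanguageIndexSketch();
        index.add("sample_en.txt", "red foxes run");
        index.add("sample_de.txt", "die Kinder spielen");
        System.out.println(index.search("Kinder")); // [sample_de.txt]
        System.out.println(index.search("foxes"));  // [sample_en.txt]
    }
}
```

A shared index like this can produce false matches when analyzers for 
different languages happen to map different words to the same term.
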

*3. The encoding of the documents.*
There is another important thing: the encoding of plain text documents 
written in different languages. When indexing, we have to know the 
encoding of each plain text document, otherwise the indexed terms are 
garbled. For example, if a document is encoded in “ISO-8859-1” and when 
creating the index we wrongly decide it is in “UTF-16”, then the search 
results are wrong.
So maybe we also have to write some class(es) that determine the right 
encoding of a document based on its BOM or, if the BOM is missing, on 
the text contained in it.
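
The BOM part of such a class could look roughly like the sketch below. It 
is only an illustration: the class and method names are made up, only the 
UTF-8 and UTF-16 BOMs are checked (UTF-32 is ignored), and when no BOM is 
present it simply falls back to a charset supplied by the caller instead 
of guessing from the text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class BomSniffer {

    // Returns the charset announced by a byte order mark, or null if there is none.
    static Charset fromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    // Decodes the bytes, skipping the BOM if present; without a BOM the
    // caller-supplied fallback charset is used.
    static String decode(byte[] b, Charset fallback) {
        Charset cs = fromBom(b);
        if (cs == null) {
            return new String(b, fallback);
        }
        int bomLength = cs.equals(StandardCharsets.UTF_8) ? 3 : 2;
        return new String(b, bomLength, b.length - bomLength, cs);
    }

    public static void main(String[] args) {
        // "caf\u00e9" stored as ISO-8859-1: four bytes, no BOM.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        String right = decode(latin1, StandardCharsets.ISO_8859_1);
        String wrong = new String(latin1, StandardCharsets.UTF_16);
        System.out.println(right.equals("caf\u00e9")); // true
        System.out.println(wrong.equals("caf\u00e9")); // false: the bytes were misread as UTF-16
    }
}
```
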
Just for fun: MS Notepad has a bug related to this. When a file created 
by Notepad contains exactly one of these texts: “this app can break” or 
“tuka ima golem bug” (without a line separator), the same Notepad cannot 
read it (unlike Wordpad or other programs) :). The second one means 
“here is a big bug” in Bulgarian.

Best Regards,
Ivan Vasilev


Erick Erickson wrote:
> I know this has been discussed several times, but sure don't remember the
> answers. Search the mail archive for "multiple languages" and you'll find
> some good suggestions. But as I remember, it's not a trivial issue.
>
> But I don't see why the "three different documents" approach wouldn't 
> work.
> You could also index the same text in three different fields in a single
> document, using different language analyzers for each (See
> PerFieldAnalyzerWrapper).....
>
> Erick
>
> On 2/22/07, Ivan Vasilev <ivasilev@sirma.bg> wrote:
>>
>> Hi All,
>>
>> Our application, which uses Lucene for indexing, will be used to index
>> documents, each of which contains parts written in different
>> languages. For example, a document could contain English, Chinese and
>> Brazilian text. So how should such a document be indexed? Is there a
>> best practice for this?
>>
>> What comes to my mind is to index 3 different Lucene Documents for the
>> real document and keep in a database the meta info that these 3
>> Documents are related to our real doc. For example, for myDoc.doc we
>> will have in the index myDocEn.doc, myDocCn.doc and myDocBr.doc, and
>> when the searched word is found in myDocCn.doc we will show myDoc.doc
>> to the user. The disadvantage is that in this case the occurrences of
>> the searched item will have to be recalculated. This is important for
>> queries like "Red NEAR/10 fox". So if someone knows a better practice
>> than this, please let me know.
>>
>> Thanks in advance,
>> Ivan
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>



