lucene-java-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Multiple Languages with Lucene (Arabic & English)
Date Tue, 24 Jul 2007 12:10:53 GMT

On Jul 24, 2007, at 3:21 AM, Elie Choueiri wrote:

> Hi
>
> I'm new to searching and am trying to use Lucene to search English &
> Arabic documents.  I've got a bunch of questions (hopefully you'll find
> some interesting!) and am hoping someone's gone through some of them and
> has some answers for me!
>
> First, do I have to worry about the Arabic Analyzer overwriting the index
> files of the English analyzer? (Or vice versa?)
>
> i.e. When I index documents a second time, will data be overwritten?
>

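That depends on whether you tell Lucene to create a new index or to
append to the existing one, not on which analyzer you use.  See the
IndexWriter API for your options (the boolean create argument to its
constructor).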

>
> I could just store the index files for different languages in a different
> location, but it's good to know and I'd rather not if I don't have to :)
>
> Also, on the same note, if I'm indexing documents that contain both Arabic
> and English, will the index files created by the English (or Arabic)
> analyzer contain garbage or become corrupted because of the language
> difference?
>

I don't know if it will be corrupted, but it probably won't be all that
useful, either.  You may find the PerFieldAnalyzerWrapper helpful.
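
To make the per-field idea concrete, here is a minimal sketch in plain
Java.  It does not use Lucene at all: the class, the two toy analyzers and
the field names body_en / body_ar are made up for illustration, and
stripping a leading "ال" is only a crude stand-in for real Arabic
analysis.  It only models the behaviour PerFieldAnalyzerWrapper gives you:
a default analyzer plus overrides chosen by field name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Toy model of per-field analysis; NOT Lucene's API.
public class PerFieldAnalysisSketch {

    // Arabic definite article (alef + lam), written with Unicode escapes
    // so the source compiles regardless of file encoding.
    static final String AL = "\u0627\u0644";

    // Split on anything that is not a Unicode letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Hypothetical English analysis: tokenize and lowercase.
    static List<String> analyzeEnglish(String text) {
        List<String> out = new ArrayList<>();
        for (String t : tokenize(text)) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Hypothetical Arabic analysis: tokenize and strip a leading article.
    // Tokens without the article (including Latin-script ones) pass through.
    static List<String> analyzeArabic(String text) {
        List<String> out = new ArrayList<>();
        for (String t : tokenize(text)) {
            out.add(t.startsWith(AL) && t.length() > 3 ? t.substring(2) : t);
        }
        return out;
    }

    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField =
            new HashMap<>();

    PerFieldAnalysisSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Register an analyzer that overrides the default for one field.
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    // Analyze text with the field's analyzer, or the default if none is set.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    public static void main(String[] args) {
        PerFieldAnalysisSketch wrapper =
                new PerFieldAnalysisSketch(PerFieldAnalysisSketch::analyzeEnglish);
        wrapper.addAnalyzer("body_ar", PerFieldAnalysisSketch::analyzeArabic);

        System.out.println(wrapper.analyze("body_en", "Lucene Rocks"));
        System.out.println(wrapper.analyze("body_ar",
                "\u0627\u0644\u0643\u062a\u0627\u0628 Lucene"));
    }
}
```

With something like this, both languages can live in one index: each
document puts its English text in one field and its Arabic text in
another, and the same wrapper is used at index time and at query time.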

>
> It is possible to index (using an English/Latin/Standard analyzer) a file
> that contains both english and arabic words, and expect the searches in
> English using the same analyzer to be valid, right?

I should think so.  I don't recall running across this case much, but I
do remember the reverse: Arabic files with some English in them.  The
Arabic analyzer usually just skipped over the English, leaving it intact,
so searching for those English terms in the Arabic index worked just fine.

>
> In an Arabic document with a single English word (the name of a
> corporation, for example) will the English word even be indexed and
> located by a search?  I could test something like this with a small subset
> of documents, but I doubt the actual usefulness of a test with such a tiny
> (relatively speaking!) amount of data.. I know we can tell Lucene to store
> the full copy of the document, but does that affect the index itself?
>
> Finally, and here's the tricky one, are searches that contain both English
> and Arabic words valid?  My limited understanding of the way search engines
> work tells me the search analyzes the context of words as well as
> statistical data to decide the relevance of hits, is this still valid for
> multi-lingual searches?
>

They are valid; I'm just not sure how useful they are, but that is for
your app to decide.  I guess if your users know both Arabic and English,
it probably isn't a big deal.  Lucene just tries to match what is in the
query with what is in the index, so if the same analysis produces valid
tokens for both the query and the index, Lucene should find them.
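
As a rough illustration of that matching, here is a toy inverted index in
plain Java.  It is not Lucene code: the class name, the analyze method and
the sample documents are made up, and counting matched query terms is only
a stand-in for real relevance scoring.  The point is that a mixed
English/Arabic query finds documents as long as the query and the
documents go through the same analysis.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index; NOT Lucene's API.
public class MixedLanguageMatchSketch {

    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Shared analysis for documents and queries: split on non-letters and
    // lowercase.  Lowercasing leaves Arabic tokens unchanged.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    void add(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Returns docId -> number of distinct query terms found in that document.
    Map<Integer, Integer> search(String query) {
        Map<Integer, Integer> hits = new TreeMap<>();
        for (String term : new LinkedHashSet<>(analyze(query))) {
            Set<Integer> docs =
                    postings.getOrDefault(term, Collections.<Integer>emptySet());
            for (int docId : docs) {
                hits.merge(docId, 1, Integer::sum);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        MixedLanguageMatchSketch index = new MixedLanguageMatchSketch();
        index.add(0, "Lucene is a search library");
        // Arabic word for "library" followed by an English name.
        index.add(1, "\u0645\u0643\u062a\u0628\u0629 Lucene");
        // Arabic word for "book" only.
        index.add(2, "\u0643\u062a\u0627\u0628");

        // Mixed query: one English term and one Arabic term.
        System.out.println(index.search("lucene \u0645\u0643\u062a\u0628\u0629"));
    }
}
```

Document 1 matches both query terms, document 0 matches only the English
one, and document 2 matches neither, so it is not returned at all.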

HTH,
Grant

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ




