lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Glen Newton" <glen.new...@gmail.com>
Subject Re: Multi-language support within a single index
Date Thu, 05 Jun 2008 17:07:01 GMT
Yes, thank-you for the pointer, and apologies for not doing my
homework better. :-)
It is exactly what I want.

The scenario is where I have articles which tend to be in english and
have abstracts, and for some of them have french language abstracts.
Users may want to search the english abstracts or the french abstracts
or both.

thanks,

Glen

2008/6/5 Erick Erickson <erickerickson@gmail.com>:
> I'm not sure what you're getting at, but it seems awful similar to
> PerFieldAnalyzerWrapper that already exists and does (it seems
> to me on a quick scan) to do exactly what you want. And it
> works for both indexing and querying out-of-the-box.....
>
> Best
> Erick
>
> On Thu, Jun 5, 2008 at 12:14 PM, Glen Newton <glen.newton@gmail.com> wrote:
>
>> I would like to be able to get multi-language support within a single
>> index.
>> I would appreciate input on what I am suggesting:
>>
>> Assuming that you want something like the following in your document:
>> Title_english
>> Title_french
>> Title_german
>> Keyword_english
>> Keyword_french
>> Keyword_german
>>
>> Let's pretend for now that each of these was created with a different
>> appropriate analyzer and the mechanisms for doing this exist (see end
>> of post for more on this).
>>
>> How to handle a query?
>> Could we associate an Analyzer with a set of fields, like this:
>> // pseudo java
>> Analyzer ea = new EnglishAnalyzer({"TitleEnglish", "KeywordEnglish"});
>> Analyzer fa = new FrenchAnalyzer({"TitleFrench", "KeywordFrench"});
>> Analyzer ga = new EnglishAnalyzer({"TitleEnglish", "KeywordEnglish"});
>> Analyzer ml = new MultiLanguageAnalyzer();
>> (MultiLanguageAnalyzer)ml.add(ea);
>> (MultiLanguageAnalyzer)ml.add(fa);
>> (MultiLanguageAnalyzer)ml.add(ga);
>> QueryParser parser = MultiLanguageParser("TitleEnglish", ml);
>> // end
>>
>> Now when
>>  parser.parse("TitleEnglish: foo TitleFrench:bar  smith")
>> is called, MultiLanguageParser uses the appropriate analyzer for each
>> field in the query to parse the sub-query & rolls up all of the
>> queries created by these analyzers into the real query.
>>
>> I am thinking that this would require having separate term
>> dictionaries for each language, thus demanding a significant change in
>> the index format? [Note I am not an expert on Lucene internals]
>>
>> Of course, something similar to the above could be used adding
>> documents to the index.
>>
>> Looking at:
>>  http://lucene.apache.org/java/docs/fileformats.html#Per-Segment%20Files
>> It seems that it would need - instead of the present single set - a
>> set of segment files for each analyzer: .fnm (Fields), tis & tii (term
>> dictionary), .frq (term frequencies), .prx (positions), .nrm
>> (normalizations), .tvx, .tvd, .tvf (term vectors).
>> How stable is the code for this part of the index & would it easily
>> support this kind of extension? Or would some re-factoring be needed
>> to make these sorts of manipulations to the nature of the segments
>> files easier for mere mortal developers?  :-)
>>
>> Is this something that is already being talked about/looked in
>> to/being implemented? :-)
>>
>> thanks,
>>
>> Glen Newton
>> http://zzzoot.blogspot.com/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>



-- 

-

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message