lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Multi-language support within a single index
Date Thu, 05 Jun 2008 16:54:38 GMT
I'm not sure what you're getting at, but it seems awful similar to
PerFieldAnalyzerWrapper that already exists and does (it seems
to me on a quick scan) to do exactly what you want. And it
works for both indexing and querying out-of-the-box.....

Best
Erick

On Thu, Jun 5, 2008 at 12:14 PM, Glen Newton <glen.newton@gmail.com> wrote:

> I would like to be able to get multi-language support within a single
> index.
> I would appreciate input on what I am suggesting:
>
> Assuming that you want something like the following in your document:
> Title_english
> Title_french
> Title_german
> Keyword_english
> Keyword_french
> Keyword_german
>
> Let's pretend for now that each of these was created with a different
> appropriate analyzer and the mechanisms for doing this exist (see end
> of post for more on this).
>
> How to handle a query?
> Could we associate an Analyzer with a set of fields, like this:
> // pseudo java
> Analyzer ea = new EnglishAnalyzer({"TitleEnglish", "KeywordEnglish"});
> Analyzer fa = new FrenchAnalyzer({"TitleFrench", "KeywordFrench"});
> Analyzer ga = new EnglishAnalyzer({"TitleEnglish", "KeywordEnglish"});
> Analyzer ml = new MultiLanguageAnalyzer();
> (MultiLanguageAnalyzer)ml.add(ea);
> (MultiLanguageAnalyzer)ml.add(fa);
> (MultiLanguageAnalyzer)ml.add(ga);
> QueryParser parser = MultiLanguageParser("TitleEnglish", ml);
> // end
>
> Now when
>  parser.parse("TitleEnglish: foo TitleFrench:bar  smith")
> is called, MultiLanguageParser uses the appropriate analyzer for each
> field in the query to parse the sub-query & rolls up all of the
> queries created by these analyzers into the real query.
>
> I am thinking that this would require having separate term
> dictionaries for each language, thus demanding a significant change in
> the index format? [Note I am not an expert on Lucene internals]
>
> Of course, something similar to the above could be used adding
> documents to the index.
>
> Looking at:
>  http://lucene.apache.org/java/docs/fileformats.html#Per-Segment%20Files
> It seems that it would need - instead of the present single set - a
> set of segment files for each analyzer: .fnm (Fields), tis & tii (term
> dictionary), .frq (term frequencies), .prx (positions), .nrm
> (normalizations), .tvx, .tvd, .tvf (term vectors).
> How stable is the code for this part of the index & would it easily
> support this kind of extension? Or would some re-factoring be needed
> to make these sorts of manipulations to the nature of the segments
> files easier for mere mortal developers?  :-)
>
> Is this something that is already being talked about/looked in
> to/being implemented? :-)
>
> thanks,
>
> Glen Newton
> http://zzzoot.blogspot.com/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message