lucene-java-user mailing list archives

From "Glen Newton" <>
Subject Multi-language support within a single index
Date Thu, 05 Jun 2008 16:14:37 GMT
I would like to be able to get multi-language support within a single index.
I would appreciate input on what I am suggesting:

Assuming that you want something like the following in your document:

Let's pretend for now that each of these was created with a different
appropriate analyzer and the mechanisms for doing this exist (see end
of post for more on this).

How to handle a query?
Could we associate an Analyzer with a set of fields, like this:
// pseudo java
Analyzer ea = new EnglishAnalyzer({"TitleEnglish", "KeywordEnglish"});
Analyzer fa = new FrenchAnalyzer({"TitleFrench", "KeywordFrench"});
Analyzer ga = new GermanAnalyzer({"TitleGerman", "KeywordGerman"});
// the multi-language analyzer wraps the per-language analyzers
Analyzer ml = new MultiLanguageAnalyzer(ea, fa, ga);
// "TitleEnglish" is the default field for unqualified terms
QueryParser parser = new MultiLanguageParser("TitleEnglish", ml);
// end

Now when
  parser.parse("TitleEnglish:foo TitleFrench:bar smith")
is called, MultiLanguageParser uses the appropriate analyzer for each
field in the query to parse that sub-query (the unqualified term "smith"
would go to the default field, TitleEnglish) and rolls up all of the
sub-queries produced this way into the real query.
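
To make the dispatch concrete, here is a toy, self-contained sketch of what
I have in mind. It does not use Lucene at all: MultiLanguageSketch, the two
stand-in "analyzers" and the whitespace-only query syntax are all
hypothetical, made up purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch only: none of these names are Lucene classes.
class MultiLanguageSketch {

    // Stand-in English "analyzer": just lower-cases the text.
    static String englishAnalyze(String text) {
        return text.toLowerCase(Locale.ENGLISH);
    }

    // Stand-in French "analyzer": lower-cases and strips a leading "l'".
    static String frenchAnalyze(String text) {
        String t = text.toLowerCase(Locale.FRENCH);
        return t.startsWith("l'") ? t.substring(2) : t;
    }

    // Field name -> analyzer, mirroring the field sets in the pseudo code.
    static Map<String, Function<String, String>> defaultAnalyzers() {
        Map<String, Function<String, String>> m = new HashMap<>();
        m.put("TitleEnglish", MultiLanguageSketch::englishAnalyze);
        m.put("KeywordEnglish", MultiLanguageSketch::englishAnalyze);
        m.put("TitleFrench", MultiLanguageSketch::frenchAnalyze);
        m.put("KeywordFrench", MultiLanguageSketch::frenchAnalyze);
        return m;
    }

    // Splits the query on whitespace, picks the analyzer registered for each
    // clause's field (or for the default field when no field is given), and
    // collects the analyzed clauses into one list: the "rolled up" query.
    static List<String> parse(String query, String defaultField,
                              Map<String, Function<String, String>> analyzers) {
        List<String> clauses = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty query string
            }
            int colon = token.indexOf(':');
            String field = colon < 0 ? defaultField : token.substring(0, colon);
            String text = colon < 0 ? token : token.substring(colon + 1);
            Function<String, String> analyzer = analyzers.get(field);
            if (analyzer == null) {
                throw new IllegalArgumentException("No analyzer for field: " + field);
            }
            clauses.add(field + ":" + analyzer.apply(text));
        }
        return clauses;
    }

    public static void main(String[] args) {
        // prints [TitleEnglish:foo, TitleFrench:avion, TitleEnglish:smith]
        System.out.println(parse("TitleEnglish:Foo TitleFrench:L'Avion smith",
                                 "TitleEnglish", defaultAnalyzers()));
    }
}
```

The point is only that the parser, not the caller, chooses the analyzer,
based on the field name attached to each clause; a real version would have
to cope with phrases, boolean operators and so on.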

I am thinking that this would require having separate term
dictionaries for each language, thus demanding a significant change in
the index format? [Note I am not an expert on Lucene internals]

Of course, something similar to the above could be used when adding
documents to the index.

Looking at:
It seems that it would need, instead of the present single set, a
set of segment files for each analyzer: .fnm (fields), .tis & .tii (term
dictionary), .frq (term frequencies), .prx (positions), .nrm
(normalization factors), .tvx, .tvd, .tvf (term vectors).
How stable is the code for this part of the index & would it easily
support this kind of extension? Or would some re-factoring be needed
to make these sorts of manipulations to the nature of the segments
files easier for mere mortal developers?  :-)

Is this something that is already being talked about, looked into, or
implemented? :-)


Glen Newton
