lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Glen Newton" <glen.new...@gmail.com>
Subject Multi-language support within a single index
Date Thu, 05 Jun 2008 16:14:37 GMT
I would like to be able to get multi-language support within a single index.
I would appreciate input on what I am suggesting:

Assuming that you want something like the following in your document:
Title_english
Title_french
Title_german
Keyword_english
Keyword_french
Keyword_german

Let's pretend for now that each of these was created with a different
appropriate analyzer and the mechanisms for doing this exist (see end
of post for more on this).

How to handle a query?
Could we associate an Analyzer with a set of fields, like this:
// pseudo java
Analyzer ea = new EnglishAnalyzer({"TitleEnglish", "KeywordEnglish"});
Analyzer fa = new FrenchAnalyzer({"TitleFrench", "KeywordFrench"});
Analyzer ga = new EnglishAnalyzer({"TitleEnglish", "KeywordEnglish"});
Analyzer ml = new MultiLanguageAnalyzer();
(MultiLanguageAnalyzer)ml.add(ea);
(MultiLanguageAnalyzer)ml.add(fa);
(MultiLanguageAnalyzer)ml.add(ga);
QueryParser parser = MultiLanguageParser("TitleEnglish", ml);
// end

Now when
  parser.parse("TitleEnglish: foo TitleFrench:bar  smith")
is called, MultiLanguageParser uses the appropriate analyzer for each
field in the query to parse the sub-query & rolls up all of the
queries created by these analyzers into the real query.

I am thinking that this would require having separate term
dictionaries for each language, thus demanding a significant change in
the index format? [Note I am not an expert on Lucene internals]

Of course, something similar to the above could be used adding
documents to the index.

Looking at:
  http://lucene.apache.org/java/docs/fileformats.html#Per-Segment%20Files
It seems that it would need - instead of the present single set - a
set of segment files for each analyzer: .fnm (Fields), tis & tii (term
dictionary), .frq (term frequencies), .prx (positions), .nrm
(normalizations), .tvx, .tvd, .tvf (term vectors).
How stable is the code for this part of the index & would it easily
support this kind of extension? Or would some re-factoring be needed
to make these sorts of manipulations to the nature of the segments
files easier for mere mortal developers?  :-)

Is this something that is already being talked about/looked in
to/being implemented? :-)

thanks,

Glen Newton
http://zzzoot.blogspot.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message