lucene-java-user mailing list archives

From Shashikant Kore
Subject Single Analyzer for multiple European languages
Date Mon, 26 Sep 2005 17:51:00 GMT

I plan to use Lucene to index documents in multiple languages (i.e.,
each document in more than one European language), as follows.

- Before indexing, find the language of the document (using Nutch's
Language Identifier).
- Use the Analyzer for that language to index the document. The Analyzer
will be constructed with the stopwords for that language. Stemming will
NOT be used for any language.
- All the documents go into one single index.
- Remember all the languages encountered while creating the index.

- Get the superset of stopwords by merging the stopwords from all the
languages encountered.
- Create an Analyzer with this merged list of stopwords.
- Use this analyzer for all the search queries.
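
To make the plan concrete, here is a rough sketch in plain Java. It
deliberately does not use any Lucene classes: the class name, the method
names and the tiny stopword lists are all made up for illustration, and
analyze() only stands in for a non-stemming analyzer (lowercase, split,
drop stopwords). It also shows one case worth checking: a word that is a
stopword in one language but a normal word in another ("die" in German
vs. English) is kept at index time but dropped from the query.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class StopwordPlanSketch {
    // Hypothetical, heavily shortened stopword lists keyed by language code.
    static final Map<String, Set<String>> STOPWORDS = Map.of(
        "en", Set.of("the", "and", "of"),
        "de", Set.of("die", "der", "und"));

    // Stand-in for an analyzer without stemming:
    // lowercase, split on non-alphanumerics, drop stopwords.
    static List<String> analyze(String text, Set<String> stopwords) {
        List<String> terms = new ArrayList<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!tok.isEmpty() && !stopwords.contains(tok)) {
                terms.add(tok);
            }
        }
        return terms;
    }

    // Superset of the stopwords of every language seen while indexing.
    static Set<String> mergedStopwords(Collection<String> languagesSeen) {
        Set<String> merged = new TreeSet<>();
        for (String lang : languagesSeen) {
            merged.addAll(STOPWORDS.getOrDefault(lang, Set.of()));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Index time: an English document, English stopwords only.
        List<String> indexed = analyze("Never say die", STOPWORDS.get("en"));
        // Search time: one analyzer built from the merged stopword list.
        List<String> query = analyze("say die", mergedStopwords(List.of("en", "de")));
        System.out.println(indexed); // [never, say, die]
        System.out.println(query);   // [say]
    }
}
```

In this sketch the term "die" is in the index but can never be searched
for, because the merged list removes it from every query.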

I have read that one should use the same analyzer during search as the
one used to create the index. I am clearly deviating from this rule,
but since I am not using any language-specific filter, this looks
correct to me. (If the need arises in future to restrict results to a
particular language, I plan to add another field to each document for
the language and use it in the query.)

* While getting the details right, am I falling for some grand fallacy?
Is there any basic assumption in my thinking that is patently wrong?

* Curious question: support for CJK. Since StandardAnalyzer() is good
enough for the major European languages, I can use a separate index for
CJK built with a CJK analyzer, or potentially a separate index for each
of C, J and K. To keep things simple, let's say only one of these
indices will be searched at a time (so as to avoid the complications of
merging results from multiple indices). Is this solution correct?

Thanks in advance.


"Speed is subsittute fo accurancy."
