lucene-java-user mailing list archives

From Daniel Noll
Subject Re: Indexing with SnowballAnalyzer and multiple languages in a single index
Date Wed, 26 Apr 2006 02:56:29 GMT

> You can have multiple languages in the same index.  Just make sure that
> your language identification process is consistent.
> You might still get some false positives, for example, if there's a
> German root that has the same letters as a French root, but means
> something different, then it might still show up.  Personally, I don't
> really know how many times that actually happens.
> Lucene treats all _post-analyze_ tokens the same, it is pretty much
> language ignorant, so as long as the UTF characters are the same, it
> treats the tokens as the same.

I suppose one could work around that by prepending the language code to 
every token.  Then those two words won't match each other, while 
stemming is preserved.
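
To make that concrete, here is a rough sketch of the prefixing step in plain Java.  It deliberately avoids the Lucene API and assumes the tokens have already been stemmed by whatever per-language analyzer you use.  The class name, the method name and the ':' separator are all made up for illustration; pick a separator your analyzer never emits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LanguageTaggedTokens {
    // Hypothetical separator between language code and stem.
    static final char SEPARATOR = ':';

    /** Prefixes each already-stemmed token with the given language code. */
    static List<String> tag(String languageCode, List<String> stemmedTokens) {
        List<String> tagged = new ArrayList<>();
        for (String token : stemmedTokens) {
            tagged.add(languageCode + SEPARATOR + token);
        }
        return tagged;
    }

    public static void main(String[] args) {
        // "chat" exists in both English and French with different meanings.
        System.out.println(tag("en", Arrays.asList("chat"))); // [en:chat]
        System.out.println(tag("fr", Arrays.asList("chat"))); // [fr:chat]
    }
}
```

Since the prefix is added after stemming, each language's stems are unchanged; the two "chat" tokens simply end up as different terms in the index.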

The real problem as I see it is when two languages have an *identical* 
word and the user types it in as their search query.  Then you have to 
wonder which language it's from.  Perhaps you would expand the query to 
search every language in which that word occurs.  Or perhaps you would 
add a little drop-down next to the query box where the user can 
indicate what language the query is in.
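
Both options could look roughly like the sketch below, again in plain Java rather than the Lucene API.  It assumes terms are indexed as language code + ':' + stem, and that you have one stemming function per language.  The stemmers here are toy placeholders, and every name is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

class MultiLanguageQuery {
    /**
     * Stems the raw query term once per candidate language and returns one
     * language-tagged term for each. The caller would OR these together.
     */
    static List<String> expand(String rawTerm,
                               Map<String, UnaryOperator<String>> stemmers) {
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, UnaryOperator<String>> e : stemmers.entrySet()) {
            terms.add(e.getKey() + ":" + e.getValue().apply(rawTerm));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Toy stemmers for illustration only, not real Snowball stemmers.
        Map<String, UnaryOperator<String>> stemmers = new LinkedHashMap<>();
        stemmers.put("en", t -> t.endsWith("s") ? t.substring(0, t.length() - 1) : t);
        stemmers.put("fr", t -> t);

        // No language chosen: search every candidate language.
        System.out.println(expand("chats", stemmers)); // [en:chat, fr:chats]

        // Language picked from the drop-down: pass only that stemmer.
        Map<String, UnaryOperator<String>> onlyFrench = new LinkedHashMap<>();
        onlyFrench.put("fr", stemmers.get("fr"));
        System.out.println(expand("chats", onlyFrench)); // [fr:chats]
    }
}
```

The drop-down case is just the expansion case with a single language in the map, so one code path covers both.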


Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web:                        Fax: +61 2 9212 6902

