lucene-solr-user mailing list archives

From Erick Erickson <>
Subject Re: Indexing documents in multiple languages
Date Tue, 27 Jan 2009 20:18:38 GMT
First, I'd search the mail archive for the topic of languages. It's
been discussed often, and there's a wealth of information there that
might be of benefit, far more than I can remember.

As to whether your approach will be "too big, too slow...", you
really haven't given enough information to go on. Answers to a few
questions would help: How many e-mails are you indexing? Are you
indexing attachments? How many users do you expect to be using
this system? What are your target response times? What is your
design queries-per-second? How dynamic is the index (that is,
how many e-mails do you expect to add per day, and what latency
can you live with between the time an e-mail is indexed and when
it's searchable)?

If you're indexing 10,000 e-mails, it's one thing. If you're indexing
1,000,000,000 e-mails it's another.


On Tue, Jan 27, 2009 at 3:05 PM, Alejandro Valdez <> wrote:

> Hi, I plan to use Solr to index a large number of documents extracted
> from email bodies. These documents could be in different languages,
> and a single document could be in more than one language. In the same
> way, the query string could contain words in different languages.
> I read that a common approach to index multilingual documents is to
> use some algorithm (n-gram) to determine the document language, then use a
> stemmer and finally index the document in a different index for each
> language.
> As the language of the document and of the query string can't be detected
> in a reliable way, I think that it makes no sense to use a stemmer on them,
> because a stemmer is tied to a specific language.
> My plan is to index all the documents in the same index, without any
> stemming process (the users will have to search for the exact words that
> they are looking for).
> But I'm not sure if this approach will make the index too big or too
> slow, or if there is a better way to index this kind of document.
> Any suggestions would be much appreciated.
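
For readers unfamiliar with the n-gram language detection mentioned in the
quoted message, here is a minimal sketch in Python. It is not taken from Solr
or from any particular library: the sample sentences, the `SAMPLES` and
`PROFILES` tables, and the `guess_language` function are all hypothetical
names made up for illustration, and the two-sentence "training data" is far
too small for real use. It builds a character-trigram profile per language and
picks the profile most similar to the input by cosine similarity.

```python
from collections import Counter
import math

# Toy training text per language (an assumption for illustration only);
# a real detector would be trained on far larger corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then it runs "
          "into the house where the people are",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y luego "
          "corre hacia la casa donde está la gente",
}

def trigrams(text):
    """Count character trigrams, with whitespace normalized and padded."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

# One trigram frequency profile per language.
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}

def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def guess_language(text):
    """Return (best_language, scores) for the given text."""
    grams = trigrams(text)
    scores = {lang: cosine(grams, profile)
              for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    for sample in ("the dog runs over the house",
                   "el perro corre hacia la casa"):
        lang, scores = guess_language(sample)
        print(sample, "->", lang, scores)
```

On short inputs or on text that mixes languages, the scores for different
profiles tend to sit close together, which is the reliability concern raised
in the quoted message.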
