lucene-java-user mailing list archives

From Sergiu Gordea <>
Subject Re: one huge index or many small ones?
Date Thu, 04 Nov 2004 18:01:53 GMT
javier muguruza wrote:

 Hi Javier,

 I think your optimization should focus on the response time of
search queries. I assume this is the
variable you need to optimize. It would probably be a good idea to read
the Lucene benchmarks first:

 If you have a mandatory date constraint for each of your searches, you
can split the index by time. I assume that
one index per month will be enough. With 10,000 emails, I think it
will be fast enough if each search afterwards touches only one index.
But perhaps this is not such a good idea?
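
A rough sketch of how a time-based split could be addressed, assuming one index directory per calendar month. The class name `MonthlyIndexes` and the `mail-YYYY-MM` naming scheme are made up for illustration and are not part of Lucene; the code only works out which monthly indexes a date range would touch.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: maps a date constraint to per-month index names.
class MonthlyIndexes {

    // Assumed naming scheme: "mail-" plus the ISO year-month, e.g. "mail-2004-11".
    static String indexNameFor(YearMonth month) {
        return "mail-" + month; // YearMonth.toString() gives "2004-11"
    }

    // Returns the names of all monthly indexes overlapping [from, to], both inclusive.
    static List<String> indexNamesFor(LocalDate from, LocalDate to) {
        if (to.isBefore(from)) {
            throw new IllegalArgumentException("'to' must not be before 'from'");
        }
        List<String> names = new ArrayList<>();
        YearMonth last = YearMonth.from(to);
        for (YearMonth m = YearMonth.from(from); !m.isAfter(last); m = m.plusMonths(1)) {
            names.add(indexNameFor(m));
        }
        return names;
    }

    public static void main(String[] args) {
        // A search restricted to 15 Oct 2004 through 3 Dec 2004 touches three indexes.
        System.out.println(indexNamesFor(LocalDate.of(2004, 10, 15), LocalDate.of(2004, 12, 3)));
    }
}
```

A query confined to a single month resolves to exactly one name, which is the case where the split pays off; a year-long query would still have to visit twelve indexes.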

What about creating one index per user? If your searches require a user or
a sender, you can get the name from the database and apply only
the other constraints to an index dedicated to that user. I think the
Lucene search will then be much faster.

The database search will also be fast. I don't think you will have
more than 1,000-10,000 user names.

or maybe 1 index/user/year

or 1 index/receiver/year + 1 index/sender/year
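
To make the per-user variants concrete, here is a minimal sketch that assumes the database lookup has already produced a numeric user id. The class `UserYearIndexes`, the `Role` enum, and the `role-userId-year` naming scheme are hypothetical and serve only to show how a query would be routed to the dedicated indexes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for the "1 index/receiver/year + 1 index/sender/year" layout.
class UserYearIndexes {

    enum Role { SENDER, RECEIVER }

    // Assumed naming scheme, e.g. "sender-42-2004".
    static String indexNameFor(Role role, long userId, int year) {
        return role.name().toLowerCase(Locale.ROOT) + "-" + userId + "-" + year;
    }

    // Lists the yearly indexes to search for one user, fromYear..toYear inclusive.
    static List<String> indexNamesFor(Role role, long userId, int fromYear, int toYear) {
        if (toYear < fromYear) {
            throw new IllegalArgumentException("toYear must not be before fromYear");
        }
        List<String> names = new ArrayList<>();
        for (int year = fromYear; year <= toYear; year++) {
            names.add(indexNameFor(role, userId, year));
        }
        return names;
    }

    public static void main(String[] args) {
        // User id 42 would come from the database query on the sender name.
        System.out.println(indexNamesFor(Role.SENDER, 42L, 2003, 2004));
    }
}
```

With this layout the user constraint is enforced by choosing the index, so only the remaining constraints need to go into the Lucene query itself.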

 What about this solution? Is it feasible for your system?

 All the best,


>Thanks Erik and Giulio for the fast reply.
>I am just starting to look at lucene so forgive me if I got some ideas
>wrong. I understand your concerns about one index per email. But
>having one index only is also (I guess) out of question.
>I am building an email archive. Email will be kept indefinitely
>available for search, adding new email every day. Imagine a company
>with millions of emails per day (been there), keep it growing for
>years, adding stuff to the index while using it for searches
>That's why my idea is to decide on a time frame (a day, a
>extreme would be an instant, that is a single email, my original idea)
>and build the index for all the email in that timeframe. After the
>timeframe is finished no more stuff will be ever added.
>Before the lucene search emails are selected based on other conditions
>(we store the from, to, date etc in database as well, and these
>conditions are enforced with a sql query first, so I would not need to
>enforce them in the lucene search again, also that query can be quite
>sophisticated and I guess would not be easily possible to do in
>lucene by itself). That first db step gives me a group of emails that
>maybe I have to further narrow down based on a lucene search (of body
>and attachment contents). Having an index for more than one email
>means that after the search I would have to get only the overlapping
>emails from the two searches...Maybe this is better than keeping the
>same info I have in the db in lucene fields as well.
>An example: I want all the email from from Jan
>to Dec containing the word 'money'. I run the db query that returns a
>list with john's email for that period of time, then (let's assume I
>have one index per day) I iterate on every day, looking for emails
>that contain 'money', from the results returned by lucene I keep only
>those that are also in the first list.
>Does that sound better? 
>On Thu, 4 Nov 2004 17:26:21 +0100, Giulio Cesare Solaroli
><> wrote:
>>Hi Javier,
>>I suggest you build a single index, with all the information you
>>need to find the right mail you are looking for. You can then use
>>Lucene alone to find your messages.
>>Giulio Cesare
>>On Thu, 4 Nov 2004 17:00:35 +0100, javier muguruza <> wrote:
>>>We are going to move from a just-in-time perl based search to using
>>>lucene in our project. I have to index emails (bodies and also
>>>attachments). I keep in the filesystem all the bodies and attachments
>>>for a long period of time. I have to find emails that fulfill certain
>>>conditions, some of the conditions are taken care of at a different
>>>level, so in the end I have a SUBSET of emails I have to run through
>>>I was assuming that the best way would be to create an index for each
>>>email. Having a single index for a group of emails (say a day's worth
>>>of email) seems too coarse grained, imagine a day has 10000 emails,
>>>and some queries will want to look in only a handful of the
>>>emails...But the problem with having one index per email is the
>>>massive number of emails...imagine having 100000 indexes
>>>Anyway, any idea about that? I just wanted to check whether someone
>>>feels I am wrong.
>>>To unsubscribe, e-mail:
>>>For additional commands, e-mail:
