lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From sergiu gordea <gser...@ifit.uni-klu.ac.at>
Subject Re: one huge index or many small ones?
Date Fri, 05 Nov 2004 09:32:13 GMT
javier muguruza wrote:

>Sergiu,
>
>A month could have tens of millions of emails in the worst case, but
>maybe I could discard such bad assumption for our current project.
>Lets say 10000 emails per day max, that makes 300k emails a month.
>Either I would choose one index per day or per month (or week or
>whatever).
>
>Your suggestion about index per user is not valid, my searches do not
>require a user desafortunately. They can maybe say 'all email from
>department C from last week' etc. So, if I choose one index per day(or
>month) I already know that I will have to search in many indexes
>depending on the timeframe (the time frame is the only required value
>for the search)
>
>thanks for the suggestions!
>  
>
If you have time frame and department constraints,  and this repartition 
will result in a 10-30 k emails, probably it will
be a good Idea to create 1 index /dep/month.

Anyway .. changing the structure of the index is no problem in your case 
since you display the infromation from external sources
(database, filesystem...). 
My advice is to create  a function that will build/rebuild the indexes 
for the biging, work with that indexing, test if it fulfils your 
requirements,
and then you can refactor your code in the future. So ... chose one 
solution and implement the first prototype, and keep in mind that your
information is managed by the database, and lucene is just your search 
module.

 Sergiu

>
>On Thu, 04 Nov 2004 19:01:53 +0100, Sergiu Gordea
><gsergiu@ifit.uni-klu.ac.at> wrote:
>  
>
>>javier muguruza wrote:
>>
>>Hi Javier,
>>
>>I think the your optimization should take care of the response time of
>>search queries. I asume that this is the
>>variable  you need to optimize. Probably it will be a good thing to read
>>first the lucene benchmarks:
>>http://jakarta.apache.org/lucene/docs/benchmarks.html.
>><http://jakarta.apache.org/lucene/docs/benchmarks.html>
>>
>>If you have a mandatory date constraint for each of your indexes you
>>can split the index on time basis, I asume that
>>one index per month will be enough I think ... 10.000 emails I think it
>>will be fast enough if you will search in only one index afterwards.
>>But I think this is not such a good Idea?
>>
>>What about creating one index per user? If your search require a user or
>>a sender, and you can get its name from database, and apply only
>>the other constrains on an index dedicated to that user .. I think the
>>lucene search will be much more faster.
>>
>>Also the database search will be fast .. I don'T think you will have
>>more then 1.000-10.000 user names.
>>
>>or maybe 1 index/user/year
>>
>>or 1 index/receiver/year + 1index/sender/year
>>
>>What about this solution is it feasible for your system?
>>
>>All the best,
>>
>> Sergiu
>>
>>
>>
>>    
>>
>>>Thanks Erik and Giulio for the fast reply.
>>>
>>>I am just starting to look at lucene so forgive me if I got some ideas
>>>wrong. I understand your concerns about one index per email. But
>>>having one index only is also (I guess) out of question.
>>>
>>>I am building an email archive. Email will be kept indefinitely
>>>available for search, adding new email every day. Imagine a company
>>>with millions of emails per day (been there), keep it growing for
>>>years, adding stuff to the index while using it for searches
>>>continuously...
>>>
>>>That's why my idea is to decide on a time frame (a day, a month...an
>>>extreme would be an instant, that is a single email, my original idea)
>>>and build the index for all the email in that timeframe. After the
>>>timeframe is finished no more stuff will be ever added.
>>>
>>>Before the lucene search emails are selected based on other conditions
>>>(we store the from, to, date etc in database as well, and these
>>>conditions are enforced with a sql query first, so I would not need to
>>>enforce them in the lucene search again, also that query can be quite
>>>sophisticated and I guess would not be easyly possible to do it in
>>>lucene by itself). That first db step gives me a group of emails that
>>>maybe I have to further narrow down based on a lucene search (of body
>>>and attachment contents). Having an index for more than one emails
>>>means that after the search I would have to get only the overlaping
>>>emails from the two searches...Maybe this is better than keeping the
>>>same info I have in the db in lucene fields as well.
>>>
>>>An example: I want all the email from john.doe@something.com from Jan
>>>to Dec containing the word 'money'. I run the db query that returns a
>>>list with john's email for that period of time, then (lets assume I
>>>have one index per day) I iterate on every day, looking for emails
>>>that contain 'money', from the results returned by lucene I keep only
>>>these that are also in the first list.
>>>
>>>Does that sound better?
>>>
>>>
>>>On Thu, 4 Nov 2004 17:26:21 +0100, Giulio Cesare Solaroli
>>><giulio.cesare@gmail.com> wrote:
>>>
>>>
>>>      
>>>
>>>>Hi Javier,
>>>>
>>>>I suggest you to build a single index, with all the information you
>>>>need to find the right mail you are looking for. You than can use
>>>>Lucene alone to find you messages.
>>>>
>>>>Giulio Cesare
>>>>
>>>>
>>>>
>>>>
>>>>On Thu, 4 Nov 2004 17:00:35 +0100, javier muguruza <jmuguruza@gmail.com>
wrote:
>>>>
>>>>
>>>>        
>>>>
>>>>>Hi,
>>>>>
>>>>>We are going to move from a just-in-time perl based search to using
>>>>>lucene in our project. I have to index emails (bodies and also
>>>>>attachements). I keep in the filesystem all the bodies and attachments
>>>>>for a long period of time. I have to find emails that fullfil certain
>>>>>conditions, some of the conditions are take care of at a different
>>>>>level, so in the end I have a SUBSET of emails I have to run through
>>>>>lucene.
>>>>>
>>>>>I was assuming that the best way would be to create an index for each
>>>>>email. Having an unique index for a group of emails (say a day worth
>>>>>of email) seems too coarse grained, imagine a day has 10000 emails,
>>>>>and some queries will like to look in only a handful of the
>>>>>emails...But the problem with having one index per emails is the
>>>>>massive number of emails...imagine having 100000 indexes
>>>>>
>>>>>Anyway, any idea about that? I just wanted to check wether someones
>>>>>feels I am wrong.
>>>>>
>>>>>Thanks
>>>>>
>>>>>---------------------------------------------------------------------
>>>>>
>>>>>
>>>>>          
>>>>>
>>>>        
>>>>
>>>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>>>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>          
>>>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>>
>>>
>>>
>>>      
>>>
>>---------------------------------------------------------------------
>>
>>
>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>
>>
>>    
>>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>  
>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message