lucene-java-user mailing list archives

From javier muguruza <jmugur...@gmail.com>
Subject Re: one huge index or many small ones?
Date Thu, 04 Nov 2004 17:03:49 GMT
Thanks Erik and Giulio for the fast reply.

I am just starting to look at Lucene, so forgive me if I got some ideas
wrong. I understand your concerns about one index per email. But
having only one index is also (I guess) out of the question.

I am building an email archive. Email will remain searchable
indefinitely, with new email added every day. Imagine a company
with millions of emails per day (been there); keep it growing for
years, adding to the index while it is continuously being used for
searches...

That's why my idea is to decide on a time frame (a day, a month... an
extreme would be an instant, that is a single email, my original idea)
and build one index for all the email in that time frame. Once the
time frame is over, nothing more will ever be added to that index.
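
A minimal sketch of the time-frame idea, assuming one index directory per
day or per month. The class name, the method name and the "mail-" prefix
are hypothetical and not part of Lucene; the sketch only shows how an
email's date could be mapped to the index that would hold it.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class IndexPartition {
    // Hypothetical granularity: the time frame covered by one index.
    enum Frame { DAY, MONTH }

    // Returns the (made-up) directory name of the index that would hold
    // all email received on the given date.
    static String indexNameFor(LocalDate received, Frame frame) {
        DateTimeFormatter fmt = (frame == Frame.DAY)
                ? DateTimeFormatter.ofPattern("yyyy-MM-dd")
                : DateTimeFormatter.ofPattern("yyyy-MM");
        return "mail-" + received.format(fmt);
    }

    public static void main(String[] args) {
        LocalDate received = LocalDate.of(2004, 11, 4);
        System.out.println(indexNameFor(received, Frame.DAY));   // mail-2004-11-04
        System.out.println(indexNameFor(received, Frame.MONTH)); // mail-2004-11
    }
}
```

Because the name depends only on the date, an index whose time frame has
passed is never written to again, which is the property described above.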

Before the Lucene search, emails are selected based on other conditions
(we store the from, to, date etc. in a database as well, and these
conditions are enforced with a SQL query first, so I would not need to
enforce them in the Lucene search again; also, that query can be quite
sophisticated and I guess it would not be easy to do in Lucene by
itself). That first db step gives me a group of emails that I may have
to narrow down further with a Lucene search (of body and attachment
contents). Having an index that covers more than one email means that
after the search I would have to keep only the emails that appear in
both result sets... Maybe this is better than duplicating in Lucene
fields the info I already have in the db.

An example: I want all the email from john.doe@something.com from Jan
to Dec containing the word 'money'. I run the db query, which returns a
list of John's email for that period, then (let's assume I have one
index per day) I iterate over every day, looking for emails that
contain 'money'. From the results returned by Lucene I keep only
those that are also in the first list.
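
The example can be sketched in plain Java as follows. This is only an
illustration under stated assumptions: the per-day Lucene search is
replaced by a hypothetical DayIndexSearcher interface, emails are
identified by made-up string ids (m1, m2, ...), and the set of ids
returned by the db query is passed in ready-made.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TwoStepSearch {
    // Hypothetical stand-in for searching one per-day index: returns the
    // ids of the emails of that day whose content contains the word.
    interface DayIndexSearcher {
        List<String> search(LocalDate day, String word);
    }

    // Walks every day from 'from' to 'to' (inclusive), searches that day's
    // index, and keeps only the hits that the db query also returned.
    static List<String> search(Set<String> dbMatches, LocalDate from,
                               LocalDate to, String word,
                               DayIndexSearcher searcher) {
        List<String> result = new ArrayList<>();
        for (LocalDate day = from; !day.isAfter(to); day = day.plusDays(1)) {
            for (String id : searcher.search(day, word)) {
                if (dbMatches.contains(id)) {
                    result.add(id);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Step 1 (assumed done elsewhere): ids returned by the SQL query.
        Set<String> fromDb = new HashSet<>(Arrays.asList("m1", "m3", "m9"));

        // Fake per-day hits for 'money'; a real searcher would query Lucene.
        Map<LocalDate, List<String>> fakeHits = new HashMap<>();
        fakeHits.put(LocalDate.of(2004, 1, 5), Arrays.asList("m1", "m2"));
        fakeHits.put(LocalDate.of(2004, 3, 10), Arrays.asList("m3"));
        DayIndexSearcher searcher =
                (day, word) -> fakeHits.getOrDefault(day, Collections.emptyList());

        // Step 2: search each day's index and intersect with the db list.
        System.out.println(search(fromDb, LocalDate.of(2004, 1, 1),
                LocalDate.of(2004, 12, 31), "money", searcher)); // [m1, m3]
    }
}
```

Each call to searcher.search stands in for opening and querying one
per-day index, so the Jan to Dec range in this sketch costs 366 such
searches (2004 is a leap year).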

Does that sound better? 


On Thu, 4 Nov 2004 17:26:21 +0100, Giulio Cesare Solaroli
<giulio.cesare@gmail.com> wrote:
> Hi Javier,
> 
> I suggest you build a single index, with all the information you
> need to find the mail you are looking for. You can then use
> Lucene alone to find your messages.
> 
> Giulio Cesare
>
> On Thu, 4 Nov 2004 17:00:35 +0100, javier muguruza <jmuguruza@gmail.com> wrote:
> > Hi,
> >
> > We are going to move from a just-in-time Perl-based search to using
> > Lucene in our project. I have to index emails (bodies and also
> > attachments). I keep all the bodies and attachments in the filesystem
> > for a long period of time. I have to find emails that fulfill certain
> > conditions; some of the conditions are taken care of at a different
> > level, so in the end I have a SUBSET of emails I have to run through
> > Lucene.
> >
> > I was assuming that the best way would be to create an index for each
> > email. Having a single index for a group of emails (say a day's worth
> > of email) seems too coarse-grained: imagine a day has 10000 emails,
> > and some queries will want to look in only a handful of them...
> > But the problem with having one index per email is the
> > massive number of emails... imagine having 100000 indexes.
> >
> > Anyway, any ideas about that? I just wanted to check whether someone
> > feels I am wrong.
> >
> > Thanks

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

