lucene-java-user mailing list archives

From Justin Swanhart <>
Subject Re: one huge index or many small ones?
Date Thu, 04 Nov 2004 17:28:18 GMT
First off, I think you should make a decision about what you want to
store in your index and how you go about searching it.

The less information you store in your index, the better, for
performance reasons.  If you can store the messages in an external
database you probably should.  I would create a table that contains a
clob and an associated id that can be used to get the message at any time.

Assuming the mail is in standard RFC 2822 message format:

I would suggest:
Unstored: Subject
Keyword: From
Keyword: To
Stored, Unindexed: ID  <-- this would be the ID of the message in your database
Unstored: Body
Keyword: Month
Keyword: Day
Keyword: Year
(and any other keywords you might use)

Your lucene query would then look something like: +(Subject:money Body:money) +Year:2004
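
A query string of that shape can be assembled from a search term and a year. The following is only a minimal sketch in plain Java with no Lucene dependency: the class and method names are hypothetical, the field names are the ones suggested above, and it does not escape characters that are special in Lucene query syntax.

```java
// Hypothetical helper, not part of Lucene: builds the query text shown above.
final class MailQueryBuilder {
    private MailQueryBuilder() {}

    // Returns "+(Subject:<term> Body:<term>) +Year:<year>".
    // Assumes <term> is a single word with no query-syntax characters.
    static String build(String term, int year) {
        if (term == null || term.trim().isEmpty()) {
            throw new IllegalArgumentException("term must not be empty");
        }
        String t = term.trim();
        return "+(Subject:" + t + " Body:" + t + ") +Year:" + year;
    }
}
```

The resulting string would then be handed to Lucene's QueryParser to obtain a Query object.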

Use the stored ID field to get the message contents from your database.
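
As a rough illustration of that lookup step, here is a sketch in which an in-memory map stands in for the database table (id, clob). The MessageStore class and its methods are assumptions made for the example; a real version would run a SQL query by ID instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the external message table described above.
final class MessageStore {
    // Maps message ID to message text; a database table in a real setup.
    private final Map<Long, String> bodies = new HashMap<>();

    void put(long id, String body) {
        bodies.put(id, body);
    }

    // Takes the stored ID field values of the search hits (as strings)
    // and returns the matching message texts in hit order.
    // IDs with no stored message are skipped.
    List<String> fetchAll(List<String> storedIds) {
        List<String> result = new ArrayList<>();
        for (String raw : storedIds) {
            String body = bodies.get(Long.parseLong(raw));
            if (body != null) {
                result.add(body);
            }
        }
        return result;
    }
}
```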

If you want to break your index down into multiple indexes, based on
some criterion such as time frame, you could do that too.  You would
then use a MultiSearcher or ParallelMultiSearcher to process the
multiple indexes.
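
If the indexes were split by month, for instance, the set of indexes to open for a given date range could be worked out as below. This is a sketch under assumptions: the one-index-per-month layout, the "mail-YYYY-MM" naming scheme and the class name are all made up for the example, and opening the searchers themselves is left out.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: picks which monthly indexes cover a date range.
final class MonthlyIndexSelector {
    private MonthlyIndexSelector() {}

    // Returns one index name per calendar month touched by [from, to],
    // oldest first, using the assumed "mail-YYYY-MM" naming scheme.
    static List<String> indexNamesFor(LocalDate from, LocalDate to) {
        if (to.isBefore(from)) {
            throw new IllegalArgumentException("to is before from");
        }
        List<String> names = new ArrayList<>();
        YearMonth last = YearMonth.from(to);
        for (YearMonth m = YearMonth.from(from); !m.isAfter(last); m = m.plusMonths(1)) {
            names.add("mail-" + m); // YearMonth prints as e.g. 2004-11
        }
        return names;
    }
}
```

One searcher per returned name would then be opened and the whole set passed to a MultiSearcher, so that only the indexes covering the requested period are searched.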

On Thu, 4 Nov 2004 18:03:49 +0100, javier muguruza <> wrote:
> Thanks Erik and Giulio for the fast reply.
> I am just starting to look at lucene so forgive me if I got some ideas
> wrong. I understand your concerns about one index per email. But
> having only one index is also (I guess) out of the question.
> I am building an email archive. Email will be kept indefinitely
> available for search, adding new email every day. Imagine a company
> with millions of emails per day (been there), keep it growing for
> years, adding stuff to the index while using it for searches
> continuously...
> That's why my idea is to decide on a time frame (a day, for example; the
> extreme would be an instant, that is a single email, my original idea)
> and build the index for all the email in that timeframe. After the
> timeframe is finished no more stuff will be ever added.
> Before the lucene search emails are selected based on other conditions
> (we store the from, to, date etc in database as well, and these
> conditions are enforced with a sql query first, so I would not need to
> enforce them in the lucene search again, also that query can be quite
> sophisticated and I guess it would not be easily possible to do it in
> lucene by itself). That first db step gives me a group of emails that
> maybe I have to further narrow down based on a lucene search (of body
> and attachment contents). Having an index for more than one email
> means that after the search I would have to get only the overlapping
> emails from the two searches...Maybe this is better than keeping the
> same info I have in the db in lucene fields as well.
> An example: I want all the email from john from Jan
> to Dec containing the word 'money'. I run the db query that returns a
> list with john's email for that period of time, then (let's assume I
> have one index per day) I iterate on every day, looking for emails
> that contain 'money'. From the results returned by lucene I keep only
> those that are also in the first list.
> Does that sound better?
> On Thu, 4 Nov 2004 17:26:21 +0100, Giulio Cesare Solaroli
> <> wrote:
> > Hi Javier,
> >
> > I suggest you build a single index, with all the information you
> > need to find the right mail you are looking for. You can then use
> > Lucene alone to find your messages.
> >
> > Giulio Cesare
> >
> > On Thu, 4 Nov 2004 17:00:35 +0100, javier muguruza <> wrote:
> > > Hi,
> > >
> > > We are going to move from a just-in-time perl based search to using
> > > lucene in our project. I have to index emails (bodies and also
> > > attachments). I keep in the filesystem all the bodies and attachments
> > > for a long period of time. I have to find emails that fulfill certain
> > > conditions; some of the conditions are taken care of at a different
> > > level, so in the end I have a SUBSET of emails I have to run through
> > > lucene.
> > >
> > > I was assuming that the best way would be to create an index for each
> > > email. Having a single index for a group of emails (say a day's worth
> > > of email) seems too coarse grained: imagine a day has 10000 emails,
> > > and some queries will want to look in only a handful of the
> > > emails... But the problem with having one index per email is the
> > > massive number of emails... imagine having 100000 indexes.
> > >
> > > Anyway, any idea about that? I just wanted to check whether someone
> > > feels I am wrong.
> > >
> > > Thanks
