lucene-java-user mailing list archives

From jt oob <>
Subject Re: Ways to search indexes
Date Thu, 04 Dec 2003 14:59:22 GMT
--- Dror Matalon <> wrote:
> On Wed, Dec 03, 2003 at 02:49:12PM +0000, jt oob wrote:

> > > Around 15 gigs. How many days of news?
> > 
> > Not sure how many days, but it's around 5 million postings.
> So each posting is roughly 3K. More than I would have thought, but not
> too surprising.
> The main reason I asked about how many days, is to get the sense of
> growth. 15 Gig is a big index, but to understand the performance
> repercussions the rate of growth is equally important. I suspect that by
> the time you hit 100 gigs, you'll have one of the biggest indexes around
> and you'll have to throw quite heavy hardware or distribute the load to
> get reasonable performance.

As I mentioned earlier, I am just treating each posting as plain text
at the moment. I expect smaller indexes once I separate out header
fields and body. The most common terms in the index are the standard
news headers "From", "Date" etc. I'm not sure how much bloat they add,
but it must be significant - not sure how many people would get any
useful info from searching for "From" anyway!
Next generation will hopefully have many of the common header fields
pulled out into document fields.
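
A minimal sketch of that header/body split, written in Python for brevity
(the same logic should port to Java directly). The name split_posting and
the lower-cased header keys are assumptions for illustration, not part of
Lucene. Each returned header would become its own document field, and only
the body would go into the full-text field.

```python
def split_posting(raw):
    """Split a raw news posting into (headers, body).

    Hypothetical helper: headers are everything before the first blank
    line, the body is everything after it.
    """
    # Normalise line endings, then cut at the first blank line.
    head, _, body = raw.replace("\r\n", "\n").partition("\n\n")

    headers = {}
    last = None
    for line in head.split("\n"):
        if line[:1] in (" ", "\t") and last is not None:
            # Continuation of a folded header: append to the previous one.
            headers[last] += " " + line.strip()
        elif ":" in line:
            name, _, value = line.partition(":")
            last = name.strip().lower()
            headers[last] = value.strip()
    return headers, body
```

With this in place the words "From" and "Date" never reach the body field,
so they stop being the most common terms there.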

The MultiSearcher is working perfectly :-)

My only concern is that some badly formed encoded attachments in some
news postings escape my attachment remover because they are invalid. What
sort of negative impact will long random character strings have on the
indexes?
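
One possible stopgap, sketched in Python under the assumption that leftover
encoded attachment data shows up as very long runs without whitespace: drop
any token over a length cutoff before the text reaches the analyzer. The
name strip_long_tokens and the 40-character cutoff are hypothetical choices,
not anything Lucene provides.

```python
MAX_TOKEN_LENGTH = 40  # assumed cutoff; tune against real postings


def strip_long_tokens(text, max_len=MAX_TOKEN_LENGTH):
    """Remove whitespace-delimited tokens longer than max_len.

    Whitespace is collapsed to single spaces, which should not matter
    for text that is only going to be tokenised and indexed.
    """
    return " ".join(tok for tok in text.split() if len(tok) <= max_len)
```

This would also throw away long URLs and similar legitimate strings, so the
cutoff is a trade-off rather than a fix for the attachment remover itself.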
