lucene-java-user mailing list archives

From Dror Matalon <d...@zapatec.com>
Subject Re: Ways to search indexes
Date Thu, 04 Dec 2003 19:54:47 GMT
On Thu, Dec 04, 2003 at 02:59:22PM +0000, jt oob wrote:
> --- Dror Matalon <dror@zapatec.com> wrote:
> > On Wed, Dec 03, 2003 at 02:49:12PM +0000, jt oob wrote:
> 
> > > > Around 15 gigs. How many days of news?
> > > 
> > > Not sure how many days, but it's around 5 million postings.
> > 
> > So each posting is roughly 3K. More than I would have thought, but
> > not too surprising.
> > The main reason I asked about how many days is to get a sense of
> > growth. 15 gigs is a big index, but to understand the performance
> > repercussions the rate of growth is equally important. I suspect that
> > by the time you hit 100 gigs, you'll have one of the biggest indexes
> > around and you'll have to throw quite heavy hardware at it or
> > distribute the load to get reasonable performance.
> 
> As I mentioned earlier, I am just treating each posting as plain text
> at the moment. I expect smaller indexes once I separate out header
> fields and body. The most common terms in the index are the standard
> news headers "From", "Date" etc. I'm not sure how much bloat they add,
> but it must be significant - not sure how many people would get any
> useful info from searching for "From" anyway!
> Next generation will hopefully have many of the common header fields
> pulled out into document fields.
> 
> The multisearcher is working perfectly :-)

Glad to hear.

> 
> My only concern is that some badly formed encoded attachments in some
> news postings escape my attachment remover because they are invalid.
> What sort of negative impact will long random character strings have
> on the indexes?

From my reading of the code, it looks like this is where the maximum size
of a token is defined:
  private static final int MAX_WORD_LEN = 255;

So, as far as I can tell, a string longer than 255 characters will not be
indexed as a single token.
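
If you would rather not rely on the tokenizer's limit, one option is to drop
overlong runs yourself before the text reaches the analyzer. The following is
only a minimal sketch in plain Java, not Lucene code: the class name
LongTokenFilter and the method dropLongRuns are hypothetical, and the cutoff
simply reuses the 255 value quoted above as an assumption.

```java
import java.util.StringJoiner;

// Hypothetical pre-indexing helper; not part of Lucene.
class LongTokenFilter {
    // Assumed cutoff, reusing the MAX_WORD_LEN value quoted above.
    static final int MAX_WORD_LEN = 255;

    /**
     * Returns the text with every whitespace-delimited run longer than
     * maxLen removed. Surviving runs are rejoined with single spaces,
     * so the original line breaks are not preserved.
     */
    static String dropLongRuns(String text, int maxLen) {
        StringJoiner kept = new StringJoiner(" ");
        for (String run : text.trim().split("\\s+")) {
            // split() yields one empty string for empty input; skip it.
            if (!run.isEmpty() && run.length() <= maxLen) {
                kept.add(run);
            }
        }
        return kept.toString();
    }
}
```

You would call dropLongRuns(postingText, LongTokenFilter.MAX_WORD_LEN) on each
posting before adding it to the index. A lower cutoff may suit news text
better, since real words are rarely anywhere near 255 characters.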


-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

