lucene-java-user mailing list archives

From "Rob Staveley (Tom)" <rstave...@seseit.com>
Subject RE: Seeing what's occupying all the space in the index
Date Fri, 26 May 2006 17:23:48 GMT
That's a really good idea, but I've got a total of 38 fields only. It is
true that some of them are empty, but that can't account for the bulk.

-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: 26 May 2006 17:50
To: java-user@lucene.apache.org
Subject: RE: Seeing what's occupying all the space in the index


are you by any chance using different field names for each document -- or do
you have a wide range of field names that aren't the same for each document?
... you mentioned indexing emails, email has a very loose header structure
that allows MTAs to add arbitrary "X" headers, are you converting every
header to an indexed field?

the reason i ask is that anytime you have an indexed field and you don't
OMIT_NORMS when you add the field, lucene allocates one byte per document
for that field to store the norm value -- even if most of those documents
don't have that field.

if you've got ~25,000 documents in your index, and you add a new document
with an indexed field no other document has, then you'll see at least a
25 KB increase in index size because of those norms (one byte for each of
the ~25,000 documents).
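
as a rough illustration of that arithmetic, here is a minimal sketch in
plain Java. it does not use any Lucene API; the class name NormsEstimate
and the method normBytes are made up for this example, and it assumes
exactly one byte per document per indexed field that keeps norms, as
described above. combining the 38 fields mentioned in the reply with the
hypothetical 25,000 documents is only for illustration.

```java
// Back-of-the-envelope estimate of norm storage.
// Assumption: one byte per document for every indexed field that keeps norms,
// whether or not a given document actually contains that field.
final class NormsEstimate {

    /** Estimated bytes used by norms for the whole index. */
    static long normBytes(long numDocs, long fieldsWithNorms) {
        return numDocs * fieldsWithNorms;
    }

    public static void main(String[] args) {
        long docs = 25_000L; // hypothetical index size from the example above

        // One previously unseen field name costs one byte per document.
        System.out.println(normBytes(docs, 1));

        // 38 distinct indexed fields, all keeping norms.
        System.out.println(normBytes(docs, 38));
    }
}
```

by this estimate a fixed set of 38 fields adds under a megabyte at that
document count, so norms would only dominate the index if the number of
distinct field names kept growing with the documents.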

(OMIT_NORMS is your friend).
