lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <hossman_luc...@fucit.org>
Subject RE: Seeing what's occupying all the space in the index
Date Fri, 26 May 2006 16:50:20 GMT

are you by any chance using different field names for each document -- or
do you have a wide range of field names that aren't the same for each
document? ... you mentioned indexing emails, email has a very loose header
structure that allows MTAs to add arbitrary "X" headers, are you
converting every header to an indexed field?

the reason i ask is that anytime you have an indexed field with and you
don't OMIT_NORMS when you add the field, lucene allocates one byte per
document for that field to store the norm value -- even most of those
documents don't have that field.

if you've got ~25,000 documents in your index, and you add a new document
with an indexed field no other document has, then you'll see at least a
25K increase in index size because of those norms.

(OMIT_NORMS is your friend).



: Date: Fri, 26 May 2006 15:50:43 +0100
: From: "Rob Staveley (Tom)" <rstaveley@seseit.com>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: RE: Seeing what's occupying all the space in the index
:
:  > I can't see how Luke is going to show me what's occupying most of my
: index.
:
: I do however notice that none of my stored fields are stored compressed.
: Presumably Field.Store COMPRESS is something that is new in Lucene 1.9 and
: wasn't available in 1.4.3??  However, it is still hard to see what's causing
: the index to grow by 25kB with each document I index.
:
: Is there anything I can learn from the index directory's file listing?
: Here's the top of file listing in descending order of size from the index
: directory:
:
: --------8<--------
: rob@dev:~/dat/indexd/index-1$ ls -lS | more
: total 95374800
: -rw-r--r--    1 rob dev 4449248724 May 26 00:32 _2fyfe.cfs
: -rw-r--r--    1 rob dev 2522952122 May 26 14:17 _2k5vi.fdt
: -rw-r--r--    1 rob dev 2413775516 May 26 01:16 _2g6l3.fdt
: -rw-r--r--    1 rob dev 2368881846 May 25 18:14 _2ehe7.fdt
: -rw-r--r--    1 rob dev 2344670598 May 25 16:31 _2dn69.fdt
: -rw-r--r--    1 rob dev 2324315860 May 25 14:10 _2cxea.fdt
: -rw-r--r--    1 rob dev 2259070168 May 25 10:28 _2aeb0.fdt
: -rw-r--r--    1 rob dev 2113143078 May 24 13:39 _24xxv.fdt
: -rw-r--r--    1 rob dev 2005876265 May 23 16:47 _21143.fdt
: -rw-r--r--    1 rob dev 1994658169 May 23 16:09 _20mgq.fdt
: -rw-r--r--    1 rob dev 1991402285 May 23 14:21 _20id9.fdt
: -rw-r--r--    1 rob dev 1973739578 May 23 12:17 _1zvpr.fdt
: -rw-r--r--    1 rob dev 1964392156 May 23 11:14 _1zjra.fdt
: -rw-r--r--    1 rob dev 1957195484 May 23 10:27 _1zanx.fdt
: -rw-r--r--    1 rob dev 1940435968 May 12 13:58 _1374v.cfs
: -rw-r--r--    1 rob dev 1932876050 May 23 08:34 _1yep1.fdt
: -rw-r--r--    1 rob dev 1908759860 May 22 22:26 _1xhs9.fdt
: -rw-r--r--    1 rob dev 1862224271 May 22 14:24 _1vxc8.fdt
: -rw-r--r--    1 rob dev 1775652952 May 21 19:15 _1slqr.fdt
: -rw-r--r--    1 rob dev 1674336305 May 20 10:10 _1oaje.fdt
: -rw-r--r--    1 rob dev 1641906176 May 17 12:24 _1f3lp.cfs
: -rw-r--r--    1 rob dev 1626645390 May 19 16:39 _1mjaa.fdt
: -rw-r--r--    1 rob dev 1412089155 May 17 12:21 _1f3lp.fdt
: -rw-r--r--    1 rob dev 1400399872 May 19 16:51 _1mjaa.cfs
: -rw-r--r--    1 rob dev 1090611130 May 12 13:51 _1374v.fdt
: -rw-r--r--    1 rob dev 1059463168 May 16 09:24 _1azb3.fdt
: -rw-r--r--    1 rob dev 1052419072 May 11 16:36 _1168u.cfs
: -rw-r--r--    1 rob dev 1034522679 May 11 16:34 _1168u.fdt
: -rw-r--r--    1 rob dev 1023033344 May 23 14:51 _20m4q.fdt
: -rw-r--r--    1 rob dev 857224192 May  4 18:16 _lluq.cfs
: -rw-r--r--    1 rob dev 821198047 May 26 14:21 _2k5vi.prx
: -rw-r--r--    1 rob dev 808694510 May  8 14:09 _sunn.fdt
: -rw-r--r--    1 rob dev 786164503 May 26 01:26 _2g6l3.prx
: -rw-r--r--    1 rob dev 780030917 May 26 14:21 _2k5vi.frq
: -rw-r--r--    1 rob dev 772484883 May 25 18:19 _2ehe7.prx
: -rw-r--r--    1 rob dev 763351845 May 25 16:41 _2dn69.prx
: -rw-r--r--    1 rob dev 755794097 May 25 14:14 _2cxea.prx
: -rw-r--r--    1 rob dev 745972375 May 26 01:26 _2g6l3.frq
: -rw-r--r--    1 rob dev 732753582 May 25 10:45 _2aeb0.prx
: -rw-r--r--    1 rob dev 732496935 May 25 18:19 _2ehe7.frq
: -rw-r--r--    1 rob dev 724428884 May 25 16:41 _2dn69.frq
: -rw-r--r--    1 rob dev 719733760 May 25 09:49 _29vyk.fdt
: -rw-r--r--    1 rob dev 717613127 May 25 14:14 _2cxea.frq
: -rw-r--r--    1 rob dev 696849854 May 25 10:45 _2aeb0.frq
: -rw-r--r--    1 rob dev 686227498 May 24 13:59 _24xxv.prx
: --------8<--------
:



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message