lucene-java-user mailing list archives

From "Rob Staveley (Tom)" <rstave...@seseit.com>
Subject RE: Seeing what's occupying all the space in the index
Date Fri, 26 May 2006 17:14:04 GMT
> Is there anything I can learn from the index directory's file listing?

Running this nasty little BASH one-liner...

$ for i in `ls * | perl -nle 'if (/^.+(\..+)/) {print $1;}' | sort | uniq`; do
    ls -l *$i | awk '{SUM = SUM + $5} END {if (SUM > 1e10) {print "'$i': ", SUM}}'
  done

... I see....

	.cfs:  1.23155e+10
	.fdt:  5.06108e+10
	.frq:  1.27472e+10
	.prx:  1.3444e+10

That means I have 98 GB of files, with:

	51 GB devoted to field data (.fdt),
	13 GB devoted to term positions (.prx),
	13 GB devoted to term frequencies (.frq),
	12 GB devoted to compound files (.cfs), which bundle a segment's index files.
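
A more readable way to get the same kind of per-extension tally is a single
awk pass over the listing. This is only a sketch: the function name
size_by_ext is made up for illustration, it assumes file names without
whitespace (as Lucene index files normally have), and it reports every
extension rather than only those above 1e10 bytes.

```shell
# Sum file sizes per extension for the files in one directory (default: cwd).
# Hypothetical helper; assumes names without whitespace, e.g. _a.fdt, _a.prx.
size_by_ext() {
    ls -l "${1:-.}" | awk '
        $1 ~ /^-/ {                          # regular files only
            name = $NF
            if (match(name, /\.[^.]+$/)) {   # last ".ext"; skips names with no dot
                ext = substr(name, RSTART)
                sum[ext] += $5               # column 5 of ls -l is the size in bytes
            }
        }
        END {
            for (ext in sum)
                printf "%s\t%.0f\t%.1f GB\n", ext, sum[ext], sum[ext] / 1e9
        }' | sort
}

# Usage: size_by_ext /path/to/index
```

Each output line is the extension, the total in bytes, and the same total in
decimal GB, separated by tabs.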

Does that seem reasonable, bearing in mind that I have indexed only 4.3
million Lucene documents? That's 22.8 kB per Lucene document. Apart from a
300-character synopsis, the fields are all much shorter than 100 characters,
and yet this suggests that the index is using about 600 bytes per field.
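
As a rough check of the per-document arithmetic, the division can be redone
in awk. The helper name kb_per_doc is hypothetical. The first call uses the
98 GB total quoted above. The second uses only the sum of the four extensions
listed (about 89 GB); the one-liner prints only extensions above 1e10 bytes,
so smaller extensions may account for the difference, though that is an
assumption.

```shell
# kB (1000 bytes) per document, given total bytes and document count.
# Hypothetical helper for checking the figures quoted in this message.
kb_per_doc() {   # usage: kb_per_doc TOTAL_BYTES NUM_DOCS
    awk -v total="$1" -v docs="$2" 'BEGIN { printf "%.1f\n", total / docs / 1000 }'
}

kb_per_doc 98e9 4.3e6          # quoted total: prints 22.8
kb_per_doc 8.91175e10 4.3e6    # four listed extensions only: prints 20.7
```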

