lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Barry Forrest" <bforres...@gmail.com>
Subject Re: Optimizing index takes too long
Date Mon, 12 Nov 2007 22:42:33 GMT
On Nov 12, 2007 1:15 PM, J.J. Larrea <jjl@panix.com> wrote:

>
> 2. Since the full document and its longer bibliographic subfields are
> being indexed but not stored, my guess is that the large size of the index
> segments is due to the inverted index rather than the stored data fields.
>  But you can roughly verify by checking the size of the files in the index,
> with Luke's Files tab or simply an ls -l.  For example .fdt files are stored
> data while .tis are the inverted index; see
> http://lucene.apache.org/java/docs/fileformats.html  And if you have .cfs
> files...


OK here are some of the details of one index directory.  The number of
indexed documents is approximately 1.5million.

$ ls -1 |wc -l
29542

$ du -hs .
90G     .

$ // Show space used (in GB) by file extension.
$ for filetype in `ls -1 | sed -r "s/.*\.(.*)/\1/" | sort -u` ;\
  do echo -n "filetype=$filetype: " ; (echo -n "scale=2; \
  (" ; (for size in `ls -l *.$filetype| cut -c 24-34`; do \
  echo -n "$size+"; done) ; echo "0)/10^9") | bc; done

filetype=fdt: .44
filetype=fdx: .01
filetype=fnm: .01
filetype=frq: 15.95
filetype=lock: 0
filetype=nrm: .03
filetype=prx: 70.94
filetype=tii: .12
filetype=tis: 8.79


So most of the space is occupied with .prx files.

Thanks
Barry

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message