lucene-general mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Index Ratio
Date Thu, 25 Jun 2009 02:17:15 GMT
That sounds a bit too good to be plausible.

Do retrievals work?  Are you sure that you are indexing all of the fields of
interest?

Is maxDoc() plausible?

Do the term vectors for each field look right?

(It also helps to have some test documents with extraordinary values in key
fields so that you can verify indexing and retrieval.  Some people call these
tracer bullets, and it is handy to have at least one such tracer per input
file.  You can also record corpus metadata this way (n documents for file f).
If you put a special field on these documents, you can include or exclude
them from your retrievals at essentially no cost.)
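
A rough sketch of the tracer idea, in plain Python rather than the Lucene API,
so every name here (ToyIndex, make_tracer, the "is_tracer" field, the
"zzqtracer" token) is made up for illustration. It adds one tracer per input
file, stores the per-file document count on it, and uses the marker field to
keep tracers out of ordinary retrievals:

```python
from collections import defaultdict

TRACER_FIELD = "is_tracer"  # hypothetical marker field name


def make_tracer(source_file, doc_count):
    """Build one tracer document carrying per-file metadata."""
    return {
        "body": "zzqtracer",          # a token that real documents should never contain
        "source_file": source_file,
        "doc_count": str(doc_count),  # n documents for file f
        TRACER_FIELD: "yes",
    }


class ToyIndex:
    """A tiny in-memory inverted index, only for demonstrating the idea."""

    def __init__(self):
        self.docs = []
        self.postings = defaultdict(set)  # (field, term) -> set of doc ids

    def add(self, doc):
        doc_id = len(self.docs)
        self.docs.append(doc)
        for field, value in doc.items():
            for term in value.lower().split():
                self.postings[(field, term)].add(doc_id)

    def max_doc(self):
        return len(self.docs)

    def search(self, field, term, include_tracers=False):
        hits = set(self.postings.get((field, term.lower()), set()))
        if not include_tracers:
            # Excluding tracers is one set difference on the marker field.
            hits -= self.postings.get((TRACER_FIELD, "yes"), set())
        return sorted(hits)


index = ToyIndex()
batches = {"a.zip": ["hello world", "lucene index"], "b.mbox": ["hello again"]}
for name, texts in batches.items():
    for text in texts:
        index.add({"body": text, "source_file": name})
    index.add(make_tracer(name, len(texts)))

# Sanity checks: every tracer is retrievable, and its stored count matches
# the number of ordinary documents indexed from the same file.
tracer_ids = index.search("body", "zzqtracer", include_tracers=True)
for doc_id in tracer_ids:
    tracer = index.docs[doc_id]
    found = index.search("source_file", tracer["source_file"])
    assert len(found) == int(tracer["doc_count"])
```

In a real index the same checks would be queries against the marker field and
the tracer token; if a tracer is missing, or its count disagrees with what the
index returns for that file, some part of the input was probably not indexed.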

On Wed, Jun 24, 2009 at 5:47 PM, pof <MelbourneBeerBaron@gmail.com> wrote:

>
> Hi, I just completed a batch test index of ~1100 documents of various file
> types and I noticed that the original documents take up about 145MB but my
> index is only 1.7MB. I remember reading somewhere that the typical
> compression rate is about 20-30% or something, but mine is a little over
> 1%!
> I'm not complaining or anything. It just struck me as odd, especially as I
> have a lot of archive files and emails with attachments that I parse as well.
> Has anyone else experienced something like this? I'm just curious.
>
> Cheers. Brett.


-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
