lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index
Date Mon, 21 Jul 2008 10:33:31 GMT


Michael McCandless updated LUCENE-1340:

    Attachment: LUCENE-1340.patch

OK good progress eks!

I started from your latest patch and made some further changes:

  * Fixed DW to not consume RAM writing prx if omitTf==true

  * Fixed FreqProxTermsWriter to not create *.prx file if all fields
    omit term freq.  I added hasProx to SegmentInfo, and changed the
    index file format to store this new boolean.

  * Fixed FreqProxTermsWriterPerField to not write prox into the RAM
    buffer if we will omitTf on flushing the segment to disk.  This
    makes the RAM buffer efficient (no bytes wasted on prox when
    omitTf==true for a field).

  * Added more test cases to TestOmitTf

  * Small whitespace, comment changes

The one place I know of that will still waste bytes is the term dict
(TermInfo): it stores a long proxPointer on disk (in *.tii,*.tis) and
also in memory because we load *.tii into RAM.  For fields with
omitTf==true this will always be unused, and we could save a lot of
disk/RAM if we didn't waste it.

Unfortunately, I think it's too big a change to try to fix this now; I
think we should wait until flex indexing is online.  I wonder how we
can solve it at that point: maybe we should change TermInfo to be
"column stride", meaning, there are separate arrays storing the values
for all terms (i.e. long[] proxPointers, long[] freqPointers, etc.).
This would also fit the "pluggable" model better, meaning any plugin
can store new stuff (its own arrays) per-term.
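
To make the "column stride" idea concrete, here is a minimal sketch of
what such a layout might look like.  Everything in it is hypothetical
(the class name ColumnStrideTermInfos, the term ordinal used as the
array index, the pointerBytes() helper); it is not the existing TermInfo
and not a proposed API, only an illustration of how the proxPointers
column could be left out entirely when omitTf==true.

```java
/**
 * Hypothetical sketch of a "column stride" term dictionary: one array per
 * attribute instead of one TermInfo object per term. Not Lucene's actual code.
 */
final class ColumnStrideTermInfos {
    private final int[] docFreqs;       // document frequency per term
    private final long[] freqPointers;  // offset into the freq data per term
    private final long[] proxPointers;  // offset into the prox data; null if prox is omitted

    ColumnStrideTermInfos(int termCount, boolean omitTf) {
        docFreqs = new int[termCount];
        freqPointers = new long[termCount];
        // The whole column is skipped when the field omits term freq/positions.
        proxPointers = omitTf ? null : new long[termCount];
    }

    /** Stores the values for one term; proxPointer is ignored when prox is omitted. */
    void set(int termOrd, int docFreq, long freqPointer, long proxPointer) {
        docFreqs[termOrd] = docFreq;
        freqPointers[termOrd] = freqPointer;
        if (proxPointers != null) {
            proxPointers[termOrd] = proxPointer;
        }
    }

    boolean hasProx() {
        return proxPointers != null;
    }

    int docFreq(int termOrd) {
        return docFreqs[termOrd];
    }

    long freqPointer(int termOrd) {
        return freqPointers[termOrd];
    }

    long proxPointer(int termOrd) {
        if (proxPointers == null) {
            throw new IllegalStateException("prox was omitted for this field");
        }
        return proxPointers[termOrd];
    }

    /** Bytes used by the pointer columns alone (array headers not counted). */
    long pointerBytes() {
        long bytes = (long) freqPointers.length * Long.BYTES;
        if (proxPointers != null) {
            bytes += (long) proxPointers.length * Long.BYTES;
        }
        return bytes;
    }
}
```

With this layout, a field with omitTf==true pays nothing for
proxPointers, and a plugin could add its own per-term array next to
the existing ones without touching the others.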

> Make it posible not to include TF information in index
> ------------------------------------------------------
>                 Key: LUCENE-1340
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch,
>   Original Estimate: 24h
>  Remaining Estimate: 24h
> Term Frequency is typically not needed for all fields; some CPU (reading one VInt less
> and one X>>>1...) and IO can be spared by making pure boolean fields possible in Lucene.
> This topic has already been discussed and accepted as a part of Flexible Indexing... This
> issue tries to push things forward a bit faster, as I have some concrete customer demands.
> Benefits can be expected for fields that are typical candidates for Filters, enumerations,
> user rights, IDs or very short "texts", phone numbers, zip codes, names...
> Status: just passed the standard tests (compatibility), committed for early review; I have
> not tried the new feature, and it is missing some asserts and one or two unit tests.
> Complexity: simpler than expected
> Can be used via omitTf() (whoever used omitNorms() will know where to find it :)
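
The "one VInt less and one X>>>1" remark in the description can be
illustrated with a small sketch.  It assumes the usual postings
encoding as I understand it: with term freq stored, each entry writes
the doc delta shifted left by one, with the low bit set when freq==1
and otherwise followed by freq as a separate VInt; with omitTf the doc
delta is written unshifted and no freq follows.  The class and method
names (FreqEncodingSketch, encode) are made up for this example and
are not taken from the patch.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical illustration of postings encoding with and without term freq. */
final class FreqEncodingSketch {

    /** Writes an int in variable-length format: 7 bits per byte, high bit = "more". */
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    /** Encodes sorted docIds (and freqs unless omitTf) the way the lead-in assumes. */
    static byte[] encode(int[] docIds, int[] freqs, boolean omitTf) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastDoc = 0;
        for (int k = 0; k < docIds.length; k++) {
            int delta = docIds[k] - lastDoc;
            lastDoc = docIds[k];
            if (omitTf) {
                // No freq at all: the reader needs neither the shift nor a second VInt.
                writeVInt(out, delta);
            } else if (freqs[k] == 1) {
                // Low bit set means freq == 1; the reader recovers delta via code >>> 1.
                writeVInt(out, (delta << 1) | 1);
            } else {
                // Low bit clear means the freq follows as its own VInt.
                writeVInt(out, delta << 1);
                writeVInt(out, freqs[k]);
            }
        }
        return out.toByteArray();
    }
}
```

Besides skipping the extra VInt, the unshifted delta keeps one more
bit per byte for the doc delta itself, so some deltas fit in fewer
bytes when omitTf is on.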

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
