lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-1340) Make it possible not to include TF information in index
Date Sat, 19 Jul 2008 09:34:31 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1340:
---------------------------------------

    Attachment: LUCENE-1340.patch

Thanks eks, that was fast -- I think you set a new record!

The patch looks good, though we definitely need some solid unit tests
here.  I made some small (whitespace, spelling, naming) corrections &
attached a new rev of the patch.

One question I have: right now if a single field has mixed true/false
for omitTf, you set it to false, meaning we start storing the term
freq, positions and payloads again.  Can/should we do the reverse
instead?  If we did, we could make some further optimizations, e.g. right
now we consume RAM storing all positions/payloads on a field that has
omitTf=true on the possibility that we may still see omitTf=false in the
same session.
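
To make the two options concrete, here is a minimal sketch of how a per-field flag could be resolved when documents disagree. The class and method names are hypothetical and are not taken from the patch or from Lucene's FieldInfo code; only the two rules themselves come from the discussion above.

```java
// Hypothetical sketch: names are illustrative, not Lucene's actual API.
final class OmitTfMergeSketch {

    /** Rule described for the current patch: any omitTf=false wins. */
    static boolean falseWins(boolean fieldSoFar, boolean incoming) {
        return fieldSoFar && incoming;
    }

    /** The reverse rule asked about above: any omitTf=true wins. */
    static boolean trueWins(boolean fieldSoFar, boolean incoming) {
        return fieldSoFar || incoming;
    }

    /** Folds the flags seen for one field, in order, under the chosen rule. */
    static boolean resolve(boolean[] flagsSeen, boolean useTrueWins) {
        boolean result = flagsSeen[0];
        for (int i = 1; i < flagsSeen.length; i++) {
            result = useTrueWins
                    ? trueWins(result, flagsSeen[i])
                    : falseWins(result, flagsSeen[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        boolean[] mixed = {true, true, false};
        System.out.println("false wins: omitTf=" + resolve(mixed, false));
        System.out.println("true wins:  omitTf=" + resolve(mixed, true));
    }
}
```

Under the "true wins" rule the flag can never return to false once it has been set, which is what would let the indexer stop buffering positions and payloads for that field as soon as it sees omitTf=true.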

With this patch we still store the *.prx bytes for a field with
omitTf=true.  Can you fix that?  I think in FreqProxTermsWriter you
can simply not write any bytes to the proxOut; likewise in
SegmentMerger and SegmentTermPositions, don't try to read bytes from
the prx file if omitTf==true.
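
As a rough illustration of the write-side change, the sketch below keeps two in-memory streams standing in for the freq and prox outputs and skips the prox stream entirely when omitTf is set. This is not FreqProxTermsWriter code: the class, its fields, and the one-byte-per-value encoding are simplifying assumptions (values are assumed to be below 128; real code would use VInts).

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch: stands in for the freq/prox outputs, not Lucene's writer.
final class ProxSkipSketch {
    final ByteArrayOutputStream freqOut = new ByteArrayOutputStream();
    final ByteArrayOutputStream proxOut = new ByteArrayOutputStream();

    /** Records one posting; positions must be sorted ascending. */
    void addPosting(int docDelta, int[] positions, boolean omitTf) {
        freqOut.write(docDelta);
        if (omitTf) {
            // Boolean field: no frequency and no position bytes at all.
            return;
        }
        freqOut.write(positions.length);
        int lastPos = 0;
        for (int pos : positions) {
            proxOut.write(pos - lastPos); // position delta
            lastPos = pos;
        }
    }
}
```

A reader would need the mirror-image check: when omitTf is true for the field, it must not try to consume bytes from the prox stream, since none were written.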

I'd also be curious about what gains in index size & filter
performance we see with these new boolean fields.


> Make it possible not to include TF information in index
> -------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term frequency is typically not needed for all fields; some CPU (reading one VInt less
> and one X>>>1...) and IO can be spared by making pure boolean fields possible in Lucene.
> This topic has already been discussed and accepted as a part of Flexible Indexing... This
> issue tries to push things forward a bit faster, as I have some concrete customer demands.
> Benefits can be expected for fields that are typical candidates for Filters, enumerations,
> user rights, IDs or very short "texts", phone numbers, zip codes, names...
> Status: just passed the standard tests (compatibility), committed for early review. I have
> not tried the new feature; it is missing some asserts and one or two unit tests.
> Complexity: simpler than expected
> Can be used via omitTf() (whoever has used omitNorms() will know where to find it :)
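
The "one VInt less and one X>>>1" remark in the description can be sketched as follows. The encoding with term frequencies follows the freq-file layout as I understand it: the doc delta is shifted left by one, the low bit flags freq == 1, and otherwise the frequency follows as a separate VInt. The omitTf variant, which writes the plain doc delta, is an assumption about what the patch does. Class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of postings encoding with and without term frequencies.
final class FreqEncodingSketch {

    /** Writes a non-negative int as a VInt: 7 bits per byte, low-order bits first. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** With TF: delta shifted left by one; low bit set means freq == 1. */
    static byte[] encodeWithTf(int[] docs, int[] freqs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastDoc = 0;
        for (int i = 0; i < docs.length; i++) {
            int delta = docs[i] - lastDoc;
            lastDoc = docs[i];
            if (freqs[i] == 1) {
                writeVInt(out, (delta << 1) | 1);
            } else {
                writeVInt(out, delta << 1);
                writeVInt(out, freqs[i]);
            }
        }
        return out.toByteArray();
    }

    /** Without TF (assumed layout): the plain doc delta, no shift, no freq. */
    static byte[] encodeOmitTf(int[] docs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastDoc = 0;
        for (int doc : docs) {
            writeVInt(out, doc - lastDoc);
            lastDoc = doc;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int[] docs = {5, 70, 200};
        int[] freqs = {1, 3, 1};
        System.out.println("with TF: " + encodeWithTf(docs, freqs).length + " bytes");
        System.out.println("omitTf:  " + encodeOmitTf(docs).length + " bytes");
    }
}
```

The reader saves the matching work: it no longer needs the unsigned shift to recover the delta or the conditional read of a second VInt. Dropping the shift also leaves one more bit per delta before the VInt spills into an extra byte.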

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

