lucene-dev mailing list archives

From "Eks Dev (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index
Date Sat, 26 Jul 2008 09:47:31 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617140#action_12617140 ]

Eks Dev commented on LUCENE-1340:
---------------------------------

We finished our tests.

Index without omitTf():
- 87 million documents, 2 indexed fields, one stored field
- Unique terms in index: 2.5 million
- Average field lengths in tokens: 3.3 and 5.5 (very short fields)
- On-disk size: 3.8 GB total, including the stored field
 
Queries under test:
- BooleanQuery in all shapes and forms (disjunctive, conjunctive, nested, with minNumberShouldMatch()),
  with a lot of clauses (5-100).
- Filters used: yes

Test scope: regression with 30k queries on the same index, built once with omitTf(true) and once with omitTf(false).

Result:

- The queries returned 100% identical hits (full recall tested, all hits checked)!

- Index size reduction (not including the stored field!): 7% (short documents => fewer positions
than in Mike's case)

- Query performance: 5.2% faster, but the index was loaded as RAMIndex (an on-disk setup should
gain even more, due to the reduced IO for reading postings)

- Indexing performance (FSDisk!): 13% faster

Also, we compared omitTf(false) with this patch against lucene.jar without this patch: no changes
whatsoever.

From my perspective, this is good to go into production. At least for our usage of Lucene,
there are no differences with omitTf(true)...

> One more thing here: since the tiis are loaded into RAM, that unused proxPointer wastes
> 8 bytes for each indexed term. For indices with a lot of terms this can add up to a lot of
> wasted RAM. But still I think we should wait and fix this as part of flexible indexing, when
> we maybe refactor the TermInfos to be "column stride" instead.

I am more than happy with the results, no need to squeeze the last bit out of it right now.

Mike, thanks again for the great work! 
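
To put the quoted proxPointer remark into numbers, here is a rough back-of-the-envelope sketch. It assumes 8 bytes per pointer (as stated in the quote), the 2.5 million unique terms of the test index above, and, as a further assumption, that only every 128th term is held in the in-memory term index (Lucene's default term index interval). The class and method names are hypothetical.

```java
// Rough estimate of RAM spent on unused proxPointer values in the in-memory term index.
// Assumption: one 8-byte pointer per term that is held in RAM.
class ProxPointerWaste {
    static final int BYTES_PER_POINTER = 8; // one Java long

    // indexInterval = 1 models "every term is held in RAM" (worst case);
    // indexInterval = 128 models sampling every 128th term into the term index.
    static long wastedBytes(long uniqueTerms, int indexInterval) {
        long termsInRam = (uniqueTerms + indexInterval - 1) / indexInterval; // round up
        return termsInRam * BYTES_PER_POINTER;
    }

    public static void main(String[] args) {
        long uniqueTerms = 2_500_000L; // unique terms reported for the test index
        System.out.println("interval 128: " + wastedBytes(uniqueTerms, 128) + " bytes");
        System.out.println("interval 1:   " + wastedBytes(uniqueTerms, 1) + " bytes");
    }
}
```

Under these assumptions the waste is on the order of 150 KB with the sampled term index and 20 MB if every term carried the pointer in RAM, which is consistent with not needing to squeeze this out right now for an index of this shape.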



> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch,
>                      LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term frequency is typically not needed for all fields; some CPU (reading one VInt less
> and one X>>>1...) and IO can be spared by making pure boolean fields possible in Lucene.
> This topic has already been discussed and accepted as a part of Flexible Indexing... This
> issue tries to push things forward a bit faster, as I have some concrete customer demands.
> Benefits can be expected for fields that are typical candidates for Filters, enumerations,
> user rights, IDs or very short "texts", phone numbers, zip codes, names...
> Status: just passed the standard tests (compatibility), committed for early review; I have
> not tried the new feature; missing some asserts and one or two unit tests.
> Complexity: simpler than expected
> Can be used via omitTf() (whoever used omitNorms() will know where to find it :)
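
The "one VInt less and one X>>>1" saving mentioned in the description can be illustrated with a simplified sketch of a postings encoding. It assumes the classic layout in which each doc delta is shifted left by one bit and the low bit flags a frequency of 1, with any other frequency following as its own VInt. This is not the code from the patch; the class and method names (PostingsSketch, encodeWithTf, encodeDocsOnly, decodeDocsWithTf) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

// Simplified sketch: postings for one term, with and without term frequencies.
class PostingsSketch {
    // VInt: 7 payload bits per byte, high bit set while more bytes follow.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static int readVInt(ByteArrayInputStream in) {
        int b = in.read();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.read();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // With TF: (delta << 1) | 1 when freq == 1, else (delta << 1) followed by freq.
    static byte[] encodeWithTf(int[] docs, int[] freqs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        for (int i = 0; i < docs.length; i++) {
            int delta = docs[i] - last;
            last = docs[i];
            if (freqs[i] == 1) {
                writeVInt(out, (delta << 1) | 1);
            } else {
                writeVInt(out, delta << 1);
                writeVInt(out, freqs[i]);
            }
        }
        return out.toByteArray();
    }

    // Without TF: the plain doc delta, no shift and no frequency VInt.
    static byte[] encodeDocsOnly(int[] docs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        for (int doc : docs) {
            writeVInt(out, doc - last);
            last = doc;
        }
        return out.toByteArray();
    }

    // Reading doc IDs back from the TF encoding costs the extra shift and,
    // for freq != 1, the extra VInt read that a docs-only field avoids.
    static int[] decodeDocsWithTf(byte[] bytes, int count) {
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        int[] docs = new int[count];
        int doc = 0;
        for (int i = 0; i < count; i++) {
            int code = readVInt(in);
            doc += code >>> 1;
            if ((code & 1) == 0) {
                readVInt(in); // explicit frequency, ignored here
            }
            docs[i] = doc;
        }
        return docs;
    }

    public static void main(String[] args) {
        int[] docs = {5, 70, 200, 1000};
        int[] freqs = {1, 3, 1, 2};
        System.out.println("with TF:   " + encodeWithTf(docs, freqs).length + " bytes");
        System.out.println("docs only: " + encodeDocsOnly(docs).length + " bytes");
    }
}
```

For the four sample postings in main, the TF encoding takes 9 bytes and the docs-only encoding 6. The gap comes both from the dropped frequency VInts and from deltas that fit into fewer bytes once they are no longer shifted.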

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

