lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-4599) Compressed term vectors
Date Fri, 07 Dec 2012 16:27:21 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4599:
---------------------------------

    Attachment: LUCENE-4599.patch

Initial patch. It makes term vectors behave like Lucene 4.1 stored fields: one index file,
which is loaded into memory in a memory-efficient way, and one data file that stores the actual
term vectors (so 2 files instead of the 3 used by the current term vectors impl).

All core tests pass except TestIndexWriter.testEmptyDirRollback, which fails because it expects
term vectors to use 3 files.

This is still a work in progress. I still need to:
 - add tests that try to exercise all branches,
 - override the default merge(MergeState) impl.

I've tested this patch against 100,000 docs from the 1K Wikipedia dump, and term vectors were
~20% smaller (I should try against a corpus with bigger docs to get more relevant results).

If you have ideas for compressing term vectors efficiently, they are welcome! Currently this patch
does nothing fancy and stores each term followed by its own data, sequentially:
{code}
term1 - positions for term1 - offsets for term1 - payloads for term1 - term2 - ...{code}
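
To make the sequential layout concrete, here is a minimal sketch in plain Java. It is not the patch's actual format: the class name, the method signature and the fixed-width int encoding are all assumptions chosen for illustration, and payloads are left out to keep it short.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only; none of these names come from the patch.
public class SequentialTermVectorSketch {

    // Writes: termCount, then for each term:
    //   term, freq, positions..., (startOffset, endOffset) pairs...
    // Payloads are omitted in this sketch.
    static byte[] encode(String[] terms, int[][] positions,
                         int[][] startOffsets, int[][] endOffsets) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(terms.length);
            for (int t = 0; t < terms.length; t++) {
                out.writeUTF(terms[t]);              // term
                int freq = positions[t].length;
                out.writeInt(freq);                  // number of occurrences
                for (int p : positions[t]) {
                    out.writeInt(p);                 // positions for this term
                }
                for (int i = 0; i < freq; i++) {
                    out.writeInt(startOffsets[t][i]); // offsets for this term
                    out.writeInt(endOffsets[t][i]);
                }
            }
        } catch (IOException e) {
            // Not expected when writing to an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

With this layout the data of a term is interleaved with the term itself, so every term with a frequency of 1 still pays for its own small block of positions and offsets.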

Given that many terms are likely to have a frequency of 1, it might be more efficient to pack
the positions/offsets of several terms together(?)
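
One way to read that idea, sketched in plain Java: keep the per-term frequencies in one array and concatenate the positions of all terms into a single array, which could then be compressed as one block. This is only an assumption about what such packing might look like, not what the patch does, and the class and method names are hypothetical.

```java
// Hypothetical illustration only; none of these names come from the patch.
public class PackedPositionsSketch {

    // One frequency per term: the number of positions that term has.
    static int[] frequencies(int[][] positionsPerTerm) {
        int[] freqs = new int[positionsPerTerm.length];
        for (int t = 0; t < positionsPerTerm.length; t++) {
            freqs[t] = positionsPerTerm[t].length;
        }
        return freqs;
    }

    // Concatenates the positions of all terms, in term order.
    static int[] flatten(int[][] positionsPerTerm) {
        int total = 0;
        for (int[] p : positionsPerTerm) {
            total += p.length;
        }
        int[] flat = new int[total];
        int next = 0;
        for (int[] p : positionsPerTerm) {
            System.arraycopy(p, 0, flat, next, p.length);
            next += p.length;
        }
        return flat;
    }

    // Rebuilds the per-term positions from the frequencies and the flat array.
    static int[][] unflatten(int[] freqs, int[] flat) {
        int[][] perTerm = new int[freqs.length][];
        int next = 0;
        for (int t = 0; t < freqs.length; t++) {
            perTerm[t] = new int[freqs[t]];
            System.arraycopy(flat, next, perTerm[t], 0, freqs[t]);
            next += freqs[t];
        }
        return perTerm;
    }
}
```

The frequencies are enough to split the flat array back into per-term positions, so no per-term framing is needed inside the packed block; the same approach would apply to offsets.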
                
> Compressed term vectors
> -----------------------
>
>                 Key: LUCENE-4599
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4599
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs, core/termvectors
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.1
>
>         Attachments: LUCENE-4599.patch
>
>
> We should have codec-compressed term vectors similarly to what we have with stored fields.

