Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Date: Fri, 7 Dec 2012 16:27:21 +0000 (UTC)
From: "Adrien Grand (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Message-ID: <JIRA.12622920.1354896765380.3871.1354897641067@arcas>
In-Reply-To: <JIRA.12622920.1354896765380@arcas>
References: <JIRA.12622920.1354896765380@arcas>
Subject: [jira] [Updated] (LUCENE-4599) Compressed term vectors
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4599:
---------------------------------

    Attachment: LUCENE-4599.patch

Initial patch. It makes term vectors behave like Lucene 4.1 stored fields: one index file, which is loaded into memory in a memory-efficient way, and one data file that stores the actual term vectors (so 2 files instead of the 3 used by the current term vectors impl).
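
To illustrate the index file / data file split, here is a minimal Java sketch. It assumes documents are grouped into chunks and that the in-memory index keeps, for each chunk, its first doc ID and the offset of that chunk in the data file. The names (ChunkIndexSketch, docBases, startPointers) are hypothetical and are not taken from the patch, which may organize and pack this information differently.

```java
import java.util.Arrays;

/** Hypothetical sketch: in-memory index mapping a doc ID to a chunk offset in the data file. */
public class ChunkIndexSketch {
    private final int[] docBases;       // first doc ID of each chunk, in ascending order (assumed)
    private final long[] startPointers; // offset of each chunk in the data file (assumed)

    public ChunkIndexSketch(int[] docBases, long[] startPointers) {
        if (docBases.length != startPointers.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        this.docBases = docBases.clone();
        this.startPointers = startPointers.clone();
    }

    /** Returns the data-file offset of the chunk that contains docID. */
    public long startPointer(int docID) {
        if (docBases.length == 0 || docID < docBases[0]) {
            throw new IllegalArgumentException("docID out of range: " + docID);
        }
        int idx = Arrays.binarySearch(docBases, docID);
        if (idx < 0) {
            // Not an exact chunk start: pick the chunk whose base precedes docID.
            idx = -idx - 2;
        }
        return startPointers[idx];
    }

    public static void main(String[] args) {
        ChunkIndexSketch index = new ChunkIndexSketch(
            new int[] {0, 128, 300}, new long[] {0L, 4096L, 9000L});
        System.out.println(index.startPointer(200));
    }
}
```

Only the two small arrays stay in memory; the term vectors themselves would be read from the data file starting at the returned offset.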

All core tests pass except TestIndexWriter.testEmptyDirRollback, which fails because it expects 3 files for term vectors.

This is still a work in progress. I still need to:
 - add tests to try to visit all branches,
 - override the default merge(MergeState) impl.

I've tested this patch against 100,000 docs from the 1K Wikipedia dump, and term vectors were ~20% smaller (I should try against a corpus with bigger docs to get more relevant results).

If you have ideas for compressing term vectors efficiently, they are welcome! Currently this patch does nothing crazy and stores terms and positions sequentially:
{code}
term1 - positions for term1 - offsets for term1 - payloads for term1 - term2 - ...{code}

Given that many terms are likely to have a frequency of 1, it might be more efficient to pack the positions/offsets for several terms all together(?)
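
The two orderings discussed above can be compared with a small Java sketch. It is only one possible reading of "packing several terms together", not what the patch does: the class and method names are hypothetical, offsets and payloads are left out for brevity, and an explicit per-term frequency is added (an assumption) so that a reader of either layout knows how many positions belong to each term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of two orderings for the same term vector data. */
public class TermVectorLayoutSketch {

    /** Sequential layout: term1, freq1, positions of term1, term2, freq2, ... */
    public static List<String> interleaved(String[] terms, int[][] positions) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            out.add(terms[i]);
            out.add(Integer.toString(positions[i].length)); // frequency (assumed field)
            for (int pos : positions[i]) {
                out.add(Integer.toString(pos));
            }
        }
        return out;
    }

    /** Grouped layout: all terms, then all frequencies, then all positions. */
    public static List<String> grouped(String[] terms, int[][] positions) {
        List<String> out = new ArrayList<>(Arrays.asList(terms));
        for (int[] termPositions : positions) {
            out.add(Integer.toString(termPositions.length));
        }
        for (int[] termPositions : positions) {
            for (int pos : termPositions) {
                out.add(Integer.toString(pos));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] terms = {"apache", "lucene", "vectors"};
        int[][] positions = {{0}, {1, 5}, {2}};
        System.out.println(interleaved(terms, positions));
        System.out.println(grouped(terms, positions));
    }
}
```

In the grouped variant, values of the same kind sit next to each other, so a run of small integers (for example, many frequencies equal to 1) could in principle be encoded with a fixed small number of bits per value. Whether that actually yields smaller files than the sequential layout would need to be measured.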
                
> Compressed term vectors
> -----------------------
>
>                 Key: LUCENE-4599
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4599
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs, core/termvectors
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.1
>
>         Attachments: LUCENE-4599.patch
>
>
> We should have codec-compressed term vectors similarly to what we have with stored fields.
