Date: Fri, 7 Dec 2012 16:27:21 +0000 (UTC)
From: "Adrien Grand (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Subject: [jira] [Updated] (LUCENE-4599) Compressed term vectors

[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4599:
---------------------------------

    Attachment: LUCENE-4599.patch

Initial patch. It makes term vectors behave like Lucene 4.1 stored fields: one index file, which is loaded into memory in a memory-efficient way, and one data file that stores the actual term vectors (so 2 files instead of the 3 used by the current term vectors impl).

All core tests pass except TestIndexWriter.testEmptyDirRollback, which fails because it expects 3 files for term vectors.

This is only work in progress; I still need to:
 - add tests to try to visit all branches,
 - override the default merge(MergeState) impl.

I've tested this patch against 100000 docs from the 1K wikipedia dump, and term vectors were ~20% smaller (I should try against a corpus with bigger docs to get more relevant results).

Ideas for compressing term vectors efficiently are welcome! Currently this patch does nothing crazy and stores terms and positions sequentially:
{code}
term1 - positions for term1 - offsets for term1 - payloads for term1 - term2 - ...
{code}

Given that many terms are likely to have a frequency of 1, it might be more efficient to pack the positions/offsets for several terms all together(?)

> Compressed term vectors
> -----------------------
>
>                 Key: LUCENE-4599
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4599
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs, core/termvectors
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.1
>
>         Attachments: LUCENE-4599.patch
>
>
> We should have codec-compressed term vectors similarly to what we have with stored fields.
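
As a rough illustration of the two layouts discussed in the message (the sequential per-term layout the patch currently uses, and the idea of packing positions/offsets for several terms together), here is a minimal Python sketch. It is not taken from the patch and does not use any Lucene API: the function names, the dict-based term vector, and the in-memory "streams" are all hypothetical, and payloads are left out for brevity.

```python
from typing import Dict, List, Tuple

# Hypothetical toy term vector: term -> list of (position, start_offset, end_offset).
TermVector = Dict[str, List[Tuple[int, int, int]]]


def sequential_layout(tv: TermVector) -> list:
    """Per-term layout: term1, positions for term1, offsets for term1, term2, ..."""
    out = []
    for term in sorted(tv):
        entries = tv[term]
        out.append(term)
        out.append([pos for pos, _, _ in entries])
        out.append([(start, end) for _, start, end in entries])
    return out


def packed_layout(tv: TermVector) -> dict:
    """Packed layout: all terms and frequencies first, then a single positions
    stream and a single offsets stream shared by every term."""
    terms = sorted(tv)
    return {
        "terms": terms,
        "freqs": [len(tv[t]) for t in terms],
        "positions": [pos for t in terms for pos, _, _ in tv[t]],
        "offsets": [(start, end) for t in terms for _, start, end in tv[t]],
    }


def unpack(packed: dict) -> TermVector:
    """Rebuild the per-term view by slicing the shared streams using the frequencies."""
    tv: TermVector = {}
    i = 0
    for term, freq in zip(packed["terms"], packed["freqs"]):
        positions = packed["positions"][i:i + freq]
        offsets = packed["offsets"][i:i + freq]
        tv[term] = [(pos, start, end) for pos, (start, end) in zip(positions, offsets)]
        i += freq
    return tv
```

With the packed layout, a term of frequency 1 contributes a single entry to each shared stream rather than starting its own small block, which is the case the message suggests might compress better. Whether that actually wins on real data would have to be measured.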