Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Reply-To: java-dev@lucene.apache.org
Message-ID: <918575843.1216665811818.JavaMail.jira@brutus>
Date: Mon, 21 Jul 2008 11:43:31 -0700 (PDT)
From: "Eks Dev (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index
In-Reply-To: <2069170042.1216420771958.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

    [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615357#action_12615357 ]

Eks Dev commented on LUCENE-1340:
---------------------------------

Great, this is already more than I expected; even indexing is going to be somewhat faster.

I have tried your patch on a smallish index with 8 million documents, and it passed our regression test without problems. It worked fine both with and without omitTf(true): no performance drop or bad surprises when we do not use it. A real test with production data is scheduled for tomorrow, around 80 million very small documents, with some very extensive tests. I will report back.

"The one place I know of that will still waste bytes is the term dict (TermInfo): it stores a long proxPointer on disk (in .tii,.tis) and also in memory because we load *.tii into RAM.... "

About this one: it would be nice not to store this as well, but I think the pointers are already reduced to one byte, as they are 0 in these cases (are they?). So we have this benefit without expecting it :)

And yes, more "column stride" is great. If you followed my comments on LUCENE-1278, that would mean we could easily "inline" very short postings into the term dict, without an increase (or with only a minimal one) in the size of the tii, so no locality penalty. Here I expect a huge performance benefit, as a skip() on another large file is going to be saved, independent of omitTf(true). If we follow a Zipfian distribution, there are *a lot* of terms with postings shorter than e.g. 16 ...

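To make the "one byte" guess above concrete, here is a minimal sketch of a variable-length long encoding of the kind Lucene uses for such values (7 payload bits per byte, high bit as a continuation flag). It assumes the prox pointer is written to disk as a VLong delta; the class and method names are hypothetical and this is not the code from the patch. It only speaks to the on-disk size, not to the long held in memory.

```java
import java.io.ByteArrayOutputStream;

final class VLongSketch {
    // Writes the value in 7-bit groups, low-order group first.
    // The high bit of each byte is set when more bytes follow.
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7FL) | 0x80L));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // A delta of 0 (hypothetical case: no prox data written) still costs one byte.
        long[] deltas = {0L, 127L, 128L, 1L << 20, 1L << 40};
        for (long d : deltas) {
            System.out.println(d + " -> " + encode(d).length + " byte(s)");
        }
    }
}
```

Under that assumption a zero delta costs exactly one byte per term on disk, so it is small but not free.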
Thanks again for your support; without you this patch would be just another nice idea :)

> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed for all fields; some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolean fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things forward a bit faster, as I have some concrete customer demands.
> Benefits can be expected for fields that are typical candidates for Filters: enumerations, user rights, IDs or very short "texts", phone numbers, zip codes, names...
> Status: just passed the standard tests (compatibility), committed for early review; I have not tried the new feature, and some asserts and one or two unit tests are missing
> Complexity: simpler than expected
> Can be used via omitTf() (whoever has used omitNorms() will know where to find it :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
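The saving named in the issue description ("reading one VInt less and one X>>>1") can be illustrated with a small sketch. It assumes the classic postings layout in which the doc delta is shifted left by one bit and the low bit flags a frequency of 1, with the frequency otherwise following as its own VInt. Each list element below stands for one VInt; the class and method names are hypothetical and this is not the code from the attached patches.

```java
import java.util.ArrayList;
import java.util.List;

final class PostingsSketch {
    // With TF: delta << 1, low bit set when freq == 1; otherwise freq follows separately.
    static List<Integer> encodeWithTf(int[] docs, int[] freqs) {
        List<Integer> out = new ArrayList<>();
        int last = 0;
        for (int i = 0; i < docs.length; i++) {
            int delta = docs[i] - last;
            last = docs[i];
            if (freqs[i] == 1) {
                out.add((delta << 1) | 1);
            } else {
                out.add(delta << 1);
                out.add(freqs[i]);
            }
        }
        return out;
    }

    // Without TF: plain doc deltas, no shift and no frequency value.
    static List<Integer> encodeOmitTf(int[] docs) {
        List<Integer> out = new ArrayList<>();
        int last = 0;
        for (int doc : docs) {
            out.add(doc - last);
            last = doc;
        }
        return out;
    }

    // Reading docs back with TF needs the extra X >>> 1 and sometimes a second read.
    static int[] decodeDocsWithTf(List<Integer> in, int count) {
        int[] docs = new int[count];
        int pos = 0;
        int doc = 0;
        for (int i = 0; i < count; i++) {
            int code = in.get(pos++);
            doc += code >>> 1;
            if ((code & 1) == 0) {
                pos++; // skip the explicit frequency value
            }
            docs[i] = doc;
        }
        return docs;
    }

    // Reading docs back without TF is a running sum of deltas.
    static int[] decodeDocsOmitTf(List<Integer> in) {
        int[] docs = new int[in.size()];
        int doc = 0;
        for (int i = 0; i < docs.length; i++) {
            doc += in.get(i);
            docs[i] = doc;
        }
        return docs;
    }

    public static void main(String[] args) {
        int[] docs = {3, 7, 8, 20};
        int[] freqs = {1, 4, 1, 2};
        System.out.println("with TF:  " + encodeWithTf(docs, freqs).size() + " values");
        System.out.println("omit TF:  " + encodeOmitTf(docs).size() + " values");
    }
}
```

For the four sample documents the TF layout needs six values and the TF-free layout four, which is the kind of IO and CPU saving the description refers to for pure boolean fields.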