From: "Michael McCandless (JIRA)"
To: java-dev@lucene.apache.org
Date: Fri, 3 Oct 2008 04:26:44 -0700 (PDT)
Subject: [jira] Updated: (LUCENE-1410) PFOR implementation
Message-ID: <608340566.1223033204269.JavaMail.jira@brutus>
In-Reply-To: <521761508.1222889624375.JavaMail.jira@brutus>

[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1410:
---------------------------------------

    Attachment: TestPFor2.java

Paul, I'm eager to test pfor on real
Lucene vInts. So I created a simple test program (attached TestPFor2.java). Run it like this:

{code}
  Usage: java org.apache.lucene.util.pfor.TestPFor2 <indexDirName> <vIntFileNameIn> <pForFileNameOut>

  Eg: java org.apache.lucene.util.pfor.TestPFor2 /lucene/index _l.prx _l.prx.pfor
{code}

where indexDirName is the directory of a Lucene index, vIntFileNameIn should be a file that just has a bunch of vInts (Lucene's *.frq and *.prx fit this) and pForFileNameOut is a file it writes with blocks encoded in PFor.

It first encodes the vInts from vIntFileNameIn into pfor blocks written to pForFileNameOut. Then it measures decode time of reading all vInts from vIntFileNameIn vs decode time of reading all pfor blocks. It runs each round 5 times.

The test has certain serious flaws:

  * Can you add a method that figures out the right frame size to use for a given block of ints (so that ~90% of the ints are < N bits)?

  * I'm using a fixed 6-bit frame size. Can you add bigger bit sizes to your pfor decompress?

With these fixes the test would be fairer to pfor.

In the PFor file that I write, I simply write an int (# bytes) followed by the bytes, for each block. Then when reading these blocks I read the #bytes, then read into the byte array backing the intArray passed to the PFor for decompress. In a real integration I think writing the int #bytes should be unnecessary (is pfor self-punctuating?).

This is also inefficient: in doing this for real we should avoid the double copy of the byte[]. In fact, we might push it even lower (under IndexInput, eg, create an IntBlockIndexInput) to possibly avoid the copy into byte[] in BufferedIndexInput by maybe using direct ByteBuffers from the OS. So even once we fix the top two issues above, the results of a "real" integration should be faster still.

I ran this on a 622 MB prx file from a Wikipedia index, and even with the above 2 limitations it's still a good amount faster:

{code}
java org.apache.lucene.util.pfor.TestPFor2 /lucene/big _l.prx _l.prx.pfor

encode _l.prx to _l.prx.pfor...
442979072 vints; 888027375 bytes compressed vs orig size 651670377

decompress using pfor:
4198 msec to decode 442979072 vInts (105521 vInts/msec)
4332 msec to decode 442979072 vInts (102257 vInts/msec)
4165 msec to decode 442979072 vInts (106357 vInts/msec)
4122 msec to decode 442979072 vInts (107467 vInts/msec)
4061 msec to decode 442979072 vInts (109081 vInts/msec)

decompress using readVInt:
7315 msec to read 442979104 vInts (60557 vInts/msec)
7390 msec to read 442979104 vInts (59943 vInts/msec)
5816 msec to read 442979104 vInts (76165 vInts/msec)
5937 msec to read 442979104 vInts (74613 vInts/msec)
5970 msec to read 442979104 vInts (74200 vInts/msec)
{code}

It's really weird how the time gets suddenly faster during readVInt. It's very repeatable. On another machine I see it get suddenly slower starting at the same (3rd) round. Adding -server and -Xbatch doesn't change this behavior. This is with (build 1.6.0_10-rc-b28) on Linux and (build 1.6.0_05-b13-120) on Mac OS X 10.5.5.

> PFOR implementation
> -------------------
>
>                 Key: LUCENE-1410
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1410
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Other
>            Reporter: Paul Elschot
>            Priority: Minor
>         Attachments: LUCENE-1410b.patch, TestPFor2.java
>
>   Original Estimate: 21840h
>  Remaining Estimate: 21840h
>
> Implementation of Patched Frame of Reference.
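
On the frame-size request in the message above (pick N so that roughly 90% of the ints in a block fit in N bits), here is a minimal sketch of one way such a method could look. The names FrameBitsChooser, chooseFrameBits and bitsRequired are hypothetical and do not come from the attached patch. The sketch assumes the block holds non-negative ints and that any value not fitting in the chosen frame would be stored as a PFor exception by the encoder.

```java
// Hypothetical helper, not part of the LUCENE-1410 patch.
// Picks the smallest frame bit width such that at least a target fraction
// of a block's values fit in the frame.
final class FrameBitsChooser {

  /** Number of bits needed to represent a non-negative value (0 needs 0 bits). */
  static int bitsRequired(int value) {
    return 32 - Integer.numberOfLeadingZeros(value);
  }

  /**
   * Returns the smallest N >= 1 such that at least ceil(fraction * length)
   * of the values in block[offset .. offset+length) satisfy value < 2^N.
   */
  static int chooseFrameBits(int[] block, int offset, int length, double fraction) {
    if (length <= 0) {
      throw new IllegalArgumentException("empty block");
    }
    if (!(fraction > 0.0 && fraction <= 1.0)) {
      throw new IllegalArgumentException("fraction must be in (0, 1]");
    }
    // histogram[b] counts the values that need exactly b bits (b = 0..31).
    int[] histogram = new int[32];
    for (int i = offset; i < offset + length; i++) {
      if (block[i] < 0) {
        throw new IllegalArgumentException("negative value at index " + i);
      }
      histogram[bitsRequired(block[i])]++;
    }
    int needed = (int) Math.ceil(fraction * length);
    int covered = histogram[0];
    for (int bits = 1; bits <= 31; bits++) {
      covered += histogram[bits];
      if (covered >= needed) {
        return bits;
      }
    }
    // Not reachable: all non-negative ints fit in 31 bits and needed <= length.
    throw new AssertionError("no frame size found");
  }
}
```

The test program could call chooseFrameBits(block, 0, blockSize, 0.9) once per block before compressing, in place of the fixed 6-bit frame. This does one pass over the block plus a 31-step scan of the histogram, so it should be cheap relative to the encode itself, though that has not been measured here.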