From: "Jason Polites"
To: java-user@lucene.apache.org
Subject: Re: Field compression too slow
Date: Fri, 11 Aug 2006 16:07:17 +1000
In-Reply-To: <44DBEC2E.5070001@mikemccandless.com>

I can share the data, but it would be quicker for you to just pull out some random text from anywhere you like. The issue is that the text was in an email, which was one of about 2,000, and I don't know which one. I got the 4.5MB figure from the number of bytes in the byte array reported in the debugger, and didn't bother to record the email file it was contained in. Anyway, I think it was text extracted from a PDF extracted from a ZIP, so it would take me a while to locate!

It's worth noting that the time I quoted is somewhat misleading. I killed my process after 10 minutes because I realised there was a problem and any further time was irrelevant. But the length of time is partially due to the load on the process. I am processing multiple files concurrently, and in so doing am performing a bunch of CPU-intensive tasks (text extraction, encryption etc). Most of this happens in separate threads, but they are all competing for CPU time.

The only way to really benchmark the performance of the compression is to vary both the compression level and the number of threads, and see how it scales. I'm confident that the compression mechanism used in Lucene is fine (had a look at the code, and it all seems pretty good), so I would guess that Lucene would have performance comparable to "vanilla" compression using the native Java libs.
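A minimal sketch of such a benchmark, assuming plain java.util.zip.Deflater as the "vanilla" baseline (this is not Lucene's own field compression code). The class name DeflateBench, the helper methods, the synthetic sample text and the thread counts are all hypothetical choices for illustration; real extracted text will compress differently, and the timings include thread start-up cost.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.Deflater;

final class DeflateBench {

    // Compress one buffer at the given level and return the compressed size in bytes.
    static int deflatedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] chunk = new byte[64 * 1024];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(chunk);
            }
            return total;
        } finally {
            deflater.end(); // release the native zlib resources
        }
    }

    // Run `threads` concurrent compressions of the same input and
    // return the elapsed wall-clock time in milliseconds.
    static long timeConcurrent(byte[] input, int level, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                tasks.add(() -> deflatedSize(input, level));
            }
            long start = System.nanoTime();
            for (Future<Integer> result : pool.invokeAll(tasks)) {
                result.get(); // surface any failure from a worker thread
            }
            return (System.nanoTime() - start) / 1_000_000;
        } finally {
            pool.shutdown();
        }
    }

    // Hypothetical stand-in for extracted document text: random words
    // from a tiny vocabulary, seeded so that runs are repeatable.
    static byte[] sampleText(int approxBytes) {
        String[] words = {"lucene", "index", "field", "compress",
                          "document", "search", "token", "email"};
        Random random = new Random(42);
        StringBuilder sb = new StringBuilder(approxBytes + 16);
        while (sb.length() < approxBytes) {
            sb.append(words[random.nextInt(words.length)]).append(' ');
        }
        return sb.toString().getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws Exception {
        byte[] input = sampleText(4_500_000); // roughly the 4.5MB case
        int[] levels = {Deflater.BEST_SPEED, Deflater.DEFAULT_COMPRESSION,
                        Deflater.BEST_COMPRESSION};
        for (int level : levels) {
            int size = deflatedSize(input, level);
            for (int threads : new int[] {1, 2, 4, 8}) {
                long ms = timeConcurrent(input, level, threads);
                System.out.printf("level=%d threads=%d size=%d bytes time=%d ms%n",
                        level, threads, size, ms);
            }
        }
    }
}
```

Comparing the time for N threads against N times the single-thread time at each level gives a rough picture of how far scaling departs from linear on a given machine.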
I'm betting you get non-linear scalability no matter what the compression level (due to the max throughput of the CPU, bus speed etc); but you may find scalability tends towards a linear curve (oxymoron?) the lower the compression level. This is really what I am looking for.

Also, upon reflection I'm not certain using compression inside the index is really a valuable process without lazy loading anyway. The time-cost of decompression when iterating hits reduces the overall effectiveness of the index. This is obviously solved by lazy loading (for searches) and I am excited about this feature being added.

Obviously it depends on the use-case, but in mine I realised that storing large amounts of data in the index is just not the right way to do things. So I changed my architecture so that the larger amounts of data are stored (and compressed) elsewhere, then brought back in when I need to update a document.

Of course all my problems would be solved if I had lazy loading AND field updating :)

On 8/11/06, Michael McCandless wrote:
>
> > I have a sample document which has about 4.5MB of text to be stored as
> > compressed data within the field, and the indexing of this document
> > seems to take an inordinate amount of time (over 10 minutes!). When
> > debugging I can see that it's stuck on the deflate() calls of the
> > Deflater used by Lucene.
>
> Would it be possible to get a copy of this document's text (only if
> you're able to share it)? I'd like to run some tests to work out the
> tradeoff (time taken vs % deflated) of the different levels we can pass
> to the zip library. If not that's fine, I'll just run on various random
> text sources I can find.
>
> Thanks.
>
> Mike