Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Message-ID: <d8050f240608102307v229c683bn6f9658911809d695@mail.gmail.com>
Date: Fri, 11 Aug 2006 16:07:17 +1000
From: "Jason Polites" <jason.polites@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: Field compression too slow
In-Reply-To: <44DBEC2E.5070001@mikemccandless.com>
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_Part_13272_20045004.1155276437978"
References: <d8050f240608100541m188b6116h67b343338c238e3b@mail.gmail.com>
	 <44DBEC2E.5070001@mikemccandless.com>

------=_Part_13272_20045004.1155276437978
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

I can share the data, but it would be quicker for you to just pull some
random text from anywhere you like.

The issue is that the text came from one of about 2,000 emails, and I don't
know which one.  I got the 4.5MB figure from the length of the byte array
shown in the debugger, and didn't bother to record which email file it came
from.  I think it was text extracted from a PDF that was itself extracted
from a ZIP, so it would take me a while to locate!

It's worth noting that the time I quoted is somewhat misleading.  I killed
my process after 10 minutes because I realised there was a problem and any
further time was irrelevant.  The elapsed time is also partly due to the
load on the process.

I am processing multiple files concurrently, and in so doing am performing a
bunch of CPU intensive tasks (text extraction, encryption etc).  Most of
this happens in separate threads, but they are all competing for CPU time.

The only way to really benchmark the performance of the compression is to
vary both the compression level and the number of threads, and see how it
scales.
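To make that concrete, here is a rough sketch of the kind of benchmark I
mean.  It is not taken from Lucene: the class name DeflateScalingBench, the
sampleText() generator and the chosen levels and thread counts are all made
up for illustration, and it uses java.util.zip.Deflater directly.  Absolute
numbers will depend entirely on the machine and the input text.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.Deflater;

// Hypothetical benchmark harness; names and parameters are illustrative only.
public class DeflateScalingBench {

    /** Deflates the data once at the given level and returns the compressed size. */
    public static long deflatedSize(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(data);
            deflater.finish();
            byte[] buf = new byte[64 * 1024];
            long total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buf);
            }
            return total;
        } finally {
            deflater.end(); // release the native zlib memory
        }
    }

    /** Runs one compression per thread, all concurrently; returns wall-clock ms. */
    public static long timeMillis(final byte[] data, final int level, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> jobs = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                jobs.add(() -> deflatedSize(data, level));
            }
            long start = System.nanoTime();
            for (Future<Long> f : pool.invokeAll(jobs)) {
                f.get(); // wait for every job and surface any failure
            }
            return (System.nanoTime() - start) / 1_000_000;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    /** Builds repeatable pseudo-text from a tiny vocabulary (an assumption, not real data). */
    public static byte[] sampleText(int bytes) {
        String[] words = {"lucene", "index", "field", "compress", "document", "search", "the", "of"};
        Random rnd = new Random(42);
        StringBuilder sb = new StringBuilder(bytes + 16);
        while (sb.length() < bytes) {
            sb.append(words[rnd.nextInt(words.length)]).append(' ');
        }
        return sb.substring(0, bytes).getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] data = sampleText(4_500_000); // roughly the size of my problem document
        for (int level : new int[] {Deflater.BEST_SPEED, 6, Deflater.BEST_COMPRESSION}) {
            for (int threads : new int[] {1, 2, 4, 8}) {
                System.out.printf("level=%d threads=%d elapsed=%dms%n",
                        level, threads, timeMillis(data, level, threads));
            }
        }
    }
}
```

Each thread compresses the whole payload, so with perfect scaling the
elapsed time would stay flat as the thread count rises up to the number of
cores.  How far it drifts from flat at each level is the thing to look at.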

I'm confident that the compression mechanism used in Lucene is fine (I had a
look at the code and it all seems pretty good), so I would guess that Lucene
performs comparably to "vanilla" compression using the standard Java
libs.
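By "vanilla" I mean something along these lines, using nothing but
java.util.zip.  This is a minimal sketch for comparison purposes, not a copy
of what Lucene does internally; the class and method names are made up, and
the buffer sizes are arbitrary.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper for a baseline measurement; not Lucene code.
public class VanillaDeflate {

    /** Compresses data at the given Deflater level (0 to 9). */
    public static byte[] compress(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(data);
            deflater.finish();
            ByteArrayOutputStream out =
                    new ByteArrayOutputStream(Math.max(32, data.length / 2));
            byte[] buf = new byte[8192];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release the native zlib memory
        }
    }

    /** Reverses compress(); rejects malformed or truncated input. */
    public static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out =
                    new ByteArrayOutputStream(Math.max(32, compressed.length * 2));
            byte[] buf = new byte[8192];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    // No progress possible: the stream is incomplete.
                    throw new IllegalArgumentException("truncated deflate stream");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not a valid deflate stream", e);
        } finally {
            inflater.end();
        }
    }
}
```

Timing compress() on the same text at Deflater.BEST_SPEED and
Deflater.BEST_COMPRESSION, outside of Lucene, would show whether the cost
sits in the zip library itself or somewhere around it.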

I'm betting you get non-linear scalability no matter what the compression
level (due to the max throughput of the CPU, bus speed etc); but you may
find scalability tends towards a linear curve (oxymoron?) the lower the
compression level.

That scaling behaviour is really what I am looking for.

Also, on reflection, I'm not certain that using compression inside the index
is really worthwhile without lazy loading anyway.  The time-cost of
decompression when iterating hits reduces the overall effectiveness of the
index.  This is obviously solved by lazy loading (for searches) and I am
excited about this feature being added.  Obviously it depends on the
use-case, but in mine I realised that storing large amounts of data in the
index is just not the right way to do things.  So I changed my architecture
so that the larger amounts of data are stored (and compressed) elsewhere,
then brought back in when I need to update a document.
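Very roughly, the "elsewhere" looks like the sketch below.  This is a
simplified illustration rather than my actual code: the class name
ExternalBodyStore is made up, it assumes one gzip file per document in a
single directory, and it assumes the document id is safe to use as a file
name.  Only the id then needs to be stored in the Lucene document.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical side store for large field bodies, kept outside the index.
public class ExternalBodyStore {
    private final Path dir;

    public ExternalBodyStore(Path dir) {
        try {
            this.dir = Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assumption: docId contains only characters that are valid in a file name.
    private Path fileFor(String docId) {
        return dir.resolve(docId + ".gz");
    }

    /** Writes the text gzip-compressed, replacing any earlier body for this id. */
    public void put(String docId, String text) {
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(fileFor(docId)))) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the body back, e.g. when the document has to be re-indexed. */
    public String get(String docId) {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(fileFor(docId)))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The decompression cost is then only paid when a document is actually
updated, not every time a hit is iterated during a search.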

Of course all my problems would be solved if I had lazy loading AND field
updating :)

On 8/11/06, Michael McCandless <lucene@mikemccandless.com> wrote:
>
>
> > I have a sample document which has about 4.5MB of text to be stored as
> > compressed data within the field, and the indexing of this document
> > seems to take an inordinate amount of time (over 10 minutes!).  When
> > debugging I can see that it's stuck on the deflate() calls of the
> > Deflater used by Lucene.
>
> Would it be possible to get a copy of this document's text (only if
> you're able to share it)?  I'd like to run some tests to work out the
> tradeoff (time taken vs % deflated) of the different levels we can pass
> to the zip library.  If not that's fine, I'll just run on various random
> text sources I can find.
>
> Thanks.
>
> Mike

------=_Part_13272_20045004.1155276437978--