Mailing-List: contact dev-help@jackrabbit.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@jackrabbit.apache.org
Received-SPF: pass (athena.apache.org: domain of a.schrijvers@1hippo.com
 designates 64.18.2.18 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <AANLkTimJZW4yhdo+_yNk9u6WWHq5rgnVK-vcg7-s9bEe@mail.gmail.com>
References: <AANLkTinfc_kU+phRHqUmATCrxC-ETyt33Ept3yb5cApc@mail.gmail.com>
	<AANLkTi=khBe91CM0FabJm5yfwt-=0W3vEDEy4LSaexs7@mail.gmail.com>
	<AANLkTimJZW4yhdo+_yNk9u6WWHq5rgnVK-vcg7-s9bEe@mail.gmail.com>
Date: Mon, 9 Aug 2010 17:36:15 +0200
Message-ID: <AANLkTin+mOAEhPaVfrc9EU36QqHi17xAkO7UNSbquQrX@mail.gmail.com>
Subject: Re: Jackrabbit performance data
From: Ard Schrijvers <a.schrijvers@onehippo.com>
To: dev@jackrabbit.apache.org
Content-Type: text/plain; charset=ISO-8859-1

Hello,

On Mon, Aug 9, 2010 at 4:54 PM, Jukka Zitting <jukka.zitting@gmail.com> wrote:
> Hi,
>
> On Mon, Aug 9, 2010 at 3:53 PM, Ard Schrijvers
> <a.schrijvers@onehippo.com> wrote:
>> First of all, thanks a lot for this Jukka. I really like it. Would you
>> have an idea how we could measure performance for larger repositories?
>> For example, I would be glad to add some query performance tests, but,
>> obviously, querying can be very sensitive to the number of nodes. I
>> would be interested in the performance of some queries (of xpath, sql
>> and qom) against different repository versions, but then specifically
>> queries against large repositories. I understand if it is not feasible
>> because the tests would take too long. WDYT?
>
> The size of the test repository shouldn't be too much of a problem, as
> long as the setup/teardown code doesn't take hours to complete. A few
> minutes per test is still quite OK; you can create quite a bit of test
> content in that time.

The thing is that I am particularly interested in running searches
against, say, 100K+ nodes. I have downloaded 6 GB of wiki XML pages. I
would like to see search improvements/degradations between versions
when the amount of data is large. It is important that we notice when we
implement some search feature that doesn't scale too well. Obviously,
the unit tests' search index is an in-memory one, which might also
influence the real numbers.

> The test suite currently doesn't allow multiple
> different tests to share test content, but that should be easy to
> solve by introducing a concept of test groups with their own
> setup/teardown phases.

Yes, true. It only gets hard when some tests modify the data. That
might in turn influence other tests, and we end up with hard-to-see
interdependencies between tests. For searching in particular,
it is interesting to see *how* long the first search in a warmed-up
environment takes *after* some persisted modification.
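
To make that measurement concrete, here is a minimal sketch in plain
Java of how it could be timed. It is not code from the test suite: the
class name ColdVersusWarmTimer and the 'modify' and 'query' callbacks
are made up for illustration, and a real test would plug a JCR save and
a JCR query in there.

```java
/** Sketch only: times the first query after a write separately from warmed-up queries. */
final class ColdVersusWarmTimer {

    /** Nanoseconds for the first query after the write, and the average of later queries. */
    static final class Timing {
        final long firstNanos;
        final long warmAverageNanos;

        Timing(long firstNanos, long warmAverageNanos) {
            this.firstNanos = firstNanos;
            this.warmAverageNanos = warmAverageNanos;
        }
    }

    private ColdVersusWarmTimer() {}

    /**
     * @param modify   hypothetical step that persists a change (e.g. saves a node)
     * @param query    hypothetical step that executes one search
     * @param warmRuns number of query executions used for warm-up and for the average
     */
    static Timing measure(Runnable modify, Runnable query, int warmRuns) {
        if (warmRuns < 1) {
            throw new IllegalArgumentException("warmRuns must be >= 1");
        }
        // Warm up first, so the environment is in a steady state before the write.
        for (int i = 0; i < warmRuns; i++) {
            query.run();
        }
        // The persisted modification whose effect on the next search we want to see.
        modify.run();

        // First search after the modification, timed on its own.
        long start = System.nanoTime();
        query.run();
        long first = System.nanoTime() - start;

        // Subsequent searches, averaged, as the baseline to compare against.
        long total = 0;
        for (int i = 0; i < warmRuns; i++) {
            start = System.nanoTime();
            query.run();
            total += System.nanoTime() - start;
        }
        return new Timing(first, total / warmRuns);
    }
}
```

Reporting firstNanos next to warmAverageNanos per repository version
would show whether the first search after a write got more or less
expensive over time.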

>
> A more essential consideration is the time it takes to execute a
> single test query. Currently the test suite is configured to spend 50
> seconds iterating over a single performance test, so to get good
> statistics an individual test shouldn't take much longer than a few
> seconds. We can increase the execution time, but I think a few seconds
> should in any case be the upper limit for most interesting search use
> cases.

Well, I think that when a search takes more than a second, alarm
bells should go off anyway :), so 50 seconds seems more than
fine to me.

>
> See the simple search test case I added in revision 983662. It would

Thank you so much, you make it very easy for me :)

> be great if you'd be interested in adding more complex search
> benchmarks.

Yes, I am. I'll try to find some time soon (it will probably be my spare
time, so hopefully this weekend or the next) to play around with the
tests and add a bunch of search tests. I am interested in seeing how
some searches evolve between versions, and also how they scale within a
single repo version. For example path-constrained searches and range
queries.
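
As a rough idea of what those two kinds of queries could look like in
JCR XPath, here is a small sketch. The helper class, the '/wiki/articles'
path and the 'size' property are only assumptions for illustration; the
actual benchmark content may well look different.

```java
/** Sketch only: builds JCR 1.0 XPath statements for two kinds of search benchmark. */
final class SearchBenchmarkQueries {

    private SearchBenchmarkQueries() {}

    /** All nodes below the given absolute path (a path-constrained search). */
    static String pathConstrained(String absPath) {
        return "/jcr:root" + absPath + "//element(*, nt:base)";
    }

    /** All nodes whose property lies in the inclusive range [low, high]. */
    static String range(String property, long low, long high) {
        return "//element(*, nt:base)[@" + property + " >= " + low
                + " and @" + property + " <= " + high + "]";
    }

    public static void main(String[] args) {
        // Hypothetical path and property name, just to show the resulting statements.
        System.out.println(pathConstrained("/wiki/articles"));
        System.out.println(range("size", 1000, 5000));
    }
}
```

The resulting strings would then be handed to
QueryManager.createQuery(statement, Query.XPATH) inside a test.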

I am particularly interested in the search performance numbers because I
think we need to invest time in some search refactoring. I think the
Jackrabbit search implementation was really state of the art relative to
the Lucene version it was originally built against. But now, imo, it
suffers from some historically grown things, like the 'multiple
indexes'. I tested IndexReader.reopen(), available since Lucene 2.3.0,
against a couple of million Lucene docs: I think reopen does pretty
much at the Lucene segment level what Jackrabbit does at the index
level, in that it keeps all valid segments open. We could make so much
code simpler. I had an interesting talk with a Hibernate Search
developer who faces similar requirements, like real-time search. Also,
recent Lucene improvements like TrieRange queries, and upcoming
features like NestedDocumentQueries and incremental field updates,
might become really interesting for us as well. But I do not want to go
into too much detail now, as that belongs in another thread. I hope to
get back to this before too long.

To get back to this thread: the performance plots would hopefully make
my findings visible, or show that I am just wrong :)

Thanks a lot again Jukka,

Regards Ard

>
> BR,
>
> Jukka Zitting
>