From: Bart van der Schans <b.vanderschans@onehippo.com>
Date: Tue, 6 Mar 2012 13:46:51 +0100
Subject: Re: [jr3] Index on randomly distributed data
To: dev@jackrabbit.apache.org

Hi Thomas,

On Tue, Mar 6, 2012 at 10:34 AM, Thomas Mueller <mueller@adobe.com> wrote:
> Hi,
>
> In Jackrabbit 2, we currently use a randomly generated UUID as the node
> id. For Jackrabbit 3 this is an open question. I was looking for ways to
> index randomly distributed data, but so far didn't find a solution. A
> Google query for "uuid primary key performance" gave me:
>
> http://stackoverflow.com/questions/2365132/uuid-performance-in-mysql "At
> my job, we use UUID as PKs. What I can tell you from experience is DO NOT
> USE THEM as PKs ... It's one of those things that when you have less than
> 1000 records it;s ok, but when you have millions, it's the worst thing you
> can do. Why? Because UUID are not sequential..."
>
> http://kccoder.com/mysql/uuid-vs-int-insert-performance/ "it takes 25
> hours to insert 15 million records into an empty UUID table"
>
> http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/
> "For auto_increment key load process took 1 hour 50 minutes ... For UUID
> process took over 12 hours and is still going... So in this little case
> we have about 200 times performance difference"
>
> I believe if we rely on an index on randomly distributed data, performance
> will degrade (factor 10 or more, depending on the repository size, the
> memory, and potentially on the number of changes). For Jackrabbit 2, to
> solve this performance problem, we can actually switch to sequential node
> ids - see JCR-2857. For Jackrabbit 3, if we use the content hash as the
> node id, then it wouldn't be possible to switch (it is not possible to
> generate sequential content hashes). With content hashes, one option is to
> make sure the index is always in memory. However, I believe we should not
> build a system that has such constraints, unless the alternative
> (sequential node ids) has problems we can not solve otherwise.
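
The locality argument above can be illustrated with a toy simulation.
This is only a sketch under stated assumptions: it models the index as
a set of leaf pages behind a small LRU page cache, sends every
sequential insert to the right-most leaf, and sends every random-key
insert (UUID or content hash) to a uniformly random leaf. The class
name, page counts and cache size are made up for illustration and say
nothing about any particular database.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

public class IndexLocalityDemo {

    /** Counts page-cache misses for a stream of leaf-page accesses, using an LRU cache. */
    public static long countMisses(long[] pageAccesses, final int cachePages) {
        // accessOrder=true keeps entries ordered from least to most recently used
        Map<Long, Boolean> lru = new LinkedHashMap<Long, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Boolean> eldest) {
                return size() > cachePages; // evict the least recently used page
            }
        };
        long misses = 0;
        for (long page : pageAccesses) {
            if (lru.get(page) == null) { // get() also refreshes the LRU position on a hit
                misses++;
                lru.put(page, Boolean.TRUE);
            }
        }
        return misses;
    }

    /** Sequential keys: inserts fill the right-most leaf, then move on to a new one. */
    public static long[] sequentialAccesses(int inserts, int keysPerPage) {
        long[] pages = new long[inserts];
        for (int i = 0; i < inserts; i++) {
            pages[i] = i / keysPerPage;
        }
        return pages;
    }

    /** Random keys: each insert lands on a uniformly random leaf (assumed model). */
    public static long[] randomAccesses(int inserts, int totalPages, long seed) {
        Random rnd = new Random(seed);
        long[] pages = new long[inserts];
        for (int i = 0; i < inserts; i++) {
            pages[i] = rnd.nextInt(totalPages);
        }
        return pages;
    }

    public static void main(String[] args) {
        int inserts = 100_000, keysPerPage = 100, totalPages = 10_000, cachePages = 1_000;
        System.out.println("sequential misses: "
                + countMisses(sequentialAccesses(inserts, keysPerPage), cachePages));
        System.out.println("random misses:     "
                + countMisses(randomAccesses(inserts, totalPages, 42L), cachePages));
    }
}
```

With these made-up numbers the cache holds a tenth of the leaf pages.
The sequential run misses only when it starts a new page, while most
random inserts hit a page that is not cached. Once the whole index
fits in the cache the difference disappears, which is the "index
always in memory" constraint mentioned above.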

I think this is an important subject. Databases are generally
optimized for sequentially increasing keys and perform better with
them than with random ones. What I'm wondering about is how we would
like to handle id generation (or uuid generation, for that matter) in
a cluster. Do we want to make it the responsibility of the storage
layer (auto-increment columns, sequences, etc.) or the responsibility
of Jackrabbit? How can we guarantee ids are unique *and* sequential in
a cluster without (too much) overhead?
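
One possible direction, purely as a sketch and not something that
exists in Jackrabbit: let each cluster node reserve a block of ids
from the shared storage layer and hand them out locally. The
BlockSource interface, the in-memory stand-in and the block size below
are hypothetical names chosen for illustration. A real implementation
would back reserveBlock() with a database sequence or a similar atomic
counter.

```java
import java.util.concurrent.atomic.AtomicLong;

public class BlockIdAllocator {

    /** Hypothetical hook into the shared storage layer. */
    public interface BlockSource {
        /** Atomically reserves 'size' consecutive ids and returns the first one. */
        long reserveBlock(int size);
    }

    /** In-memory stand-in for the shared storage, for illustration only. */
    public static class InMemoryBlockSource implements BlockSource {
        private final AtomicLong next = new AtomicLong(0);

        @Override
        public long reserveBlock(int size) {
            return next.getAndAdd(size);
        }
    }

    private final BlockSource source;
    private final int blockSize;
    private long nextId = 0;   // next id to hand out locally
    private long blockEnd = 0; // exclusive end of the current block; 0 means no block yet

    public BlockIdAllocator(BlockSource source, int blockSize) {
        this.source = source;
        this.blockSize = blockSize;
    }

    /** Returns a cluster-wide unique id; touches shared storage once per block. */
    public synchronized long nextId() {
        if (nextId == blockEnd) {
            nextId = source.reserveBlock(blockSize);
            blockEnd = nextId + blockSize;
        }
        return nextId++;
    }

    public static void main(String[] args) {
        BlockSource shared = new InMemoryBlockSource();
        BlockIdAllocator nodeA = new BlockIdAllocator(shared, 1000);
        BlockIdAllocator nodeB = new BlockIdAllocator(shared, 1000);
        System.out.println(nodeA.nextId()); // 0
        System.out.println(nodeB.nextId()); // 1000
        System.out.println(nodeA.nextId()); // 1
    }
}
```

This keeps the overhead at one round trip to the shared storage per
block. The trade-off is that ids are unique and sequential per cluster
node, but only roughly sequential across the cluster, and ids left
over in a block are lost when a node restarts.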

Regards,
Bart