jackrabbit-dev mailing list archives

From Raffaele Sena <raff...@gmail.com>
Subject Re: [jr3] Index on randomly distributed data
Date Tue, 06 Mar 2012 13:25:49 GMT
Can you use time-based UUIDs? They should be guaranteed to be sequential
and unique in the cluster.
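
A minimal sketch of the idea, not anything Jackrabbit ships. One caveat: a
standard version 1 UUID stores the low bits of the timestamp first, so its raw
byte order is not chronological. The sketch below therefore does not build a
version 1 UUID; it only uses java.util.UUID as a 128-bit container, with the
millisecond timestamp in the high bits, a counter below it, and a node id in
the low 64 bits. The class name TimeOrderedId and the nodeId parameter are
hypothetical, and the sketch assumes each cluster node is configured with a
distinct nodeId.

```java
import java.util.UUID;

/** Hypothetical sketch: ids that sort in creation order on one cluster node. */
final class TimeOrderedId {
    private final long nodeId;   // assumed to be unique per cluster node
    private long lastMillis = -1;
    private int counter = 0;     // disambiguates ids created in the same millisecond

    TimeOrderedId(long nodeId) {
        this.nodeId = nodeId;
    }

    synchronized UUID next() {
        long now = System.currentTimeMillis();
        if (now <= lastMillis) {
            // Same millisecond, or the clock moved backwards: stay on the
            // last timestamp and bump the counter instead.
            now = lastMillis;
            counter++;
            if (counter > 0xFFFF) {
                // Counter exhausted: advance the logical timestamp.
                now = lastMillis + 1;
                counter = 0;
            }
        } else {
            counter = 0;
        }
        lastMillis = now;
        // High 48 bits: timestamp; low 16 bits: counter.
        long msb = (now << 16) | counter;
        return new UUID(msb, nodeId);
    }
}
```

Ids from a single generator are strictly increasing under UUID.compareTo. Ids
from different nodes interleave roughly by wall-clock time, so inserts still
land near the end of the index rather than at random positions.
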
On Mar 6, 2012 4:47 AM, "Bart van der Schans" <b.vanderschans@onehippo.com>
wrote:

> Hi Thomas,
>
> On Tue, Mar 6, 2012 at 10:34 AM, Thomas Mueller <mueller@adobe.com> wrote:
> > Hi,
> >
> > In Jackrabbit 2, we currently use a randomly generated UUID as the node
> > id. For Jackrabbit 3 this is an open question. I was looking for ways to
> > index randomly distributed data, but so far didn't find a solution. A
> > Google query for "uuid primary key performance" gave me:
> >
> > http://stackoverflow.com/questions/2365132/uuid-performance-in-mysql "At
> > my job, we use UUID as PKs. What I can tell you from experience is DO NOT
> > USE THEM as PKs ... It's one of those things that when you have less than
> > 1000 records it's ok, but when you have millions, it's the worst thing you
> > can do. Why? Because UUID are not sequential..."
> >
> > http://kccoder.com/mysql/uuid-vs-int-insert-performance/ "it takes 25
> > hours to insert 15 million records into an empty UUID table"
> >
> > http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/
> > "For auto_increment key load process took 1 hour 50 minutes ... For UUID
> > process took over 12 hours and is still going...  So in this little case
> > we have about 200 times performance difference"
> >
> > I believe if we rely on an index on randomly distributed data, performance
> > will degrade (factor 10 or more, depending on the repository size, the
> > memory, and potentially on the number of changes). For Jackrabbit 2, to
> > solve this performance problem, we can actually switch to sequential node
> > ids - see JCR-2857. For Jackrabbit 3, if we use the content hash as the
> > node id, then it wouldn't be possible to switch (it is not possible to
> > generate sequential content hashes). With content hashes, one option is to
> > make sure the index is always in memory. However, I believe we should not
> > build a system that has such constraints, unless the alternative
> > (sequential node ids) has problems we cannot solve otherwise.
>
> I think this is an important subject. Databases are generally
> optimized for sequential indexes and will perform better. What I'm
> wondering about is how we would like to handle the id generation (or
> the uuid generation for that matter) in a cluster. Do we want to make
> it the responsibility of the storage layer (aka auto increment,
> sequences, etc) or the responsibility of jackrabbit? How can we
> guarantee ids are unique *and* sequential in a cluster without (too
> much) overhead?
>
> Regards,
> Bart
>
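
On the question of keeping ids unique and roughly sequential in a cluster
without too much overhead, one common approach is block allocation: each
cluster node reserves a range of ids from a shared counter and then hands them
out locally, so the shared counter is touched once per block rather than once
per id. The sketch below is only an illustration under that assumption. The
class name BlockIdAllocator is hypothetical, and the AtomicLong stands in for
whatever cluster-wide counter the storage layer would provide (a database
sequence, for example).

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: per-node id blocks reserved from a shared counter. */
final class BlockIdAllocator {
    private final AtomicLong shared; // stand-in for a cluster-wide counter
    private final int blockSize;
    private long next;               // next id to hand out locally
    private long limit;              // exclusive end of the reserved block

    BlockIdAllocator(AtomicLong shared, int blockSize) {
        this.shared = shared;
        this.blockSize = blockSize;
    }

    synchronized long nextId() {
        if (next == limit) {
            // Local block exhausted (or never reserved): take a new one.
            next = shared.getAndAdd(blockSize);
            limit = next + blockSize;
        }
        return next++;
    }
}
```

Ids are unique across nodes and sequential within each block. Across the
cluster they are only approximately ordered, and ids left unused in a block
when a node stops are lost, which leaves gaps.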
