jackrabbit-dev mailing list archives

From Bart van der Schans <b.vandersch...@onehippo.com>
Subject Re: [jr3] Index on randomly distributed data
Date Tue, 06 Mar 2012 12:46:51 GMT
Hi Thomas,

On Tue, Mar 6, 2012 at 10:34 AM, Thomas Mueller <mueller@adobe.com> wrote:
> Hi,
> In Jackrabbit 2, we currently use a randomly generated UUID as the node
> id. For Jackrabbit 3 this is an open question. I was looking for ways to
> index randomly distributed data, but so far didn't find a solution. A
> Google query for "uuid primary key performance" gave me:
> http://stackoverflow.com/questions/2365132/uuid-performance-in-mysql "At
> my job, we use UUID as PKs. What I can tell you from experience is DO NOT
> USE THEM as PKs ... It's one of those things that when you have less than
> 1000 records it's ok, but when you have millions, it's the worst thing you
> can do. Why? Because UUID are not sequential..."
> http://kccoder.com/mysql/uuid-vs-int-insert-performance/ "it takes 25
> hours to insert 15 million records into an empty UUID table"
> http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/
> "For auto_increment key load process took 1 hour 50 minutes ... For UUID
> process took over 12 hours and is still going...  So in this little case
> we have about 200 times performance difference"
> I believe if we rely on an index on randomly distributed data, performance
> will degrade (factor 10 or more, depending on the repository size, the
> memory, and potentially on the number of changes). For Jackrabbit 2, to
> solve this performance problem, we can actually switch to sequential node
> ids - see JCR-2857. For Jackrabbit 3, if we use the content hash as the
> node id, then it wouldn't be possible to switch (it is not possible to
> generate sequential content hashes). With content hashes, one option is to
> make sure the index is always in memory. However, I believe we should not
> build a system that has such constraints, unless the alternative
> (sequential node ids) has problems we cannot solve otherwise.

I think this is an important subject. Databases are generally
optimized for sequentially increasing index keys and will perform
better with them. What I'm wondering about is how we would handle id
generation (or uuid generation, for that matter) in a cluster. Do we
want to make it the responsibility of the storage layer (auto
increment, sequences, etc.) or the responsibility of Jackrabbit? How
can we guarantee that ids are unique *and* sequential in a cluster
without (too much) overhead?
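
To make the last question concrete, one common pattern is block
allocation (often called hi/lo): each cluster node reserves a range of
ids from a shared counter and then hands them out locally, so it only
talks to the shared counter once per block. The sketch below is only an
illustration under assumptions. BlockIdAllocator, BlockSource and
nextBlockStart are hypothetical names, not Jackrabbit APIs, and the
shared counter could just as well be a database sequence or a row
updated in a transaction.

```java
/**
 * Hypothetical stand-in for the shared counter in the storage layer
 * (for example a database sequence). Not a Jackrabbit interface.
 */
interface BlockSource {
    /** Atomically reserves blockSize ids and returns the first one. */
    long nextBlockStart(int blockSize);
}

/** Per-node allocator: one call to the shared counter per block of ids. */
class BlockIdAllocator {
    private final BlockSource source;
    private final int blockSize;
    private long next;   // next id to hand out
    private long limit;  // first id that is NOT part of the current block

    BlockIdAllocator(BlockSource source, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.source = source;
        this.blockSize = blockSize;
        // next == limit == 0, so the first nextId() call reserves a block
    }

    synchronized long nextId() {
        if (next == limit) {
            // Current block is used up: reserve a new one from the shared counter.
            next = source.nextBlockStart(blockSize);
            limit = next + blockSize;
        }
        return next++;
    }
}
```

With this approach, ids are unique across the cluster and increasing
within each node, but only roughly sequential cluster-wide: two nodes
interleave ids from different blocks, and a node that restarts leaves a
gap for the unused rest of its block. Whether that is sequential enough
for the index depends on the block size and the number of nodes.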

