jackrabbit-dev mailing list archives

From Thomas Mueller <muel...@adobe.com>
Subject [jr3] Index on randomly distributed data
Date Tue, 06 Mar 2012 09:34:20 GMT
Hi,

In Jackrabbit 2, we currently use a randomly generated UUID as the node
id. For Jackrabbit 3 this is an open question. I was looking for ways to
index randomly distributed data efficiently, but so far I haven't found a
solution. A Google query for "uuid primary key performance" gave me:

http://stackoverflow.com/questions/2365132/uuid-performance-in-mysql "At
my job, we use UUID as PKs. What I can tell you from experience is DO NOT
USE THEM as PKs ... It's one of those things that when you have less than
1000 records it's ok, but when you have millions, it's the worst thing you
can do. Why? Because UUID are not sequential..."

http://kccoder.com/mysql/uuid-vs-int-insert-performance/ "it takes 25
hours to insert 15 million records into an empty UUID table"

http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/
"For auto_increment key load process took 1 hour 50 minutes ... For UUID
process took over 12 hours and is still going...  So in this little case
we have about 200 times performance difference"

I believe if we rely on an index on randomly distributed data, performance
will degrade (factor 10 or more, depending on the repository size, the
memory, and potentially on the number of changes). For Jackrabbit 2, to
solve this performance problem, we can actually switch to sequential node
ids - see JCR-2857. For Jackrabbit 3, if we use the content hash as the
node id, then it wouldn't be possible to switch (it is not possible to
generate sequential content hashes). With content hashes, one option is to
make sure the index is always in memory. However, I believe we should not
build a system that has such a constraint, unless the alternative
(sequential node ids) has problems we cannot solve otherwise.
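
To illustrate the effect, here is a rough sketch (not Jackrabbit code). It
models the index as sorted leaf pages behind a small LRU page cache and
counts how many inserts touch a page that is not cached. The class name,
the page size of 100 keys, and the cache size are all assumptions made for
this illustration, and random ids are modelled as landing on a uniformly
random leaf page:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical simulation, for illustration only.
class IndexLocalitySketch {

    // Assumed number of index entries stored per leaf page.
    static final int KEYS_PER_PAGE = 100;

    // Counts inserts whose target leaf page is not in an LRU cache
    // that holds at most cachePages pages.
    static long countMisses(int[] keys, int cachePages) {
        // An access-ordered LinkedHashMap evicts the least recently used page.
        Map<Integer, Boolean> lru =
                new LinkedHashMap<Integer, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> eldest) {
                return size() > cachePages;
            }
        };
        long misses = 0;
        for (int key : keys) {
            int page = key / KEYS_PER_PAGE; // neighbouring keys share a page
            if (lru.get(page) == null) {    // get() also refreshes recency
                misses++;                   // the page would be read from disk
                lru.put(page, Boolean.TRUE);
            }
        }
        return misses;
    }

    // Sequential ids: every insert goes to the right-most page.
    static int[] sequentialKeys(int n) {
        int[] keys = new int[n];
        for (int i = 0; i < n; i++) {
            keys[i] = i;
        }
        return keys;
    }

    // Random ids: every insert goes to a uniformly random page.
    static int[] randomKeys(int n, long seed) {
        Random random = new Random(seed);
        int[] keys = new int[n];
        for (int i = 0; i < n; i++) {
            keys[i] = random.nextInt(n);
        }
        return keys;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        int cachePages = 100; // cache much smaller than the index
        System.out.println("sequential misses: "
                + countMisses(sequentialKeys(n), cachePages));
        System.out.println("random misses:     "
                + countMisses(randomKeys(n, 42L), cachePages));
    }
}
```

With sequential ids, each leaf page is loaded once, so the miss count
equals the number of pages. With random ids, almost every insert misses
once the index is much larger than the cache. The sketch ignores inner
B-tree nodes, page splits and write-back, so it shows only the trend and
does not predict a specific slowdown factor.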

Regards,
Thomas

