jackrabbit-oak-dev mailing list archives

From Thomas Mueller <muel...@adobe.com>
Subject Re: Oak Scalability: Load Distribution
Date Mon, 03 Mar 2014 10:11:07 GMT
Hi,

The reason why I believe using the hash code is bad for performance is
that the locality of nodes in the backend storage (MongoDB, or any other
kind of database) would be worse than it is now. This affects performance
when writing, and when reading siblings. Let's assume we have the nodes
/page_1/paragraph_0..2/child_0..2/ (that is, 3 paragraph nodes, and each
paragraph node has 3 child nodes). With the current approach, the entries
are stored as follows:

1:/page_1
...
2:/page_1/paragraph_0
2:/page_1/paragraph_1
2:/page_1/paragraph_2
...
3:/page_1/paragraph_0/child_0
3:/page_1/paragraph_0/child_1
3:/page_1/paragraph_0/child_2
3:/page_1/paragraph_1/child_0
3:/page_1/paragraph_1/child_1
3:/page_1/paragraph_1/child_2
3:/page_1/paragraph_2/child_0
3:/page_1/paragraph_2/child_1
3:/page_1/paragraph_2/child_2

The child documents (child_0 .. child_2) of all paragraphs are stored next
to each other, likely in the same b-tree page, and therefore likely in the
same sector on disk, even if there are many other pages (page_2,
page_3, ...). That's good for caching.
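
Here is a minimal self-contained Java sketch of that effect. The
depth:path key format matches the example above; the hash-prefixed
variant is a hypothetical scheme for comparison, not a concrete proposal:

import java.util.TreeMap;

// Minimal sketch: compares the sort order (and with it the on-disk
// locality) of the current path-based keys with a hypothetical
// hash-prefixed key scheme.
public class KeyLocality {
    public static void main(String[] args) {
        // a TreeMap sorts its keys, like a b-tree index in the backend
        TreeMap<String, String> pathKeys = new TreeMap<>();
        TreeMap<String, String> hashKeys = new TreeMap<>();
        for (int p = 0; p < 3; p++) {
            String parent = "/page_1/paragraph_" + p;
            for (int c = 0; c < 3; c++) {
                String path = parent + "/child_" + c;
                // current approach: <depth>:<path>
                pathKeys.put("3:" + path, path);
                // hypothetical hash approach: <hash of parent>:<name>
                hashKeys.put(Integer.toHexString(parent.hashCode())
                        + ":child_" + c, path);
            }
        }
        // all 9 child keys end up next to each other in the index:
        pathKeys.keySet().forEach(System.out::println);
        System.out.println();
        // siblings stay together, but the 3 groups are scattered
        // over the whole keyspace:
        hashKeys.keySet().forEach(System.out::println);
    }
}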

With a hash-based approach, data is distributed over the whole keyspace.
Even though child_0 .. child_2 of each paragraph are still next to each
other, locality is much worse. If we assume 3 child nodes per node on
average, we have almost the same situation as with Jackrabbit 2.x, where
node id lookup was by far the biggest performance problem, because node
ids in Jackrabbit 2 are randomly distributed. Using randomly distributed
primary keys is one of the worst performance mistakes you can make with a
database; that is why, for example, the default object id of MongoDB is
sequential (time based):
http://docs.mongodb.org/manual/reference/object-id/. I'm not worried at
all about the overhead of calculating the hash code of the path; that is
insignificant. It's the random distribution of the data that is
problematic (the second sketch below, after the links, illustrates this).
See also:

http://kccoder.com/mysql/uuid-vs-int-insert-performance/ (especially the
first two charts comparing auto_increment versus uuid)


http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/
"What is the biggest problem with UUID ? It is the fact inserts are going
in random locations in index tree which will require a lot of IO in case
of index tree not fitting into memory"


http://stackoverflow.com/questions/2365132/uuid-performance-in-mysql "At
my job, we use UUID as PKs. What I can tell you from experience is DO NOT
USE THEM as PKs (SQL Server by the way). It's one of those things that
when you have less than 1000 records it's ok, but when you have millions,
it's the worst thing you can do. Why? Because UUID are not sequential, so
everytime a new record is inserted MSSQL needs to go look at the correct
page to insert the record in, and then insert the record. The really ugly
consequence with this is that the pages end up all in different sizes and
they end up fragmented, so now we have to do de-fragmentation periodic...".


https://issues.apache.org/jira/browse/JCR-2857
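
To make the contrast between time-based and random keys concrete, here is
a minimal self-contained Java sketch. The time-based id is a simplified
stand-in for MongoDB's ObjectId, not its real format:

import java.util.TreeSet;
import java.util.UUID;

// Minimal sketch: a time-based id (simplified stand-in for MongoDB's
// ObjectId) always sorts after all previous ids, so every insert
// appends at the end of the index. A random UUID can land anywhere,
// so every insert may touch a different b-tree page.
public class IdDistribution {
    public static void main(String[] args) {
        TreeSet<String> timeBased = new TreeSet<>();
        TreeSet<String> random = new TreeSet<>();
        for (int i = 0; i < 5; i++) {
            // fixed-width hex timestamp, so lexicographic order
            // matches insertion order
            timeBased.add(String.format("%016x-%d", System.nanoTime(), i));
            random.add(UUID.randomUUID().toString());
        }
        timeBased.forEach(System.out::println); // same as insertion order
        System.out.println();
        random.forEach(System.out::println);    // unrelated to insertion order
    }
}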


Regards,

Thomas
