jackrabbit-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Dürig <mdue...@apache.org>
Subject Re: [jr3] Sequential Node Ids
Date Thu, 23 Dec 2010 14:00:51 GMT

Hi Thomas,

I think these are interesting graphs and they show that we do not 
exploit locality sufficiently when persisting nodes. However, your 
results base on the underlying assumption that sequential node ids lead 
to better locality than random node ids do. In general this will depends 
on the access pattern and the underlying persistence store: While for 
the append only case sequential node ids might exhibit good locality, 
for large repositories with many random CRUD operations this approach 
might eventually degenerate to behavior similar to random node ids.

I suggest to abstract node ids from storage ids: the node id is an 
identifier for retrieving nodes from JCR. A storage id is an identifier 
which is used by the persistence store to retrieve a certain piece of 
data (which could contain a node, part of a node or multiple nodes). A 
mapping from nodes ids to storage ids can then be used to tune locality: 
i.e. sequential vs. random, optimal block size etc.

Michael

On 22.12.10 9:51, Thomas Mueller wrote:
> Currently, Jackrabbit uses random node ids (randomly generated UUIDs).
> For Jackrabbit 3, and possibly Jackrabbit 2.x, I propose to use
> sequential node ids.
>
> The main advantages for random node ids (the default) are: it's easy to
> develop, and merging data from multiple repositories is simple.
>
> The disadvantages for random node ids are performance and space. With
> performance, I don't mean that randomly generated UUIDs are slower to
> generate (they are). I mean performance to read and write data to disk.
> The main performance problem is reading and writing (indexing) random
> keys, which is slow because adjacent (nearby) nodes data aren't adjacent
> in the index. This slows down index access (read and write) a lot,
> specially if the data doesn't fit in the buffer cache (file system cache).
>
> I wrote a simple test case for Jackrabbit trunk. I also patched
> Jackrabbit to allow sequential UUIDs. Creating 1 million nodes is about
> 80% faster when using sequential UUIDs instead of random UUIDs. I don't
> have the data yet, but I'm sure the difference is much, much bigger when
> reading from larger repositories once the data doesn't fit in RAM (10
> million nodes or so). I will expand the test case to support reading.
>
> The following diagram is the operations per second (bigger is better)
> after inserting x nodes (0 to 1 million). With sequential UUIDs,
> performance gets better and then stays the same (which is expected). For
> random UUIDs which are now the default, performance first goes up and
> then down. The resulting database (I used H2 for performance) was about
> 145 MB for sequential UUIDs and 169 MB for random UUIDs (the reason for
> this difference is that the b-tree fill factor is better for sequential
> UUIDs). All the data still fits in the buffer cache of the file system
> without problem.
>
>
>
> Something else I noticed when running the test case was that the
> Jackrabbit internals are relatively slow. When profiling the test case,
> I saw that database access is not an important factor. Creating the
> nodes with the JCR API took 285 / 526 seconds, and re-creating the
> database from a SQL script took 55 seconds (this includes reading the
> 200 MB SQL script file and parsing the SQL statements, one insert
> statement per row; this is for the random UUIDs). I would expect
> Jackrabbit is faster than 55 seconds, because it doesn't have to read
> the SQL script from disk and parse statements.
>
> The test case creates 1 million nodes, saves after each 10'000 nodes,
> and uses a fan-out of 20 child nodes per node. I disabled the search
> index in the repository configuration. Source code:
>
>
> *import* java.io.File;
>
> *import* javax.jcr.Node;
>
> *import* javax.jcr.Session;
>
> *import* javax.jcr.SimpleCredentials;
>
> *import* org.apache.commons.io.FileUtils;
>
> *import* org.apache.jackrabbit.core.TransientRepository;
>
> *import* org.apache.jackrabbit.core.id.NodeId;
>
>
> *public* *class* TestLoop {
>
>
> *long* start;
>
> Profiler prof;
>
>
> *public* *static* *void* main(String... args) *throws* Exception {
>
> *new* TestLoop().run();
>
> }
>
>
> *void* startProf() {
>
> // if(prof != null) {
>
> // prof.stopCollecting();
>
> // System.out.println(prof.getTop(3));
>
> // }
>
> // prof = new Profiler();
>
> // prof.interval = 1;
>
> // prof.depth = 32;
>
> // prof.startCollecting();
>
> }
>
>
> *void* run() *throws* Exception {
>
> System.setProperty("jackrabbit.sequentialUUIDs", "false");
>
> startProf();
>
> FileUtils.deleteDirectory(*new* File("repository"));
>
> FileUtils.copyFile(*new* File("repository copy.xml"),
>
> *new* File("repository.xml"));
>
> TransientRepository rep = *new* TransientRepository();
>
> Session s = rep.login(
>
> *new* SimpleCredentials("admin", "admin".toCharArray()));
>
> *int* nodes = 1000000;
>
> *int* nodesPerFolder = 20;
>
> *int* saveEach = 10000;
>
> System.out.println(*new* NodeId());
>
> *int* levels = (*int*) Math.ceil(Math.log(nodes)
>
> / Math.log(nodesPerFolder));
>
> start = System.currentTimeMillis();
>
> startProf();
>
> createNodes(s.getRootNode(), 0, nodes,
>
> saveEach, levels, nodesPerFolder);
>
> System.out.println("took: " + (System.currentTimeMillis() - start));
>
> s.logout();
>
> }
>
>
> *private* *int* createNodes(Node parent, *int* x, *int* nodes,
>
> *int* saveEach, *int* depth, *int* nodesPerFolder)
>
> *throws* Exception {
>
> *for* (*int* i = 0; x < nodes && i < nodesPerFolder; i++) {
>
> *if* (depth > 1) {
>
> Node n = parent.addNode("f" + x, "nt:unstructured");
>
> x = createNodes(n, x, nodes,
>
> saveEach, depth - 1, nodesPerFolder);
>
> } *else* {
>
> parent.addNode("n" + i, "nt:unstructured");
>
> x++;
>
> *if* ((x % saveEach) == 0) {
>
> parent.getSession().save();
>
> *long* t = System.currentTimeMillis() - start;
>
> startProf();
>
> System.out.println((x * 1000L / t) + " op/s at " + x);
>
> }
>
> }
>
> }
>
> *return* x;
>
> }
>
> }
>
>
>
>
>


Mime
View raw message