jackrabbit-dev mailing list archives

From Thomas Mueller <muel...@adobe.com>
Subject [jr3] Sequential Node Ids
Date Wed, 22 Dec 2010 08:51:48 GMT
Currently, Jackrabbit uses random node ids (randomly generated UUIDs). For Jackrabbit 3, and
possibly Jackrabbit 2.x, I propose to use sequential node ids.

The main advantages of random node ids (the current default) are that they are easy to implement and that merging
data from multiple repositories is simple.

The disadvantages of random node ids are performance and space. By performance I don't mean
the time to generate the ids (randomly generated UUIDs are in fact slower to generate), but the
time to read and write data on disk. The main problem is reading and writing (indexing)
random keys: because the data of adjacent (nearby) nodes isn't adjacent in the index, index
access (read and write) slows down a lot, especially once the data no longer fits in
the buffer cache (file system cache).
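
To illustrate the idea, here is a minimal sketch (my illustration, not the actual patch) of
how sequential ids can be generated: a random prefix chosen once at startup plus a counter
in the low bits. Ids created one after the other then sort next to each other, so nearby
nodes end up on nearby index pages:

import java.security.SecureRandom;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch, not the actual Jackrabbit patch: the high 64 bits are
// random (chosen once, to avoid collisions between repositories), and the
// low 64 bits simply count up, so consecutive ids are adjacent when sorted.
public class SequentialUuidFactory {

    private final long msb = new SecureRandom().nextLong();
    private final AtomicLong counter = new AtomicLong();

    public UUID next() {
        return new UUID(msb, counter.incrementAndGet());
    }

    public static void main(String... args) {
        SequentialUuidFactory f = new SequentialUuidFactory();
        // The generated ids share the prefix and differ only in the last
        // digits, unlike UUID.randomUUID():
        for (int i = 0; i < 3; i++) {
            System.out.println(f.next());
        }
    }
}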

I wrote a simple test case for Jackrabbit trunk, and patched Jackrabbit to optionally use
sequential UUIDs. Creating 1 million nodes is about 80% faster with sequential UUIDs than
with random UUIDs. I don't have the data yet, but I'm sure the difference is much, much bigger
when reading from larger repositories (10 million nodes or so), once the data no longer fits
in RAM. I will extend the test case to cover reading.

The following diagram shows the operations per second (bigger is better) after inserting x nodes
(0 to 1 million). With sequential UUIDs, performance first improves and then stays constant, which
is expected. With random UUIDs, which are currently the default, performance first goes up and then
goes down. The resulting database (I used H2 as the persistence backend) was about 145 MB with
sequential UUIDs and 169 MB with random UUIDs; the reason for the difference is that the b-tree
fill factor is better for sequential UUIDs: random inserts cause page splits that leave pages only
partially filled, while sequential inserts fill each page almost completely before moving on to
the next. At this size, all the data still fits in the buffer cache of the file system
without problem.

[Inline image: diagram of operations per second versus number of inserted nodes, for sequential and random UUIDs]


Something else I noticed when running the test case is that the Jackrabbit internals are
relatively slow. When profiling the test case, I saw that database access is not an important
factor. Creating the nodes through the JCR API took 285 seconds (sequential UUIDs) / 526 seconds
(random UUIDs), while re-creating the same database from a SQL script took 55 seconds, and that
includes reading the 200 MB SQL script file and parsing the SQL statements, one insert statement
per row (this is for the random UUIDs). I would expect Jackrabbit to be faster than 55 seconds,
because it doesn't have to read a SQL script from disk and parse statements.

The test case creates 1 million nodes, saves after every 10'000 nodes, and uses a fan-out of
20 child nodes per node. I disabled the search index in the repository configuration. Source
code:



import java.io.File;
import javax.jcr.Node;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import org.apache.commons.io.FileUtils;
import org.apache.jackrabbit.core.TransientRepository;
import org.apache.jackrabbit.core.id.NodeId;
// Profiler is presumably org.h2.util.Profiler from H2 (it matches the
// startCollecting / stopCollecting / getTop calls below); the import is
// added here so the class compiles.
import org.h2.util.Profiler;

public class TestLoop {

    long start;
    Profiler prof;

    public static void main(String... args) throws Exception {
        new TestLoop().run();
    }

    void startProf() {
//        if (prof != null) {
//            prof.stopCollecting();
//            System.out.println(prof.getTop(3));
//        }
//        prof = new Profiler();
//        prof.interval = 1;
//        prof.depth = 32;
//        prof.startCollecting();
    }

    void run() throws Exception {
        // Set to "true" to test the sequential UUID patch.
        System.setProperty("jackrabbit.sequentialUUIDs", "false");
        startProf();
        // Start from an empty repository; "repository copy.xml" is a
        // pristine copy of the repository configuration.
        FileUtils.deleteDirectory(new File("repository"));
        FileUtils.copyFile(new File("repository copy.xml"),
                new File("repository.xml"));
        TransientRepository rep = new TransientRepository();
        Session s = rep.login(
                new SimpleCredentials("admin", "admin".toCharArray()));
        int nodes = 1000000;
        int nodesPerFolder = 20;
        int saveEach = 10000;
        System.out.println(new NodeId());
        // Tree depth for the given fan-out:
        // ceil(log(1000000) / log(20)) = 5 levels.
        int levels = (int) Math.ceil(Math.log(nodes)
                / Math.log(nodesPerFolder));
        start = System.currentTimeMillis();
        startProf();
        createNodes(s.getRootNode(), 0, nodes,
                saveEach, levels, nodesPerFolder);
        System.out.println("took: " + (System.currentTimeMillis() - start));
        s.logout();
    }

    private int createNodes(Node parent, int x, int nodes,
            int saveEach, int depth, int nodesPerFolder)
            throws Exception {
        for (int i = 0; x < nodes && i < nodesPerFolder; i++) {
            if (depth > 1) {
                // Intermediate level: add a folder node and recurse.
                Node n = parent.addNode("f" + x, "nt:unstructured");
                x = createNodes(n, x, nodes,
                        saveEach, depth - 1, nodesPerFolder);
            } else {
                // Leaf level: add the actual nodes, saving the session and
                // printing the average throughput every saveEach nodes.
                parent.addNode("n" + i, "nt:unstructured");
                x++;
                if ((x % saveEach) == 0) {
                    parent.getSession().save();
                    long t = System.currentTimeMillis() - start;
                    startProf();
                    System.out.println((x * 1000L / t) + " op/s at " + x);
                }
            }
        }
        return x;
    }
}
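
To compare the two variants, run the test once with "jackrabbit.sequentialUUIDs" set to
"false" and once with "true" (the property only has an effect with the sequential UUID patch
applied). The test deletes the repository directory at the start, so each run begins with an
empty repository.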




