jackrabbit-users mailing list archives

From Stefan Guggisberg <stefan.guggisb...@gmail.com>
Subject Re: Using Jackrabbit/JCR as IDE workspace data backend
Date Mon, 26 Sep 2011 16:13:28 GMT
On Mon, Sep 26, 2011 at 3:51 PM, Marcel Bruch <marcel.bruch@gmail.com> wrote:
> Thanks Stefan. I gave it a try. Could you or someone else comment on
> the code and its performance?
>
> I wrote a fairly ad-hoc dump of the 5900 data files into Jackrabbit.
> Storing ~240 MB took roughly 3 minutes. Is this the expected time such
> an operation takes? Is it possible to improve the performance somehow?

the performance seems rather poor. it's hard to tell what's wrong
without having the test data. i noticed that you're storing the
content of the .json files as string properties. why aren't you
storing the json data as nodes & properties?
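
To picture the "nodes & properties" layout asked about above, here is a minimal sketch. It does not call the JCR API: nested Maps stand in for a parsed .json file, and the output lists the node paths and property names one could create for it. The class name, the sample keys and the /org/example/Foo path are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JsonLayoutSketch {

    // Walks a parsed JSON object (modeled as nested Maps) and lists what
    // would be created: one node per nested object, one property per scalar.
    static List<String> layout(String path, Map<String, Object> json) {
        List<String> out = new ArrayList<>();
        out.add("node " + path);
        for (Map.Entry<String, Object> e : json.entrySet()) {
            Object v = e.getValue();
            if (v instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) v;
                out.addAll(layout(path + "/" + e.getKey(), child));
            } else {
                out.add("property " + path + "/" + e.getKey() + " = " + v);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // hypothetical analysis result for one class
        Map<String, Object> method = new LinkedHashMap<>();
        method.put("name", "login");
        method.put("calls", 3);

        Map<String, Object> unit = new LinkedHashMap<>();
        unit.put("superclass", "java.lang.Object");
        unit.put("method0", method);

        for (String line : layout("/org/example/Foo", unit)) {
            System.out.println(line);
        }
    }
}
```

With real JCR code, each "node" line would roughly correspond to an addNode() call and each "property" line to a setProperty() call, instead of one large string property per file.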

anyway, i quickly ran an adapted ad hoc test on my machine
(macbook pro 2.66 ghz, standard harddisk). the test imports
an 'svn export' of jackrabbit/trunk.

importing ~6500 files takes ~30s which is IMO decent.

cheers
stefan


/////////////////////////////////////////////////////////////////////////////////////////////////////////
import org.apache.commons.io.FileUtils;
import org.apache.jackrabbit.core.TransientRepository;

import javax.jcr.Node;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import java.io.File;
import java.io.FileInputStream;
import java.util.Calendar;

public class JcrArtifactStoreTest {

    static final String FILE_ROOT = "/Users/stefan/tmp/jackrabbit-src/";

    static final boolean STORE_BINARY = false;

    static int count = 0;
    static long size = 0;
    static long ts = 0;

    public static void main(String[] args) throws Exception {

        TransientRepository repository = new TransientRepository();
        Session session = repository.login(
                new SimpleCredentials("admin", "admin".toCharArray()));

        ts = System.currentTimeMillis();
        long ts0 = ts;

        importNode(new File(FILE_ROOT), session.getRootNode());

        session.save();

        long ts1 = System.currentTimeMillis();
        System.out.printf("%d ms: %d units persisted. data %s\n",
                ts1 - ts, count,
                FileUtils.byteCountToDisplaySize(size));
        ts = ts1;

        System.out.println("Total time: " + (ts1 - ts0) + " ms");

        // logging out the last session lets TransientRepository shut down cleanly
        session.logout();
    }

    static void importNode(File file, Node parent) throws Exception {
        if (file.isDirectory()) {
            Node newNode = parent.addNode(file.getName(), "nt:folder");
            File[] children = file.listFiles();
            if (children != null) {
                for (int i = 0; i < children.length; i++) {
                    importNode(children[i], newNode);
                }
            }
        } else {
            Node newNode = parent.addNode(file.getName(), "nt:file");
            String nt = STORE_BINARY ? "nt:resource" : "nt:unstructured";
            Node content = newNode.addNode("jcr:content", nt);
            if (STORE_BINARY) {
                content.setProperty("jcr:data", new FileInputStream(file));
            } else {
                content.setProperty("jcr:data",
                        FileUtils.readFileToString(file));
            }
            content.setProperty("jcr:lastModified", Calendar.getInstance());
            content.setProperty("jcr:mimeType", "application/octet-stream");

            size += file.length();
            // count each file once; save in batches of 500 files
            if (++count % 500 == 0) {
                parent.getSession().save();

                long ts1 = System.currentTimeMillis();

                System.out.printf("%d ms: %d units persisted. data %s\n",
                        ts1 - ts, count,
                        FileUtils.byteCountToDisplaySize(size));
                ts = ts1;
            }
        }
    }
}



>
> The code I used to persist data is given below. The pure IO time w/o
> jackrabbit is ~1second w/ solid state disk.
>
> Thanks for your comments,
> Marcel
>
> Mon Sep 26 15:39:05 CEST 2011: 200 units persisted.  data 5 MB
> Mon Sep 26 15:39:11 CEST 2011: 400 units persisted.  data 13 MB
> Mon Sep 26 15:39:21 CEST 2011: 600 units persisted.  data 21 MB
> Mon Sep 26 15:39:31 CEST 2011: 800 units persisted.  data 28 MB
> Mon Sep 26 15:39:35 CEST 2011: 1000 units persisted.  data 33 MB
> Mon Sep 26 15:39:40 CEST 2011: 1200 units persisted.  data 42 MB
> Mon Sep 26 15:39:44 CEST 2011: 1400 units persisted.  data 49 MB
> Mon Sep 26 15:39:50 CEST 2011: 1600 units persisted.  data 57 MB
> Mon Sep 26 15:39:54 CEST 2011: 1800 units persisted.  data 65 MB
> Mon Sep 26 15:39:58 CEST 2011: 2000 units persisted.  data 72 MB
> Mon Sep 26 15:40:10 CEST 2011: 2200 units persisted.  data 88 MB
> Mon Sep 26 15:40:15 CEST 2011: 2400 units persisted.  data 94 MB
> Mon Sep 26 15:40:22 CEST 2011: 2600 units persisted.  data 102 MB
> Mon Sep 26 15:40:26 CEST 2011: 2800 units persisted.  data 107 MB
> Mon Sep 26 15:40:30 CEST 2011: 3000 units persisted.  data 113 MB
> Mon Sep 26 15:40:36 CEST 2011: 3200 units persisted.  data 123 MB
> Mon Sep 26 15:40:40 CEST 2011: 3400 units persisted.  data 129 MB
> Mon Sep 26 15:40:45 CEST 2011: 3600 units persisted.  data 136 MB
> Mon Sep 26 15:40:48 CEST 2011: 3800 units persisted.  data 140 MB
> Mon Sep 26 15:40:58 CEST 2011: 4000 units persisted.  data 143 MB
> Mon Sep 26 15:41:18 CEST 2011: 4200 units persisted.  data 154 MB
> Mon Sep 26 15:41:24 CEST 2011: 4400 units persisted.  data 164 MB
> Mon Sep 26 15:41:38 CEST 2011: 4600 units persisted.  data 185 MB
> Mon Sep 26 15:41:43 CEST 2011: 4800 units persisted.  data 193 MB
> Mon Sep 26 15:41:50 CEST 2011: 5000 units persisted.  data 204 MB
> Mon Sep 26 15:41:56 CEST 2011: 5200 units persisted.  data 211 MB
> Mon Sep 26 15:42:00 CEST 2011: 5400 units persisted.  data 218 MB
> Mon Sep 26 15:42:05 CEST 2011: 5600 units persisted.  data 226 MB
> Mon Sep 26 15:42:10 CEST 2011: 5800 units persisted.  data 235 MB
> Mon Sep 26 15:42:15 CEST 2011: 5927 units persisted
>
>
> public class JcrArtifactStoreTest {
>
>    private TransientRepository repository;
>    private Session session;
>
>    @Before
>    public void setup() throws RepositoryException {
>
>        final File basedir = new File("recommenders/").getAbsoluteFile();
>        basedir.mkdir();
>        repository = new TransientRepository(basedir);
>        session = repository.login(new SimpleCredentials("username",
> "password".toCharArray()));
>    }
>
>    @Test
>    public void test2() throws ConfigurationException,
> RepositoryException, IOException {
>
>        int i = 0;
>        int size = 0;
>        final Iterator<File> it = findDataFiles();
>        final Node rootNode = session.getRootNode();
>
>        while (it.hasNext()) {
>            final File file = it.next();
>            Node activeNode = rootNode;
>            for (final String segment :
>                    new Path(file.getAbsolutePath()).segments()) {
>                activeNode = JcrUtils.getOrAddNode(activeNode, segment);
>            }
>            // System.out.println(activeNode.getPath());
>            final String content = Files.toString(file, Charsets.UTF_8);
>            size += content.getBytes().length;
>            activeNode.setProperty("cu", content);
>            if (++i % 200 == 0) {
>                session.save();
>                System.out.printf("%s: %d units persisted.  data %s\n",
>                        new Date(), i,
>                        FileUtils.byteCountToDisplaySize(size));
>            }
>        }
>        session.save();
>        System.out.printf("%s: %d units persisted\n", new Date(), i);
>    }
>
>    @SuppressWarnings("unchecked")
>    private Iterator<File> findDataFiles() {
>        return FileUtils.iterateFiles(
>                new File("/Users/Marcel/Repositories/frankenberger-android-example-apps/"),
>                FileFilterUtils.suffixFileFilter(".json"), TrueFileFilter.TRUE);
>    }
>
>
>
>
> 2011/9/26 Stefan Guggisberg <stefan.guggisberg@gmail.com>:
>> hi marcel,
>>
>> On Sun, Sep 25, 2011 at 3:40 PM, Marcel Bruch <marcel.bruch@gmail.com> wrote:
>>> Hi,
>>>
>>> I'm looking for some advice on whether Jackrabbit might be a good choice
>>> for my problem. Any comments on this are greatly appreciated.
>>>
>>>
>>> = Short description of the challenge =
>>>
>>> We've built an Eclipse-based tool that analyzes Java source files and
>>> stores its analysis results in additional files. The workspace
>>> potentially has hundreds of projects, and each project may have up to a
>>> few thousand files. Say there are 200 projects and 1000 Java source
>>> files per project in a single workspace. Then there will be
>>> 200*1000 = 200,000 files.
>>>
>>> On a full workspace build, all these 200k files have to be compiled (by
>>> the IDE) and analyzed (by our tool) at once, and the analysis results
>>> have to be dumped to disk rather fast.
>>> But the most common use case is that a single file is changed several
>>> times per minute and thus gets analyzed frequently.
>>>
>>> At the moment, the analysis results are dumped to disk as plain JSON
>>> files, one JSON file for each Java class. Each JSON file is around 5 to
>>> 100 KB in size; some files grow up to several megabytes (<10 MB). These
>>> files have a few hundred complex JSON nodes (which might map perfectly
>>> to nodes in JCR).
>>>
>>> = Question =
>>>
>>> We would like to replace the simple file system approach with a more
>>> sophisticated one, and I wonder whether Jackrabbit may be a suitable
>>> backend for this use case. Since we map all our data to JSON already, it
>>> looks like Jackrabbit/JCR is a perfect fit for this, but I can't say for
>>> sure.
>>>
>>> What's your suggestion? Is Jackrabbit capable of quickly loading and
>>> storing JSON-like data, even if 200k files (nodes + their sub-nodes)
>>> have to be updated in a very short time?
>>
>> absolutely. if the data is reasonably structured/organized jackrabbit
>> should be a perfect fit.
>> i suggest to leverage the java package space hierarchy for organizing the data
>> (i.e. org.apache.jackrabbit.core.TransientRepository ->
>> /org/apache/jackrabbit/core/TransientRepository).
>> for further data modeling recommendations see [0].
>>
>> cheers
>> stefan
>>
>> [0] http://wiki.apache.org/jackrabbit/DavidsModel
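
The package-to-path mapping suggested above can be sketched in a few lines. This is only an illustration of the naming scheme, not Jackrabbit API; the class and method names are made up.

```java
public class PackagePathSketch {

    // Turns a fully qualified Java name into a slash-separated repository path.
    static String toPath(String qualifiedName) {
        return "/" + qualifiedName.replace('.', '/');
    }

    public static void main(String[] args) {
        // prints /org/apache/jackrabbit/core/TransientRepository
        System.out.println(toPath("org.apache.jackrabbit.core.TransientRepository"));
    }
}
```

Each path segment would then become one node, so classes in the same package end up as siblings under the same parent node.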
>>
>>>
>>>
>>> Thanks for your suggestions. If you need more details on what operations
>>> are performed or what the data looks like, I would be glad to take your
>>> questions.
>>>
>>> Marcel
>>>
>
