jackrabbit-users mailing list archives

From Marcel Bruch <marcel.br...@gmail.com>
Subject Re: Using Jackrabbit/JCR as IDE workspace data backend
Date Mon, 26 Sep 2011 13:51:18 GMT
Thanks Stefan. I gave it a try. Could you or someone else comment on
the code and its performance?

I wrote a fairly ad-hoc dump of the 5,900 data files into Jackrabbit.
Storing ~240 MB took roughly 3 minutes. Is that about the time such an
operation is expected to take? Is it possible to improve the performance somehow?

The code I used to persist the data is given below. For comparison, the pure
I/O time without Jackrabbit is about 1 second on a solid state disk.

Thanks for your comments,
Marcel

Mon Sep 26 15:39:05 CEST 2011: 200 units persisted.  data 5 MB
Mon Sep 26 15:39:11 CEST 2011: 400 units persisted.  data 13 MB
Mon Sep 26 15:39:21 CEST 2011: 600 units persisted.  data 21 MB
Mon Sep 26 15:39:31 CEST 2011: 800 units persisted.  data 28 MB
Mon Sep 26 15:39:35 CEST 2011: 1000 units persisted.  data 33 MB
Mon Sep 26 15:39:40 CEST 2011: 1200 units persisted.  data 42 MB
Mon Sep 26 15:39:44 CEST 2011: 1400 units persisted.  data 49 MB
Mon Sep 26 15:39:50 CEST 2011: 1600 units persisted.  data 57 MB
Mon Sep 26 15:39:54 CEST 2011: 1800 units persisted.  data 65 MB
Mon Sep 26 15:39:58 CEST 2011: 2000 units persisted.  data 72 MB
Mon Sep 26 15:40:10 CEST 2011: 2200 units persisted.  data 88 MB
Mon Sep 26 15:40:15 CEST 2011: 2400 units persisted.  data 94 MB
Mon Sep 26 15:40:22 CEST 2011: 2600 units persisted.  data 102 MB
Mon Sep 26 15:40:26 CEST 2011: 2800 units persisted.  data 107 MB
Mon Sep 26 15:40:30 CEST 2011: 3000 units persisted.  data 113 MB
Mon Sep 26 15:40:36 CEST 2011: 3200 units persisted.  data 123 MB
Mon Sep 26 15:40:40 CEST 2011: 3400 units persisted.  data 129 MB
Mon Sep 26 15:40:45 CEST 2011: 3600 units persisted.  data 136 MB
Mon Sep 26 15:40:48 CEST 2011: 3800 units persisted.  data 140 MB
Mon Sep 26 15:40:58 CEST 2011: 4000 units persisted.  data 143 MB
Mon Sep 26 15:41:18 CEST 2011: 4200 units persisted.  data 154 MB
Mon Sep 26 15:41:24 CEST 2011: 4400 units persisted.  data 164 MB
Mon Sep 26 15:41:38 CEST 2011: 4600 units persisted.  data 185 MB
Mon Sep 26 15:41:43 CEST 2011: 4800 units persisted.  data 193 MB
Mon Sep 26 15:41:50 CEST 2011: 5000 units persisted.  data 204 MB
Mon Sep 26 15:41:56 CEST 2011: 5200 units persisted.  data 211 MB
Mon Sep 26 15:42:00 CEST 2011: 5400 units persisted.  data 218 MB
Mon Sep 26 15:42:05 CEST 2011: 5600 units persisted.  data 226 MB
Mon Sep 26 15:42:10 CEST 2011: 5800 units persisted.  data 235 MB
Mon Sep 26 15:42:15 CEST 2011: 5927 units persisted


import java.io.File;
import java.io.IOException;
import java.util.Date;
import java.util.Iterator;

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.FileFilterUtils;
import org.apache.commons.io.filefilter.TrueFileFilter;
import org.apache.jackrabbit.commons.JcrUtils;
import org.apache.jackrabbit.core.TransientRepository;
import org.apache.jackrabbit.core.config.ConfigurationException;
import org.eclipse.core.runtime.Path;
import org.junit.Before;
import org.junit.Test;

import com.google.common.base.Charsets;
import com.google.common.io.Files;

public class JcrArtifactStoreTest {

    private TransientRepository repository;
    private Session session;

    @Before
    public void setup() throws RepositoryException {
        final File basedir = new File("recommenders/").getAbsoluteFile();
        basedir.mkdir();
        repository = new TransientRepository(basedir);
        session = repository.login(new SimpleCredentials("username", "password".toCharArray()));
    }

    @Test
    public void test2() throws ConfigurationException, RepositoryException, IOException {
        int i = 0;
        int size = 0;
        final Iterator<File> it = findDataFiles();
        final Node rootNode = session.getRootNode();

        while (it.hasNext()) {
            final File file = it.next();
            // Mirror the file system path below the JCR root, one node per path segment.
            Node activeNode = rootNode;
            for (final String segment : new Path(file.getAbsolutePath()).segments()) {
                activeNode = JcrUtils.getOrAddNode(activeNode, segment);
            }
            // Store the JSON content as a string property on the leaf node.
            final String content = Files.toString(file, Charsets.UTF_8);
            size += content.getBytes(Charsets.UTF_8).length;
            activeNode.setProperty("cu", content);
            // Save in batches of 200 files and log the progress.
            if (++i % 200 == 0) {
                session.save();
                System.out.printf("%s: %d units persisted.  data %s\n", new Date(), i,
                        FileUtils.byteCountToDisplaySize(size));
            }
        }
        session.save();
        System.out.printf("%s: %d units persisted\n", new Date(), i);
    }

    @SuppressWarnings("unchecked")
    private Iterator<File> findDataFiles() {
        return FileUtils.iterateFiles(
                new File("/Users/Marcel/Repositories/frankenberger-android-example-apps/"),
                FileFilterUtils.suffixFileFilter(".json"), TrueFileFilter.TRUE);
    }
}


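One change I'm considering but haven't tried yet, so this is only a sketch: store each
JSON file as an nt:file/nt:resource binary instead of a large string property, while still
saving in batches. Whether that actually helps probably depends on the persistence manager
and data store configuration, and it assumes the file names are valid JCR names. Roughly
(reusing the session field from the test above):

    // additional imports: javax.jcr.Binary, java.io.FileInputStream,
    //                     java.io.InputStream, java.util.Calendar
    private void storeAsFileNode(final Node parent, final File file)
            throws RepositoryException, IOException {
        // Get or create an nt:file node named after the data file.
        final String name = file.getName();
        final Node fileNode = parent.hasNode(name)
                ? parent.getNode(name)
                : parent.addNode(name, "nt:file");
        // nt:file requires an nt:resource child named jcr:content that carries the data.
        final Node resource = fileNode.hasNode("jcr:content")
                ? fileNode.getNode("jcr:content")
                : fileNode.addNode("jcr:content", "nt:resource");
        final InputStream in = new FileInputStream(file);
        try {
            // Stream the JSON into a binary property instead of a string property.
            final Binary binary = session.getValueFactory().createBinary(in);
            resource.setProperty("jcr:data", binary);
            resource.setProperty("jcr:mimeType", "application/json");
            resource.setProperty("jcr:lastModified", Calendar.getInstance());
        } finally {
            in.close();
        }
    }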


2011/9/26 Stefan Guggisberg <stefan.guggisberg@gmail.com>:
> hi marcel,
>
> On Sun, Sep 25, 2011 at 3:40 PM, Marcel Bruch <marcel.bruch@gmail.com> wrote:
>> Hi,
>>
>> I'm looking for some advice on whether Jackrabbit might be a good choice for my problem.
>> Any comments on this are greatly appreciated.
>>
>>
>> = Short description of the challenge =
>>
>> We've built an Eclipse-based tool that analyzes Java source files and stores its analysis
>> results in additional files. The workspace potentially has hundreds of projects, and each
>> project may have up to a few thousand files. Say there are 200 projects and 1000 Java
>> source files per project in a single workspace; then there will be 200*1000 = 200,000
>> files.
>>
>> On a full workspace build, all these 200k files have to be compiled (by the IDE) and
>> analyzed (by our tool) at once, and the analysis results have to be dumped to disk rather
>> quickly.
>> But the most common use case is that a single file is changed several times per minute
>> and thus gets analyzed frequently.
>>
>> At the moment, the analysis results are dumped to disk as plain JSON files, one JSON file
>> for each Java class. Each JSON file is around 5 to 100 KB in size; some files grow up to
>> several megabytes (<10 MB). These files have a few hundred complex JSON nodes (which
>> might map perfectly to nodes in JCR).
>>
>> = Question =
>>
>> We would like to replace the simple file system approach with a more sophisticated one,
>> and I wonder whether Jackrabbit may be a suitable backend for this use case. Since we map
>> all our data to JSON already, Jackrabbit/JCR looks like a perfect fit, but I can't say
>> for sure.
>>
>> What's your suggestion? Is Jackrabbit capable of quickly loading and storing JSON-like
>> data - even if 200k files (nodes + their sub-nodes) have to be updated in a very short
>> time?
>
> absolutely. if the data is reasonably structured/organized jackrabbit
> should be a perfect fit.
> i suggest leveraging the java package space hierarchy for organizing the data
> (i.e. org.apache.jackrabbit.core.TransientRepository ->
> /org/apache/jackrabbit/core/TransientRepository).
> for further data modeling recommendations see [0].
>
> cheers
> stefan
>
> [0] http://wiki.apache.org/jackrabbit/DavidsModel
>
>>
>>
>> Thanks for your suggestions. If you need more details on what operations are performed
>> or how the data looks, I would be glad to take your questions.
>>
>> Marcel
>>
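
PS: just to make sure I understood the package-hierarchy suggestion correctly - instead of
mirroring the file system path, I would derive the node path from the fully qualified type
name, roughly like this (hypothetical helper, not part of the test above):

    // e.g. org.apache.jackrabbit.core.TransientRepository
    //      -> /org/apache/jackrabbit/core/TransientRepository
    private Node getOrCreateTypeNode(final Node root, final String fqcn)
            throws RepositoryException {
        Node node = root;
        for (final String segment : fqcn.split("\\.")) {
            node = JcrUtils.getOrAddNode(node, segment);
        }
        return node;
    }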
