incubator-jena-dev mailing list archives

From "Paolo Castagna (JIRA)"
Subject [jira] [Commented] (JENA-117) tdbloader2: Java external sorting using binary files vs. UNIX sort over text files
Date Mon, 26 Sep 2011 13:36:25 GMT


Paolo Castagna commented on JENA-117:

An experimental version of tdbloader2 is here:

It's a pure Java program and it uses a two-pass algorithm, even to build the node table.
It should (more testing is needed!) scale better on machines with little RAM
available.

Here is how you can check it out and package it:

  cd /tmp
  svn co tdbloader2
  cd /tmp/tdbloader2
  mvn package

To run it:

  java -cp target/tdbloader2-0.1-incubating-SNAPSHOT-jar-with-dependencies.jar \
       -server -d64 -Xmx6144M cmd.tdbloader2 --no-stats --compression \
       --spill-size 1500000 --loc /tmp/tdb /path/to/your/rdfdata.nt.gz

Use -h to see the options available:

  cmd.tdbloader2 --loc=DIR FILE ...
      -v   --verbose         Verbose
      -q   --quiet           Run with minimal output
      --debug                Output information for debugging
      --version              Version information
      --loc                  Location
      --compression          Use compression for intermediate files
      --buffer-size          The size of buffers for IO in bytes
      --gzip-outside         GZIP...(Buffered...())
      --spill-size           The size of spillable segments in tuples|records
      --no-stats             Do not generate the stats file
      --no-buffer            Do not use Buffered{Input|Output}Stream
      --max-merge-files      Specify the maximum number of files to merge at the same time (default: 100)
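
As a rough guide to how --spill-size and --max-merge-files interact, here is a
small back-of-the-envelope sketch. It assumes the textbook external-sort model:
one sorted file is written every spill-size tuples, and each merge pass combines
groups of at most max-merge-files files. Whether tdbloader2 schedules its merges
exactly this way is an assumption; the class and method names are hypothetical.

```java
// Hypothetical helper, not part of tdbloader2: estimates spill files and merge passes.
final class MergePlanSketch {
    // Number of sorted files produced when spilling every spillSize tuples.
    static long runs(long tuples, long spillSize) {
        if (spillSize < 1) throw new IllegalArgumentException("spillSize must be >= 1");
        return (tuples + spillSize - 1) / spillSize; // ceiling division
    }

    // Number of passes needed to reduce `runs` sorted files to a single one
    // when at most maxMergeFiles files are merged at the same time.
    static int mergePasses(long runs, int maxMergeFiles) {
        if (maxMergeFiles < 2) throw new IllegalArgumentException("maxMergeFiles must be >= 2");
        int passes = 0;
        while (runs > 1) {
            runs = (runs + maxMergeFiles - 1) / maxMergeFiles;
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        // One billion tuples, --spill-size 1500000, --max-merge-files 100.
        long r = runs(1_000_000_000L, 1_500_000L);
        System.out.println(r + " runs, " + mergePasses(r, 100) + " merge passes");
        // prints "667 runs, 2 merge passes"
    }
}
```

Under this model, a larger spill size means fewer intermediate files but more
heap used per segment, so the two options have to be balanced against -Xmx.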

> tdbloader2: Java external sorting using binary files vs. UNIX sort over text files
> ----------------------------------------------------------------------------------
>                 Key: JENA-117
>                 URL:
>             Project: Jena
>          Issue Type: Improvement
>          Components: TDB
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>              Labels: performance, tdbloader2
>         Attachments: TDB_JENA-117_r1171714.patch
> There is probably a significant performance improvement for tdbloader2 in replacing the
UNIX sort over text files with an external sorting pure Java implementation.
> Since JENA-99 we now have a SortedDataBag which does exactly that.
>     ThresholdPolicyCount<Tuple<Long>> policy = new ThresholdPolicyCount<Tuple<Long>>(1000000);
>     SerializationFactory<Tuple<Long>> serializerFactory = new TupleSerializationFactory();
>     Comparator<Tuple<Long>> comparator = new TupleComparator();
>     SortedDataBag<Tuple<Long>> sortedDataBag =
>         new SortedDataBag<Tuple<Long>>(policy, serializerFactory, comparator);
> TupleSerializationFactory creates TupleInputStream|TupleOutputStream, which are wrappers
around DataInputStream|DataOutputStream. TupleComparator is trivial.
> Preliminary results seem promising and show that the Java implementation can be faster
than UNIX sort, since it uses smaller binary files (instead of text files) and it compares
long values rather than strings.
> An example, ExternalSort, which compares SortedDataBag with UNIX sort, is available here:
> A further advantage of doing the sorting in Java rather than with UNIX sort is that we could
stream results directly into the BPlusTreeRewriter, rather than writing them to disk and then
reading them back from disk into the BPlusTreeRewriter.
> I have not yet run an experiment to see whether this is actually a significant improvement.
> Using compression for intermediate files might help, but more experiments are necessary
to establish if it is worthwhile or not.
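
For illustration only, the approach described in the issue (sort fixed-size
segments in memory, spill each one as a binary file of long values, then merge
the files) can be sketched in plain Java as below. This is not the SortedDataBag
code from JENA-99: the class ExternalSortSketch, its method names, and the tuple
width of 3 are assumptions made for the example.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.function.Consumer;

// Hypothetical sketch of external sorting over binary spill files.
// This is NOT the SortedDataBag code from JENA-99.
final class ExternalSortSketch {
    static final int WIDTH = 3; // assumed: three node ids per tuple

    // Compare tuples component by component as long values.
    static int compare(long[] a, long[] b) {
        for (int i = 0; i < WIDTH; i++) {
            int c = Long.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return 0;
    }

    // One sorted spill file plus the tuple currently at its head.
    static final class Run {
        final DataInputStream in;
        long[] head;

        Run(Path file) throws IOException {
            in = new DataInputStream(new BufferedInputStream(Files.newInputStream(file)));
            advance();
        }

        // Load the next tuple into head, or set head to null at end of file.
        void advance() throws IOException {
            long[] t = new long[WIDTH];
            try {
                for (int i = 0; i < WIDTH; i++) t[i] = in.readLong();
                head = t;
            } catch (EOFException e) {
                head = null;
                in.close();
            }
        }
    }

    // Sort one in-memory segment and write it out as raw longs.
    static Path spill(List<long[]> segment) throws IOException {
        segment.sort(ExternalSortSketch::compare);
        Path file = Files.createTempFile("spill", ".bin");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (long[] t : segment)
                for (long v : t) out.writeLong(v);
        }
        return file;
    }

    // Spill every spillSize tuples, then merge all spill files into the sink.
    static void sort(Iterable<long[]> input, int spillSize, Consumer<long[]> sink) {
        try {
            List<Path> files = new ArrayList<>();
            List<long[]> segment = new ArrayList<>();
            for (long[] t : input) {
                segment.add(t);
                if (segment.size() >= spillSize) {
                    files.add(spill(segment));
                    segment = new ArrayList<>();
                }
            }
            if (!segment.isEmpty()) files.add(spill(segment));

            // The heap always holds the smallest unread tuple of every file.
            PriorityQueue<Run> heap = new PriorityQueue<>((x, y) -> compare(x.head, y.head));
            for (Path f : files) heap.add(new Run(f));
            while (!heap.isEmpty()) {
                Run r = heap.poll();
                sink.accept(r.head);
                r.advance();
                if (r.head != null) heap.add(r);
            }
            for (Path f : files) Files.delete(f);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<long[]> data = Arrays.asList(
            new long[]{3, 1, 1}, new long[]{1, 2, 3}, new long[]{2, 0, 0}, new long[]{1, 1, 9});
        sort(data, 2, t -> System.out.println(Arrays.toString(t)));
    }
}
```

The merged output goes to a Consumer rather than to a file, which is one way to
picture streaming sorted tuples straight into an index builder instead of
writing them to disk first.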

This message is automatically generated by JIRA.