incubator-jena-dev mailing list archives

From "Paolo Castagna (Commented) (JIRA)" <>
Subject [jira] [Commented] (JENA-117) A pure Java version of tdbloader2, a.k.a. tdbloader3
Date Fri, 02 Mar 2012 17:31:58 GMT


Paolo Castagna commented on JENA-117:

Hi Sarven, here are a few answers to your questions.

> --compression Use compression for intermediate files 
> --gzip-outside Use GZIPOutputStream(BufferedOutputStream()) rather than BufferedOutputStream(GZIPOutputStream())
> --buffer-size The size of buffers for IO in bytes 
> --no-buffer Do not use Buffered{Input|Output}Stream

Those are all options to control the DataOutputStream/DataInputStream instances which are used during
the processing. In the source code you can find this:

  if ( ! buffered ) {
      // not buffered: optionally wrap the raw stream with GZIP compression
      return new DataOutputStream( compression ? new GZIPOutputStream(out) : out ) ;
  } else {
      if ( gzip_outside ) {
          // GZIP as the outermost wrapper: GZIP(Buffered(out))
          return new DataOutputStream( compression ? new GZIPOutputStream(new BufferedOutputStream(out, buffer_size)) : new BufferedOutputStream(out, buffer_size) ) ;
      } else {
          // GZIP as the innermost wrapper: Buffered(GZIP(out))
          return new DataOutputStream( compression ? new BufferedOutputStream(new GZIPOutputStream(out, buffer_size)) : new BufferedOutputStream(out, buffer_size) ) ;
      }
  }

This is me experimenting to find the best combination. I still do not have an answer, and
I welcome suggestions and results from experiments. That is the reason why I put those configuration
parameters on the command line. Ideally, once we find what works best, we should use
that as the default and either eliminate the parameters or leave them in for advanced users only.
The buffer size is 8192 bytes by default.
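
For completeness, the input side is wrapped symmetrically. Here is a minimal sketch (the class
and method names are illustrative, not the exact tdbloader3 code):

  import java.io.BufferedInputStream;
  import java.io.DataInputStream;
  import java.io.IOException;
  import java.io.InputStream;
  import java.util.zip.GZIPInputStream;

  public class StreamWrapping {
      // Mirror of the output side: optional buffering, optional GZIP decompression.
      static DataInputStream wrap(InputStream in, boolean compression,
                                  boolean buffered, int buffer_size) throws IOException {
          if ( ! buffered ) {
              return new DataInputStream( compression ? new GZIPInputStream(in) : in );
          }
          return new DataInputStream( compression
                  ? new GZIPInputStream(new BufferedInputStream(in, buffer_size))
                  : new BufferedInputStream(in, buffer_size) );
      }
  }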

> --spill-size The size of spillable segments in tuples|records
> --spill-size-auto Automatically set the size of spillable segments 
> --max-merge-files Specify the maximum number of files to merge at the same time (default:

These are more parameters for advanced users only, to allow experiments and find out what
works best. tdbloader3 uses 'data bags' which spill data to disk, because we cannot assume
that the data at any stage fits into RAM, and we want to avoid disk seeks. So, for example, if we
want to sort data which does not fit in RAM, we sort it in RAM in chunks, dump each sorted chunk
to disk, process another chunk, and so on; at the end we sort-merge all the chunks. The --spill-size
parameter controls how many tuples are kept in RAM before spilling to disk. A good value is not
easy to know: it depends on how many bytes each tuple takes, and tuples have different sizes at
different stages of the computation. Ideally, users should not even have to think about this.
This is why I tried to have an adaptive strategy (i.e. --spill-size-auto). With --spill-size-auto,
tdbloader3 constantly monitors the RAM available in the JVM and triggers the spilling to disk when
the available RAM approaches a certain threshold; a sketch of the idea is below. Things are more
complicated if you have multiple threads, and I am still unsure whether this is a good strategy
or not. The aim is to have autotuning on by default, so that users do not have to think about
spill sizes (see also: JENA-126 and JENA-157).
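
A minimal sketch of the adaptive idea (the names and the 20% threshold are illustrative, not
the actual tdbloader3 code):

  import java.util.ArrayList;
  import java.util.List;

  public class AdaptiveSpill {
      // Spill when less than 20% of the maximum heap is still obtainable.
      private static final double THRESHOLD = 0.20;

      static boolean shouldSpill() {
          Runtime rt = Runtime.getRuntime();
          // unallocated heap + free space inside the already allocated heap
          long available = rt.maxMemory() - rt.totalMemory() + rt.freeMemory();
          return available < rt.maxMemory() * THRESHOLD;
      }

      public static void main(String[] args) {
          List<long[]> chunk = new ArrayList<long[]>();
          for (long i = 0; i < 10000000L; i++) {
              chunk.add(new long[] { i, i, i });   // pretend these are tuples
              if (shouldSpill()) {
                  // here: sort the chunk, write it to a temporary file...
                  chunk.clear();                   // ...and start a new chunk
              }
          }
      }
  }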
--max-merge-files specifies the maximum number of files/chunks to sort-merge at the same time,
once each chunk has been sorted and spilled to disk. So, for example, if you end up with 10000
temporary files, the sort-merge will happen in two rounds: in the first round it generates 100
new files (sort-merging 100 files at a time), and then a last round sort-merges the 100 newly
generated files. This is because reading from too many files at the same time does not work well.
Why 100? The Hadoop source code says that they found 100 works best for them when doing a
very similar thing. Here is another area where more experiments will help in finding a reasonable
default value.
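
To make the rounds arithmetic concrete, here is a tiny sketch (illustrative only, not the
tdbloader3 code):

  public class MergeRounds {
      // Number of sort-merge rounds needed to reduce 'files' chunks to one,
      // merging at most 'maxMergeFiles' of them at a time.
      static int rounds(int files, int maxMergeFiles) {
          int rounds = 0;
          while (files > 1) {
              files = (int) Math.ceil((double) files / maxMergeFiles);
              rounds++;
          }
          return rounds;
      }

      public static void main(String[] args) {
          System.out.println(rounds(10000, 100));   // prints 2
      }
  }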

> --no-stats Do not generate the stats file 

This one is easy: by default tdbloader3 generates the stats.opt file (see the "Choosing the
optimizer strategy" section of the TDB optimizer documentation).
You can ignore that option; the stats.opt file can be generated later via TDB's tdbstats command.
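
For example (a sketch; it assumes the TDB jars are on the classpath and uses the tdb.tdbstats
command class behind the tdbstats script, which writes the statistics to standard output):

  public class GenerateStats {
      public static void main(String[] args) throws Exception {
          // Roughly equivalent to: tdbstats --loc=/usr/lib/fuseki/DB/WorldBank
          tdb.tdbstats.main("--loc=/usr/lib/fuseki/DB/WorldBank");
      }
  }

Redirect the output into a file named stats.opt inside the database directory.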

Now, your errors:

> $ java -cp target/jena-tdbloader3-0.1-incubating-SNAPSHOT-jar-with-dependencies.jar -server
-d64 -Xmx2000M cmd.tdbloader3 --no-stats --compression --spill-size 1500000 --loc /usr/lib/fuseki/DB/WorldBank
> INFO Load: /tmp/indicators.tar.gz -- 2012/03/02 10:49:39 EST
> ERROR [line: 1, col: 13] Unknown char: (0)

I think this is because you are trying to load a .gz which contains a tar with multiple files.
tdbloader3 does not support that.
My advice is to convert and validate all your files, from whatever format you have, into N-Triples
or N-Quads.
Concatenate all the N-Triples or N-Quads files into a single .nt or .nq file and gzip it, so
that you end up with a single filename.nt.gz (which contains a single file).
Try loading that using tdbloader2 on a 64-bit machine with as much RAM as you have, and use
-Xmx2048m for the JVM.
If you try tdbloader3 as well on the same machine, give the JVM as much RAM as you can via
-Xmx..., since tdbloader3 does not use memory-mapped files.
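
If it helps with the conversion, here is a minimal per-file sketch using Jena's Model API (the
filenames and the "RDF/XML" input syntax are placeholders; it loads the whole file into memory,
so run it once per file and concatenate the .nt outputs afterwards):

  import java.io.FileInputStream;
  import java.io.FileOutputStream;
  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.rdf.model.ModelFactory;

  public class ToNTriples {
      public static void main(String[] args) throws Exception {
          Model model = ModelFactory.createDefaultModel();
          FileInputStream in = new FileInputStream("input.rdf");
          model.read(in, null, "RDF/XML");      // parse (and implicitly validate) the input
          in.close();
          FileOutputStream out = new FileOutputStream("output.nt");
          model.write(out, "N-TRIPLE");         // re-serialize as N-Triples
          out.close();
      }
  }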

> $ java tdb.tdbquery --desc=/usr/lib/fuseki/tdb2.worldbank.ttl 'SELECT * WHERE { ?s ?p
?o . } LIMIT 100'
> 10:56:30 WARN ModTDBDataset :: Unexpected: Not a TDB dataset for type DatasetTDB 

Please double-check that your tdb2.worldbank.ttl is pointing at the right directory.
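
A quick way to test the description file (a sketch; it assumes the TDB jars are on the classpath
and uses TDB's assembler support):

  import com.hp.hpl.jena.query.Dataset;
  import com.hp.hpl.jena.tdb.TDBFactory;

  public class CheckAssembler {
      public static void main(String[] args) {
          // If the tdb:location in the description is wrong, this should make it visible.
          Dataset ds = TDBFactory.assembleDataset("/usr/lib/fuseki/tdb2.worldbank.ttl");
          System.out.println("Assembled dataset: " + ds);
          ds.close();
      }
  }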

> One final thing I'd like to know how to do is assigning graph names. --graph is not available
as it was in tdbloader. 

Right. One way to go around this would be to use files in N-Quads format instead of N-Triples:
in N-Quads each line can carry the graph name as a fourth field, e.g. <s> <p> <o> <http://example.org/graph> .

I have worked on tdbloader3 only "out-of-band", but things might change (if there are people
interested). You are not the only one who needs some patience when dealing with datasets of more
than 500 million triples. One dataset I want to experiment with is Freebase (i.e. ~600 million
triples) and I have only 8 GB of RAM on my desktop. This certainly is a good experiment for tdbloader3.
> A pure Java version of tdbloader2, a.k.a. tdbloader3
> ----------------------------------------------------
>                 Key: JENA-117
>                 URL:
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: TDB
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: performance, tdbloader2
>         Attachments: TDB_JENA-117_r1171714.patch
> There is probably a significant performance improvement for tdbloader2 in replacing the
UNIX sort over text files with an external sorting pure Java implementation.
> Since JENA-99 we now have a SortedDataBag which does exactly that.
>     ThresholdPolicyCount<Tuple<Long>> policy = new ThresholdPolicyCount<Tuple<Long>>(1000000);
>     SerializationFactory<Tuple<Long>> serializerFactory = new TupleSerializationFactory();
>     Comparator<Tuple<Long>> comparator = new TupleComparator();
>     SortedDataBag<Tuple<Long>> sortedDataBag = new SortedDataBag<Tuple<Long>>(policy, serializerFactory, comparator);
> TupleSerializationFactory creates TupleInputStream|TupleOutputStream which are wrappers
around DataInputStream|DataOutputStream. TupleComparator is trivial.
> Preliminary results seem promising and show that the Java implementation can be faster
than UNIX sort, since it uses smaller binary files (instead of text files) and it compares
long values rather than strings.
> An example of ExternalSort which compares SortedDataBag vs. UNIX sort is available here:
> A further advantage of doing the sorting with Java rather than UNIX sort is that we could
stream results directly into the BPlusTreeRewriter, rather than writing them to disk and then
reading them back from disk into the BPlusTreeRewriter.
> I've not done an experiment yet to see if this is actually a significant improvement.
> Using compression for intermediate files might help, but more experiments are necessary
to establish if it is worthwhile or not.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see:

