incubator-jena-dev mailing list archives

From Paolo Castagna <>
Subject Re: tdbloader3: getting closer... help welcome!
Date Fri, 16 Sep 2011 09:14:46 GMT
Paolo Castagna wrote:
>  - Add MiniMRCluster so that it is easy for developers to run tests with multiple reducers on a laptop.
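
For reference, a rough sketch of how MiniMRCluster can be driven from a test on a single machine (Hadoop 0.20.x test API; the class and job names here are placeholders, the actual tests are in the branch):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MiniMRCluster;

public class MultiReducerTestSketch {
    public static void main(String[] args) throws Exception {
        // in-process MapReduce cluster: 2 task trackers over the local file system
        MiniMRCluster mr = new MiniMRCluster(2, "file:///", 1);
        try {
            JobConf conf = mr.createJobConf();
            conf.setNumReduceTasks(2);      // exercise the multi-reducer code path
            // ... configure input/output paths and submit the job under test here ...
        } finally {
            mr.shutdown();
        }
    }
}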


>  - Split the first MapReduce job into two: one to produce offset values for each partition, the other to generate data files with correct ids for subsequent jobs.
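
Presumably the point of the split is the usual two-pass id assignment: the first job produces a starting offset for each partition, the second rewrites the data assigning each node the id offset + local position, so every reducer can hand out globally unique ids without coordination. The arithmetic is just a prefix sum (plain Java sketch; the method name is only for illustration, this is not the actual job code):

// Sketch of the offset arithmetic only: given how many nodes each
// partition emits, the starting id of partition i is the sum of the
// counts of all previous partitions (a prefix sum).
static long[] startingIds(long[] nodesPerPartition) {
    long[] offsets = new long[nodesPerPartition.length];
    long next = 0;
    for (int i = 0; i < nodesPerPartition.length; i++) {
        offsets[i] = next;
        next += nodesPerPartition[i];
    }
    return offsets;
}
// e.g. counts {3, 5, 2} give offsets {0, 3, 8}; the j-th node produced
// by partition i then gets the globally unique id offsets[i] + j.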


>  - Build the node table by concatenating the output files from the MapReduce jobs above.


All the changes are in a branch, here:

There is only one final step which is currently not done using MapReduce:
the node2id.dat|idn files (i.e. the B+Tree index which maps 128-bit RDF node
hashes to 64-bit RDF node ids) are built from the nodes.dat file at the end
of all the MapReduce jobs, with a loop along these lines:

// Scan the whole objects file (nodes.dat): each entry is (id, encoded node).
Iterator<Pair<Long, ByteBuffer>> iter = objects.all() ;
while ( iter.hasNext() ) {
    Pair<Long, ByteBuffer> pair = iter.next() ;
    long id = pair.getLeft() ;
    // Decode the RDF node and recompute its 128-bit hash.
    Node node = NodeLib.fetchDecode(id, objects) ;
    Hash hash = new Hash(recordFactory.keyLength()) ;
    setHash(hash, node) ;
    byte k[] = hash.getBytes() ;
    // Build the (hash -> node id) record and add it to the node2id index.
    Record record = recordFactory.create(k) ;
    Bytes.setLong(id, record.getValue(), 0) ;
    index.add(record) ;   // 'index' is the node2id B+Tree being built
}

I need to run a few experiments, but this saves a find() to check whether a record
is already in the index: we know the objects file contains only unique RDF nodes.
Indeed, while I was doing this I looked back at tdbloader2 and I think we could
use the BPlusTreeRewriter 'trick' for the node table as well. I cannot reuse
BPlusTreeRewriter as it is, since it was written for the SPO, GSPO, etc. indexes,
where records have 3 or 4 slots of constant size (64 bits each).

In the case of the node table, records have only two slots: a 128-bit hash for the
key and a 64-bit node id for the value.
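
The record layout itself is nothing exotic: a RecordFactory with a 16-byte key and an 8-byte value, plus a sort of the records by hash before they can be packed bottom-up, BPlusTreeRewriter-style. A rough sketch of what a node table variant could look like (names are illustrative; the sort/pack step is only hinted at in the comments):

// Sketch only: the node table record is a 16-byte key (the 128-bit hash)
// and an 8-byte value (the 64-bit node id).
RecordFactory nodeTableRecords = new RecordFactory(16, 8) ;

Record toRecord(Hash hash, long id) {
    Record r = nodeTableRecords.create(hash.getBytes()) ;
    Bytes.setLong(id, r.getValue(), 0) ;
    return r ;
}

// The (hash, id) records then need to be sorted by key (the hash) and
// streamed into a bottom-up B+Tree builder, i.e. the same idea that
// BPlusTreeRewriter uses for the fixed-width SPO/GSPO indexes, but with
// this asymmetric two-slot record instead.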

I am keen to try to improve the first phase of tdbloader2, since I expect it could
further improve performance and scalability (in particular once the node table
indexes no longer fit in RAM).

@Andy, does this idea make sense?

>  - Test on a cluster with a large (> 1B) dataset.


