From: Paolo Castagna
Date: Fri, 16 Sep 2011 10:14:46 +0100
To: jena-dev@incubator.apache.org
Subject: Re: tdbloader3: getting closer... help welcome!

Paolo Castagna wrote:
> TODO:
>
>  - Add MiniMRCluster so that it is easy for developers to run tests with
>    multiple reducers on a laptop.

Done.

>  - Split the first MapReduce job into two: one to produce offset values
>    for each partition, the other to generate data files with correct ids
>    for subsequent jobs.

Done.

>  - Build the node table concatenating output files from the MapReduce
>    jobs above.

Done.

All the changes are in a branch, here:
https://github.com/castagna/tdbloader3/tree/hadoop-0.20.203.0

There is only one final step which is currently not done using MapReduce:
the node2id.dat|idn files (i.e. the B+Tree index to map RDF node hashes of
128 bits to RDF node ids (68 bits)) are built from the nodes.dat file at
the end of all MapReduce jobs.
    Iterator<Pair<Long, ByteBuffer>> iter = objects.all() ;
    while ( iter.hasNext() ) {
        Pair<Long, ByteBuffer> pair = iter.next() ;
        long id = pair.getLeft() ;
        Node node = NodeLib.fetchDecode(id, objects) ;   // decode the RDF node stored at this id
        Hash hash = new Hash(recordFactory.keyLength()) ;
        setHash(hash, node) ;                            // compute the 128-bit hash of the node
        byte k[] = hash.getBytes() ;
        Record record = recordFactory.create(k) ;        // record keyed by the hash
        Bytes.setLong(id, record.getValue(), 0) ;        // value slot holds the node id
        nodeToId.add(record) ;                           // no find() needed: every node is unique
    }

I need to run a few experiments, but this saves a find() to check whether a
record is already in the index: we know the objects file contains only
unique RDF node values.

Indeed, while I was doing this I looked back at tdbloader2 and I think we
could use the BPlusTreeRewriter 'trick' for the node table as well. I cannot
reuse BPlusTreeRewriter as it is, since it has been written for the SPO,
GSPO, etc. indexes, where records have 3 or 4 slots of constant size (64
bits each). In the case of the node table the records have only two slots:
128 bits for the hash and 68 bits for the node id. (A minimal, illustrative
sketch of this two-slot record layout is at the end of this message.)

I am keen to try to improve the first phase of tdbloader2, since I expect it
could further improve performance and scalability (in particular when the
node table indexes no longer fit in RAM).

@Andy, does this idea make sense?

> - Test on a cluster with a large (> 1B) dataset.

Soon...

Paolo
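
A minimal, self-contained sketch of the two-slot record layout described
above: a 16-byte key holding the 128-bit node hash, followed by an 8-byte
value holding the node id (stored in a long, as in the loop above, even
though the ids are described as 68-bit). The class and method names are made
up for this sketch and are not the TDB or tdbloader3 API; MD5 merely stands
in here for the 128-bit node hash.

    import java.nio.ByteBuffer ;
    import java.security.MessageDigest ;

    public class NodeTableRecordSketch {
        static final int KEY_LEN   = 16 ;  // 128-bit hash
        static final int VALUE_LEN = 8 ;   // node id, stored in a long here

        // Pack (hash, id) into the fixed-width record a BPlusTreeRewriter-style
        // bulk build would consume in hash-sorted order.
        static byte[] record(byte[] hash, long id) {
            ByteBuffer bb = ByteBuffer.allocate(KEY_LEN + VALUE_LEN) ;
            bb.put(hash, 0, KEY_LEN) ;
            bb.putLong(id) ;
            return bb.array() ;
        }

        public static void main(String[] args) throws Exception {
            byte[] hash = MessageDigest.getInstance("MD5")
                                       .digest("<http://example/s>".getBytes("UTF-8")) ;
            byte[] rec  = record(hash, 42L) ;
            System.out.println("record length = " + rec.length + " bytes") ;  // 24
        }
    }

Streaming records of this shape, sorted by hash, into a leaf-packing
rewriter is the part that would need a variant of BPlusTreeRewriter able to
handle two-slot records of this size.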