incubator-jena-dev mailing list archives

From: Paolo Castagna <>
Subject: Re: Blank nodes and MapReduce
Date: Tue, 28 Jun 2011 15:48:38 GMT
Hi Andy,
thanks for your suggestions (I still need to find the best way to set
a UUID per job run across the cluster... probably via a configuration
property, or maybe I can use the job id).
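
This is roughly what I have in mind (the property name and class below are
just placeholders, not existing tdbloader3 or Hadoop names): the driver sets
a random id once in the job Configuration and every task reads it back.

  import java.io.IOException;
  import java.util.UUID;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public class JobRunId {

      // Hypothetical property key.
      public static final String RUN_ID_PROPERTY = "tdbloader3.runid";

      // Driver side: set the id once, before the job is submitted.
      public static Job createJob(Configuration conf) throws IOException {
          conf.set(RUN_ID_PROPERTY, UUID.randomUUID().toString());
          return new Job(conf, "tdbloader3 first job");
      }

      // Task side (e.g. in Mapper.setup()): all tasks see the same value.
      public static String runId(Configuration conf) {
          return conf.get(RUN_ID_PROPERTY);
      }
  }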

The following is just to clarify what I have done so far and why.

With MapReduce we can have multiple files (a.k.a. input paths) per job.
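
Something along these lines in the driver (the paths below are placeholders):

  FileInputFormat.addInputPath(job, new Path("/input/file1.nq"));
  FileInputFormat.addInputPath(job, new Path("/input/file2.nq"));
  // FileInputFormat is org.apache.hadoop.mapreduce.lib.input.FileInputFormat,
  // Path is org.apache.hadoop.fs.Path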

For example, let's take these two files (each file could be split into
multiple chunks):

"File 1":

<foo:x> <foo:y> _:bnode1 .
_:bnode1 <foo:z> "1" .

"File 2":

<foo:x> <foo:y> _:bnode1 .
_:bnode1 <foo:z> "2" .

These are (key,value) pairs in input and output for the map function which
is processing "File 1":

< (0, [urn:x-arq:DefaultGraphNode foo:x foo:y mrbnode_-1648518150_bnode1])
> (<foo:x>, 3414670bf0215c8170aed7c8056103a4|S)
> (<foo:y>, 3414670bf0215c8170aed7c8056103a4|P)
> (_:mrbnode_-1648518150_bnode1, 3414670bf0215c8170aed7c8056103a4|O)

< (27, [urn:x-arq:DefaultGraphNode mrbnode_-1648518150_bnode1 foo:z "1"])
> (_:mrbnode_-1648518150_bnode1, 5c240b5b786700371abaa72033ab479e|S)
> (<foo:z>, 5c240b5b786700371abaa72033ab479e|P)
> ("1", 5c240b5b786700371abaa72033ab479e|O)
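
The map function which produces these pairs has roughly the following shape
(a sketch only: the helper methods below are stand-ins for what tdbloader3
actually does, not its real code):

  import java.io.IOException;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  import com.hp.hpl.jena.graph.Node;
  import com.hp.hpl.jena.sparql.core.Quad;

  public abstract class FirstJobMapperSketch
          extends Mapper<LongWritable, Text, Text, Text> {

      @Override
      protected void map(LongWritable offset, Text line, Context context)
              throws IOException, InterruptedException {
          Quad quad = parseQuad(line);   // parse one line with RIOT, see below
          String id = quadId(quad);      // the 'unique' id of the quad
          context.write(new Text(serialize(quad.getSubject())),   new Text(id + "|S"));
          context.write(new Text(serialize(quad.getPredicate())), new Text(id + "|P"));
          context.write(new Text(serialize(quad.getObject())),    new Text(id + "|O"));
      }

      protected abstract Quad parseQuad(Text line);
      protected abstract String quadId(Quad quad);
      protected abstract String serialize(Node node);
  }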

The blank node label "mrbnode_-1648518150_bnode1" has the hash code of the
input path in it. If I understood your suggestion correctly, I should also
add a random number, unique across the whole cluster, so that each run
generates different blank node labels.

These are the (key,value) pairs in input and output for the map function which
is processing "File 2":

< (0, [urn:x-arq:DefaultGraphNode foo:x foo:y mrbnode_-1648488359_bnode1])
> (<foo:x>, 8fbdd7930f86f7705167ca3455136a38|S)
> (<foo:y>, 8fbdd7930f86f7705167ca3455136a38|P)
> (_:mrbnode_-1648488359_bnode1, 8fbdd7930f86f7705167ca3455136a38|O)

< (27, [urn:x-arq:DefaultGraphNode mrbnode_-1648488359_bnode1 foo:z "2"])
> (_:mrbnode_-1648488359_bnode1, 96eae6c91f10df75cac8dbb7c4004732|S)
> (<foo:z>, 96eae6c91f10df75cac8dbb7c4004732|P)
> ("2", 96eae6c91f10df75cac8dbb7c4004732|O)

Notice: both files contain "<foo:x> <foo:y> _:bnode1", but they generate
different triple/quad 'unique' ids and different blank node labels.
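
If it helps to see the id generation concretely: an id with that property can
be obtained as a digest over the serialized quad, blank node labels included
(a sketch assuming MD5, not necessarily what tdbloader3 does):

  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;

  public class QuadIdSketch {
      // Digest of the serialized quad: quads differing only in their blank
      // node labels get different ids.
      public static String id(String serializedQuad) throws NoSuchAlgorithmException {
          MessageDigest md5 = MessageDigest.getInstance("MD5");
          StringBuilder hex = new StringBuilder();
          for (byte b : md5.digest(serializedQuad.getBytes())) {
              hex.append(String.format("%02x", b));
          }
          return hex.toString();
      }
  }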

This is how I use RIOT to parse triples|quads in my first MapReduce job:

  Path path = ... // the input file; each split of the same file has the same path

  Prologue prologue = new Prologue(null, IRIResolver.createNoResolve());
  LabelToNode labelMapping = new MapReduceLabelToNode(path);
  ParserProfile profile = new MapReduceParserProfile(prologue,
      ErrorHandlerFactory.errorHandlerStd, labelMapping);
  // 'value' is the Text value passed to the map function
  Tokenizer tokenizer = TokenizerFactory.makeTokenizerASCII(value.toString());
  LangNQuads parser = new LangNQuads(tokenizer, profile, null);
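
MapReduceLabelToNode essentially wraps an allocator along the lines of the
snippet quoted below, mixing the input path (and, once I add it, the per-job
random id) into the label. A simplified sketch (the field names and interface
details are not the real ones):

  import com.hp.hpl.jena.graph.Node;
  import com.hp.hpl.jena.rdf.model.AnonId;

  public class MapReduceAllocatorSketch {

      private final String runId;  // job-wide random id, e.g. the UUID above
      private final String path;   // input path of the split being parsed

      public MapReduceAllocatorSketch(String runId, String path) {
          this.runId = runId;
          this.path = path;
      }

      // Same label + same file + same run  -> same blank node.
      // Same label in another file or run  -> a different blank node.
      public Node create(String label) {
          return Node.createAnon(new AnonId(
              "mrbnode_" + runId + "_" + path.hashCode() + "_" + label));
      }
  }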

This is how I use RIOT to output intermediate (key,value) pairs:

  StringWriter out = new StringWriter();
  OutputLangUtils.output(out, node, null,

The second and third MapReduce jobs in tdbloader3 use node ids.

I have a test [1] which checks (via isomorphism) whether the TDB indexes generated
by tdbloader3 and the ones created by the usual TDB loaders are the same. The trick
of adding the hash code of the input path to the blank node labels is the
only one I have found so far which passes that test with the input shown above.
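
The comparison itself boils down to Jena's isomorphism test between models; a
simplified sketch with placeholder locations, while the actual test works on
the generated TDB indexes:

  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.tdb.TDBFactory;

  public class IsomorphismCheckSketch {
      public static boolean sameDefaultGraph(String location1, String location2) {
          Model m1 = TDBFactory.createModel(location1);
          Model m2 = TDBFactory.createModel(location2);
          return m1.isIsomorphicWith(m2);
      }
  }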



Andy Seaborne wrote:
> On 28/06/11 14:37, Paolo Castagna wrote:
>> Andy Seaborne wrote:
>>>>    public Node create(String label) {
>>>>        return Node.createAnon(new AnonId(filename + "-" + label)) ;
>>>>    }
>>> The way I thought was to allocate a UUID per parser run (or any other
>>> sufficiently large random number), xor the label into the UUID to
>>> produce the bNode label.  This is a non-localised label allocation
>>> scheme.
>> Hi Andy,
>> I am not sure this would work with MapReduce as files are split into
>> multiple
>> chunks and different machines can process splits from the same file.
> Exactly - by "parser run" I mean all the separate parsing actions in one
> step of the process.  Allocate one large job random number as the base
> of bNode label generation across the whole cluster.
> Per job instance means it's different next time, which is important if the
> data is merged with other data.
>> Let's say I have this file, split into two chunks:
>>    ----------------------------
>>    <foo:bar>  <foo:p>  _:bnode1 .      split 1
>>    _:bnode1 <foo:q>  "1" .
>>    ----------------------------
>>    _:bnode1 <foo:r>  "2" .           split 2
>>    ----------------------------
>> I need to ensure the 'bnode1' label in split 1 and 2 refers to the
>> same blank
>> node even if the splits are parsed separately. However, the same
>> 'bnode1' label
>> from a different file must represent a different blank node. In
>> practice, with
>> MapReduce, I cannot assume that a file is parsed in a single "parser
>> run".
>>>> Therefore, I would like to have my own
>>>> LabelToNode implementation with an Allocator<String, Node>   which
>>>> takes into
>>>> account the filename (or a hash of it) when it creates a new blank
>>>> node.
>>>> But LabelToNode constructor is private.
>>>> Could we make it protected?
>>> Now public.
>> Thanks.
>> Paolo
>>>> Or, alternatively, how can I construct a LabelToNode object which will
>>>> be using
>>>> my MapReduceAllocator?
>>> LabelToNode createUseLabelAsGiven()
>>>      Andy
