incubator-jena-dev mailing list archives

From: Paolo Castagna <>
Subject: Re: TDB and hash ids
Date: Thu, 03 Nov 2011 15:34:59 GMT
Hi Andy

Andy Seaborne wrote:
> On 03/11/11 13:19, Paolo Castagna wrote:
>>  From my experience with tdbloader3 and parallel processing, I say that
>> the fact that current node ids (currently 64 bits) are offsets in the
>> nodes.dat file is a big "impediment" to distributed/parallel processing.
>> Mainly because, whatever you do, you first need to build a dictionary,
>> and that is not trivial to do in parallel.
> Loading, I agree; but general distributed/parallel processing?

The fact that you can read some RDF on different machines and quickly
move from the RDF node value space to the node id space, simply by
computing a hash function over the RDF node value, makes all sorts of
processing easier and faster: node ids are smaller than node values, and
for a lot of processing (e.g. counting/stats, even inferencing) you only
need equality between your RDF nodes, not the actual values.

In a lot of MapReduce jobs I use RDF node values and that is fine... but
shuffling RDF node values around isn't great. So, you want to move to
the node id space, do all your processing there and go back to node
values only at the end. Same as for a query engine.
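
To make this concrete, here is a minimal sketch (plain Java, not TDB code;
the canonical string form and the truncation width are my assumptions) of
deriving a fixed-size id from an RDF node value by hashing, so that every
machine computes the same id without a shared dictionary:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.Arrays;

    public class HashNodeIdSketch {
        // Derive a fixed-width id from the canonical string form of an RDF node.
        // Two machines hashing the same node value get the same id, with no
        // coordination and no shared dictionary.
        public static byte[] nodeId(String canonicalForm, int idBytes) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(canonicalForm.getBytes(StandardCharsets.UTF_8));
            return Arrays.copyOf(digest, idBytes); // e.g. 8 (64 bits) or 16 (128 bits)
        }
    }

With something like this a MapReduce job can shuffle 8 or 16 byte keys
instead of full IRIs/literals and still rely on equality only.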

Another example of this is WebPIE (i.e. inference with MapReduce):
... they use a dictionary to reduce the data they move around.

But, my remarks were in the context of how to load data in parallel
and specifically how to produce the TDB indexes in parallel (with or
without MapReduce).

>> However, if we could, given an RDF node value, generate a node id with
>> a hash function (sufficiently big that the probability of a collision
>> is less than being hit by an asteroid) (128 bits?), then tdbloader3 could
>> be massively simplified, merging TDB indexes directly would become trivial
>> (as for Lucene indexes), ... my life at work would be so much simpler!
> Are you going to test out using hashes as ids in TDB?

Yes, I'd like to look at this and explore what the options are, with their
pros and cons. However, there are things I am not sure about and need to
investigate first.

Will hashes of 64 bits be fine?

What would be a good hash function to use?

How do we instruct TDB to use hash-based rather than offset-based node ids?

  - System property? (a rough sketch of this route follows the list)
  - TDB symbols (a la ARQ symbols)?
  - Meta info files? (I've never used those, so I need to learn how to do that.)
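
For the system property option, I imagine nothing more than a flag read
once when the node table is created. A minimal sketch, assuming a made-up
property name ("tdb.nodeid.scheme" does not exist in TDB today):

    // Hypothetical: the property name "tdb.nodeid.scheme" is made up and does
    // not exist in TDB. The point is only that the choice between offset-based
    // and hash-based ids has to be read once, when the node table is created,
    // and can never change afterwards for that database.
    public class NodeIdSchemeConfig {
        public enum Scheme { OFFSET, HASH }

        public static Scheme fromSystemProperty() {
            String value = System.getProperty("tdb.nodeid.scheme", "offset");
            return "hash".equalsIgnoreCase(value) ? Scheme.HASH : Scheme.OFFSET;
        }
    }

The meta info files look attractive precisely because the choice would be
recorded per database rather than per JVM, but as I said I need to learn
how they work first.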

But, yes, I'd like to provide a NodeTable implementation for TDB which uses
hashes instead of file offsets for node ids.
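
Just to sketch the shape I have in mind (the names are mine, not TDB's, and
the real NodeTable interface is richer than this): node -> id becomes a pure
computation, while id -> node still needs a stored mapping so that results
can be turned back into nodes.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Toy sketch only: node -> id is a pure computation (a truncated hash),
    // while id -> node is kept in an in-memory map here (in TDB it would be
    // the nodes.dat side of things).
    public class HashNodeTableSketch {
        private final Map<Long, String> idToNode = new ConcurrentHashMap<Long, String>();

        public long getAllocateNodeId(String canonicalForm) {
            long id = hash64(canonicalForm);
            idToNode.put(id, canonicalForm); // kept only to turn ids back into nodes
            return id;
        }

        public String getNodeForNodeId(long id) {
            return idToNode.get(id);
        }

        private static long hash64(String s) {
            try {
                byte[] digest = MessageDigest.getInstance("SHA-256")
                                             .digest(s.getBytes(StandardCharsets.UTF_8));
                return ByteBuffer.wrap(digest).getLong(); // first 64 bits of the digest
            } catch (NoSuchAlgorithmException e) {
                throw new RuntimeException(e);
            }
        }
    }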

Do you have suggestions for this?

I find it easier to find and track info when there is a JIRA issue associated
with it. This is clearly an experiment and/or a new feature. Do you mind if we
create a new one (assigned to me)?

I have no time to do anything on this for the next two weeks, but after
that I'd like to give it a go with your guidance so we make sure I make
progress in the right direction. :-)

> It needs someone to actually try it out in an experimental branch.  What
> about the consequences of turning ids back into Nodes for results (which
> could be done in parallel with much of query evaluation)?

Yep. Good question.

>> The drawback of 128-bit node ids is that suddenly you might need to
>> double your RAM to achieve the same performance (to be proven and verified
>> with experiments). However, there are many other good side effects from
>> what you can fit into 128 bits. For example, I am not so sure anymore
>> whether an optimization such as the one proposed in JENA-144 is possible
>> without ensuring that all node values can be encoded in the bits available
>> in the node id:
> Do the calculation on clash probabilities.

You are the mathematician here; if you say X bits will do, I trust you. :-)

Seriously, do you think the hash should be 128 bits or less?
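
For what it's worth, here is the back-of-the-envelope birthday bound I have
in mind (the one-billion-node figure is just an example I picked, not a
measurement):

    // Birthday bound: for n distinct nodes and b-bit hashes, the probability
    // of at least one collision is roughly n^2 / 2^(b+1).
    public class CollisionEstimate {
        static double collisionProbability(double n, int bits) {
            return (n * n) / (2.0 * Math.pow(2.0, bits));
        }

        public static void main(String[] args) {
            System.out.println(collisionProbability(1e9, 64));  // ~0.027   (64 bits, 10^9 nodes)
            System.out.println(collisionProbability(1e9, 128)); // ~1.5e-21 (128 bits, 10^9 nodes)
        }
    }

If I am computing this right, 64 bits already gives a few percent chance of
a collision at a billion nodes, which makes me lean towards 128.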

> e.g.
> doi:
> doi:
> Re: JENA-144:
> ?? Use the same scheme as present - section the id space into values
> and, separately, also hashes.  Cost: one bit.

The problem I see (and do not know how to solve) is: what do we currently
do if an RDF node value does not fit in the 63 bits and therefore cannot
be encoded in the node id?

We put it in the node table (i.e. the nodes.dat file), like the other node values.

At query time, say we implement the optimization described in JENA-144 and
we search for something between two values. How do we know that all the
values in that range were small enough to fit into 63 bits? How do we know
that there isn't a value which we were not able to encode in 63 bits and
which we had to put in the node table instead?

Wouldn't that optimization give a correct but incomplete answer, with a few
values missing (because they are in the node table)?
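
A toy example of what I am worried about (the 56-bit payload below is an
assumption for illustration, not the actual TDB layout):

    import java.math.BigInteger;

    // Toy illustration: an xsd:integer can be inlined only if it fits in the
    // payload bits of the node id (56 bits is an assumption here, not TDB's
    // actual layout). Anything larger goes to the node table, so a range scan
    // done purely over inlined ids would never see it.
    public class InlineCheck {
        static final int PAYLOAD_BITS = 56;

        static boolean canInline(BigInteger value) {
            return value.bitLength() <= PAYLOAD_BITS - 1; // leave one bit for the sign
        }

        public static void main(String[] args) {
            System.out.println(canInline(BigInteger.valueOf(42)));   // true: inlined
            System.out.println(canInline(
                new BigInteger("123456789012345678901234567890"))); // false: node table
        }
    }

A range filter evaluated only over the inlined ids would silently skip the
second value, and that is exactly the incompleteness I am worried about.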

Please correct me if I am wrong on this. Maybe there is something I do not
fully understand or have not considered.

I would love to be wrong about this and start working on JENA-144 (and close
it), but I am worried that what I described above is a blocker.


>     Andy
