incubator-jena-dev mailing list archives

From: Andy Seaborne <>
Subject: TDB and hash ids
Date: Thu, 03 Nov 2011 13:51:12 GMT
On 03/11/11 13:19, Paolo Castagna wrote:

>  From my experience with tdbloader3 and parallel processing, I'd say
> that the fact that node ids (currently 64 bits) are offsets into the
> nodes.dat file is a big "impediment" to distributed/parallel
> processing, mainly because, whatever you do, you first have to build
> a dictionary, and that is not trivial to do in parallel.

Loading, I agree; but general distributed/parallel processing?

> However, if we could, given an RDF node value, generate a node id
> with a hash function (sufficiently big that the probability of a
> collision is less than that of being hit by an asteroid: 128 bits?),
> then tdbloader3 could be massively simplified and merging TDB indexes
> directly would become trivial (as for Lucene indexes)... my life at
> work would be so much simpler!

Are you going to test out using hashes as ids in TDB?

It needs someone to actually try it out in an experimental branch.
What about the consequences of turning ids back into Nodes for results
(which could be done in parallel with much of query evaluation)?
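
For anyone picking this up: a minimal sketch of one way to derive a
128-bit id directly from a node's lexical form. The class and method
names are hypothetical, not TDB's API, and MD5 is chosen only because
it is a readily available 128-bit digest, not as a recommendation:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashNodeId {
    // Derive a 128-bit id (16 bytes) from a node's canonical lexical
    // form. Any well-distributed 128-bit hash would serve equally well.
    public static byte[] hash128(String canonicalForm) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return md.digest(canonicalForm.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is guaranteed to be present in every JDK.
            throw new RuntimeException(e);
        }
    }
}

The point being that two loaders hashing the same node independently
produce the same id, so there is no shared dictionary to coordinate.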

> The drawback of 128-bit node ids is that suddenly you might need to
> double your RAM to achieve the same performance (to be proven and
> verified with experiments). However, there are many other good things
> you can fit into 128 bits. For example, I am no longer sure that an
> optimization such as the one proposed in JENA-144 is possible without
> ensuring that all node values can be encoded in the bits available in
> the node id:

Do the calculation on clash probabilities.
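
A back-of-the-envelope version, using the usual birthday bound
p ~= n^2 / 2^(b+1) for n distinct nodes hashed into b bits (the figure
of 10^12 nodes is an assumption for illustration, not a measurement):

public class ClashProbability {
    public static void main(String[] args) {
        double n = 1e12;      // assume a trillion distinct nodes
        double bits = 128;    // id width under discussion
        // Birthday approximation: p ~= n^2 / 2^(bits+1)
        double p = (n * n) / Math.pow(2, bits + 1);
        System.out.println(p);  // prints ~1.5E-15
    }
}

At roughly 1.5e-15 for a trillion nodes, that does look more
asteroid-grade than anything to lose sleep over, but it should be
checked properly for the sizes people actually load.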


Re: JENA-144:

?? Use the same scheme as at present: section the id space into values
and, separately, hashes. Cost: one bit.
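
A sketch of that sectioning, shown on a single 64-bit long for brevity
(the flag and method names are illustrative only, not the existing TDB
NodeId layout):

public class SectionedId {
    // Top bit selects the section: 0 = inlined value, 1 = hash.
    private static final long HASH_FLAG = 1L << 63;

    public static long inlineValue(long value) {
        return value & ~HASH_FLAG;   // value must fit in 63 bits here
    }

    public static long hashId(long hashBits) {
        return hashBits | HASH_FLAG;
    }

    public static boolean isHash(long id) {
        return (id & HASH_FLAG) != 0;
    }
}

With 128-bit ids the same trick applies, leaving 127 bits for either
an inlined value or a truncated hash.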

