incubator-jena-dev mailing list archives

From Paolo Castagna <>
Subject Re: tdbloader3 : time to incorporate in the codebase?
Date Thu, 08 Mar 2012 13:04:12 GMT
Hi Andy

Andy Seaborne wrote:
> Paolo,
>> Both tdbloader3 [1] and tdbloader4 [2] are (should be?) correct,
>> I've been testing them with datasets in the 500-700 million triples
>> range but I consider them (still) *experimental*.
> Is now the right time to incorporate tdbloader3 into the main TDB
> codebase as "tdbloader3"?

Yes. I'll do it, soon after TDB is released and the [VOTE] closes.

It has no additional dependencies, other than TDB. :-)
The tests currently log at INFO level; I need to double check that
and make them silent. There are just 6 of them and they run in
~10 seconds. I also want to check that I am using all the new stuff
to create TDB stuff... but this, again, isn't necessarily something
which needs to be done before we incorporate it.

> It does not disturb anything else (does it?) and makes it more
> accessible to users to try out.

Correct, it does not disturb anything else and it will be easier
for others to try out (and, eventually, use).

The big advantage is that it should scale better on machines
with less RAM. The external sort is pure Java and it's faster
than UNIX sort because we can use binary files instead of text
files to sort our 64-bit node ids.
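
To illustrate the idea (this is a minimal sketch, not the actual
tdbloader3 code), an external sort over a binary file of 64-bit
values could look like the following. The class name
BinaryExternalSort, its methods and the chunkSize parameter are all
hypothetical, and the sketch sorts plain longs rather than whole
tuples of node ids: sort fixed-size chunks in memory, write each one
out as a sorted binary run, then merge the runs with a priority
queue.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Hypothetical sketch: external merge sort over a binary file of 64-bit longs.
public class BinaryExternalSort {

    /** Sorts 'input' into 'output', holding at most chunkSize longs in memory. */
    public static void sort(Path input, Path output, int chunkSize) throws IOException {
        List<Path> runs = new ArrayList<>();
        // Phase 1: read a chunk, sort it in memory, write it out as a sorted run.
        try (DataInputStream in = open(input)) {
            long[] buf = new long[chunkSize];
            int n;
            while ((n = fill(in, buf)) > 0) {
                Arrays.sort(buf, 0, n);
                Path run = Files.createTempFile("run", ".bin");
                try (DataOutputStream out = create(run)) {
                    for (int i = 0; i < n; i++) out.writeLong(buf[i]);
                }
                runs.add(run);
            }
        }
        // Phase 2: k-way merge of the sorted runs.
        merge(runs, output);
        for (Path run : runs) Files.deleteIfExists(run);
    }

    // Reads up to buf.length longs; returns how many were actually read.
    private static int fill(DataInputStream in, long[] buf) throws IOException {
        int n = 0;
        try {
            while (n < buf.length) {
                buf[n] = in.readLong();
                n++;
            }
        } catch (EOFException e) {
            // end of input reached
        }
        return n;
    }

    private static void merge(List<Path> runs, Path output) throws IOException {
        // Each heap entry is {current value, index of the run it came from}.
        PriorityQueue<long[]> heap = new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        List<DataInputStream> ins = new ArrayList<>();
        try (DataOutputStream out = create(output)) {
            for (int i = 0; i < runs.size(); i++) {
                ins.add(open(runs.get(i)));
                advance(ins, heap, i);
            }
            while (!heap.isEmpty()) {
                long[] head = heap.poll();
                out.writeLong(head[0]);
                advance(ins, heap, (int) head[1]);
            }
        } finally {
            for (DataInputStream in : ins) in.close();
        }
    }

    // Pushes the next value of run i onto the heap, if the run has one left.
    private static void advance(List<DataInputStream> ins, PriorityQueue<long[]> heap, int i)
            throws IOException {
        try {
            heap.add(new long[] { ins.get(i).readLong(), i });
        } catch (EOFException e) {
            // run i is exhausted
        }
    }

    private static DataInputStream open(Path p) throws IOException {
        return new DataInputStream(new BufferedInputStream(Files.newInputStream(p)));
    }

    private static DataOutputStream create(Path p) throws IOException {
        return new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(p)));
    }

    /** Helper for trying the sketch out: writes the values as binary longs. */
    public static void writeLongs(Path p, long... values) throws IOException {
        try (DataOutputStream out = create(p)) {
            for (long v : values) out.writeLong(v);
        }
    }

    /** Helper for trying the sketch out: reads a whole file of binary longs. */
    public static long[] readLongs(Path p) throws IOException {
        long[] values = new long[(int) (Files.size(p) / Long.BYTES)];
        try (DataInputStream in = open(p)) {
            for (int i = 0; i < values.length; i++) values[i] = in.readLong();
        }
        return values;
    }
}
```

Every record here is exactly 8 bytes and is compared as a number, so
there is no text parsing or string comparison, which is what makes
binary files attractive compared with sorting text lines. With
chunkSize set to 3 and seven input values, the sketch would write
three runs and merge them in a single pass.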

The drawback is that the first phase, which builds the node table
and the related index (i.e., node2id.idx and node2id.dat), is done
in multiple passes.

> Or ... what does it take for it not to be "experimental"?

I'd like to run a couple more tests with datasets of ~1 billion
triples, but this can happen after tdbloader3 has been incorporated
into TDB.

... and, last but not least, similar tests for tdbloader4 (i.e. the
MapReduce implementation). :-)

Next? Anyone into jCUDA? We all have hundreds of cores in our GPUs
sitting idle most of the time. Maybe sorting stuff there is faster,
even if I don't believe it is going to make much of a difference
for the first phase.

I also want to continue looking into using hash values as node ids...


>     Andy
