incubator-jena-dev mailing list archives

From Paolo Castagna <>
Subject Re: JENA-44, JENA-45 etc - common Binding I/O
Date Fri, 12 Aug 2011 13:29:50 GMT
Hi Andy
first of all, apologies for the late reply on this.

Andy Seaborne wrote:
> On 23/06/11 17:07, Paolo Castagna wrote:
>> Hi Andy,
>> first of all, thanks for this.
>> Re: JENA-44... what is blocking JENA-44 going into trunk is just the
>> lack of a
>> common way to serialize binding. By the way, we are using a patched
>> version of
>> ARQ on some of our servers (with no problem and improvements in terms of
>> stability, RAM consumption in particular when users submit queries which
>> need
>> to sort large resultsets and they timeout).
>> So, all this is more than welcome from my point of view (i.e. one patch
>> less
>> to manage).
> Have you looked at the DeferredFileQueue / ThresholdPolicy code in
> JENA-45?  This is another area of commonality.

ThresholdPolicyCount can be used for JENA-44 as well.

Maybe the stuff* and org.openjena.riot.* from
JENA-45 can be committed so that it can be used for JENA-44 as well.

However, DeferredFileQueue does not currently provide any way to sort the
items before spilling them to disk. So we would need something similar that
sorts first, say a DeferredSortingFileQueue. Do you agree?

ExternalBindingSort buffers a fixed number of bindings (4000 by default),
sorts them, and writes them to disk. It then repeats with the next 4000.
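
To make the buffer/sort/spill cycle concrete, here is a minimal sketch of the
general technique (sorted runs on disk, followed by a k-way merge). It is not
the JENA-44 patch: ChunkedExternalSort and its methods are hypothetical names,
and it sorts plain strings (one per line) instead of bindings so that it has
no ARQ dependency.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

// Hypothetical stand-in for the sorting step: it sorts plain strings (one per
// line, no embedded newlines) rather than ARQ bindings.
class ChunkedExternalSort {

    // Reads input in chunks of bufferSize, spills each sorted chunk to a
    // temporary file, then merges the files and feeds the result to sink.
    static void sort(Iterator<String> input, int bufferSize, Consumer<String> sink) {
        List<Path> runs = new ArrayList<>();
        try {
            List<String> buffer = new ArrayList<>(bufferSize);
            while (input.hasNext()) {
                buffer.add(input.next());
                if (buffer.size() >= bufferSize) {
                    runs.add(spill(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                runs.add(spill(buffer));
            }
            merge(runs, sink);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            // Best-effort cleanup of the temporary run files.
            for (Path run : runs) {
                try { Files.deleteIfExists(run); } catch (IOException ignored) { }
            }
        }
    }

    // Sorts one chunk in memory and writes it out as a sorted run.
    private static Path spill(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path run = Files.createTempFile("sort-run-", ".txt");
        Files.write(run, buffer);
        return run;
    }

    // A reader over one run plus the line it is currently positioned on.
    private static final class Cursor {
        final BufferedReader reader;
        String line;

        Cursor(BufferedReader reader) { this.reader = reader; }

        // Moves to the next line; closes the reader once the run is exhausted.
        boolean advance() throws IOException {
            line = reader.readLine();
            if (line == null) reader.close();
            return line != null;
        }
    }

    // K-way merge: repeatedly emit the smallest current line across all runs.
    private static void merge(List<Path> runs, Consumer<String> sink) throws IOException {
        PriorityQueue<Cursor> queue =
                new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.line));
        for (Path run : runs) {
            Cursor c = new Cursor(Files.newBufferedReader(run));
            if (c.advance()) queue.add(c);
        }
        while (!queue.isEmpty()) {
            Cursor c = queue.poll();
            sink.accept(c.line);
            if (c.advance()) queue.add(c);
        }
    }
}
```

Passing the results to a sink rather than returning a list keeps the merged
output from being held in memory, which is the point of spilling in the first
place. A real version would take a Comparator for bindings instead of relying
on natural string order.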

> Any thoughts about
> DataBag from Pig?  (JENA-44, comment 24/May/11, pt 3 - this might be
> too much for this round).

Something similar to SortedDataBag is what's needed for JENA-44.
DeferredFileQueue from JENA-45 is similar to DefaultAbstractBag.
The biggest difference is that Pig uses a SpillableMemoryManager instead
of fixed thresholds.

We could start committing JENA-45 and JENA-44 as they are (or with minimal
changes) with fixed and sensible thresholds and configuration parameters.

Then we could discuss a more general memory manager which would control
when to spill to disk. But I don't see this as a blocker for either JENA-45
or JENA-44.

The DataBag hierarchy from Pig is something we can be inspired by (i.e.
copy ideas) but the code would need to be changed a lot to adapt to our

> There are various settable parameters - what makes a difference?
> especially writeBufferSize.

has the following settable parameters:

  externalSortBufferSize (default value is 4000)
  externalSortWorkers (default value is 1)
  externalSortDir (default to what specified by

writeBufferSize is set to 10MB.
Maybe we should make that configurable as well.

Also, 10MB is perhaps too high with an externalSortBufferSize of only 4000 bindings.

The aim of having all these parameters configurable via ARQ's symbols is to
allow people to run experiments and find the optimal configuration for their

I seem to remember a spreadsheet with a few experiments but it could be something
unrelated to these parameters. In any case I can add a sort of micro-benchmark for
this to the src-dev area as part of JENA-44.

> I didn't notice how cancellation would stop executors, only clear up
> afterwards.  What about a volatile flag?

Right. Once executors start, they run to completion. However, we create a new
executor every externalSortBufferSize (by default 4000) bindings, and only if
the iterator has not been canceled.

Yes, we can add a flag to stop executors immediately as soon as the iterator
gets canceled.
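
A minimal sketch of what such a flag could look like, assuming each executor
task loops over its buffer and can check the flag between items. The class
CancellableSpillTask and its methods are hypothetical, not code from the
patch, and the "write to disk" step is only a placeholder counter.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical worker: stands in for an executor task that sorts and writes
// one buffer of bindings. Items are plain strings for illustration.
class CancellableSpillTask implements Runnable {
    // volatile so that a cancel() issued from another thread is visible
    // to the thread executing run().
    private volatile boolean cancelled = false;
    private final List<String> items;
    private final AtomicInteger processed = new AtomicInteger();

    CancellableSpillTask(List<String> items) {
        this.items = items;
    }

    // Called by the iterator when the query is canceled.
    void cancel() {
        cancelled = true;
    }

    // Number of items handled before the task finished or was canceled.
    int processed() {
        return processed.get();
    }

    @Override
    public void run() {
        for (String item : items) {
            if (cancelled) {
                return; // stop between items instead of running to completion
            }
            processed.incrementAndGet(); // placeholder for "write item to disk"
        }
    }
}
```

The task only notices cancellation between items, so a long in-memory sort
of one buffer would still run to the end; with 4000 bindings per buffer that
is probably acceptable.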


>>> VARS ?x ?y .
>>> Set the variables in force for subsequent rows,
>>> until the next VARS directive.
>>> We need VARS because it's not always possible to determine all
>>> the possible variables before starting to write out bindings.
>> This is not completely clear to me. An example of when it's not possible
>> to determine all the possible variables before starting to write out
>> binding
>> will probably convince me and help me to clarify.
> Suppose you have an Iterator<Binding> from a LeftJoin or a Union.  One
> way is to statically determine the variables, the other is to be relaxed
> and output based on the Bindings seen.  Static analysis requires the
> info to be passed from query execution into, for example, the heart of
> The first might have ?x, ?z, the second ?x, ?y, ?z from an OPTIONAL. The
> separation of the code from the static analysis
> If you set it once at the start, that also works.
> And you can concat streams.
>     Andy
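
To check my understanding of the "relaxed" approach, here is a small sketch
of a writer that emits a new VARS line only when the variables of the next
row differ from those in force. Everything beyond the "VARS ?x ?y ." line is
an assumption for illustration: the class VarsRowWriter, the use of ordered
maps as rows, and the row syntax (values separated by spaces and terminated
by " .") are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical writer: rows are maps from variable name to an already
// serialized value. Maps must have a stable iteration order
// (e.g. LinkedHashMap), since that order defines the column order.
class VarsRowWriter {

    static String write(List<Map<String, String>> rows) {
        StringBuilder out = new StringBuilder();
        List<String> current = null; // variables currently in force
        for (Map<String, String> row : rows) {
            List<String> vars = new ArrayList<>(row.keySet());
            if (!vars.equals(current)) {
                // Variables changed: emit a new VARS directive.
                out.append("VARS");
                for (String v : vars) {
                    out.append(" ?").append(v);
                }
                out.append(" .\n");
                current = vars;
            }
            // Emit the row values in the order given by the VARS line.
            for (String v : current) {
                out.append(row.get(v)).append(' ');
            }
            out.append(".\n");
        }
        return out.toString();
    }
}
```

With a first row binding ?x and ?z and a second binding ?x, ?y and ?z (the
OPTIONAL case), this writes two VARS lines. A following row with the same
three variables adds no further directive. Because each VARS line resets the
variables in force, two such outputs can be concatenated.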
