incubator-jena-dev mailing list archives

From Paolo Castagna <>
Subject Re: JENA-44, JENA-45 etc - common Binding I/O
Date Fri, 12 Aug 2011 15:38:03 GMT

Stephen Allen wrote:
> Hi Paolo,
> I was thinking along the same lines in terms of unifying the patches.  Designing an interface
> inspired by Pig's DataBags seems to make sense.  We would need three bag implementations (unordered,
> sorted, and distinct).  A good starting point would be to have each bag proactively spill
> when thresholds are passed rather than rely on an external memory manager.  Pig does in fact have
> these (InternalCachedBag, InternalSortedBag, and InternalDistinctBag [1]).
> I'm not completely happy with the implementation I have for JENA-45.  I'd like to redesign
> it a bit, as well as unify JENA-44.  The design will be guided by Pig, but with some simplifications.
> 1) Create implementations for the 3 bag types
> 2) Bags will be generic and accept serializer objects (to handle the different tuple
> types: Bindings, Triples, and Quads)
> 3) Bags will proactively spill (this greatly simplifies things because there is no need
> to deal with synchronization)
> 4) Spill based on the estimated memory size of the tuples [2] instead of just the cardinality
> (the serializer object can generate the estimate)
> My plan is to try to work on it early next week.

Hi Stephen,
thanks for your response.

I think we should create a new JIRA issue (which JENA-44 and JENA-45 depend on),
just to work on these three types of bags (i.e. unsorted, sorted and distinct)
which will spill to disk once they reach a threshold.
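
To make the idea concrete, here is a minimal sketch of the unsorted variant.
It is only an illustration under stated assumptions: the names LineSerializer
and SpillingBag are hypothetical (they come from neither the JENA-44/JENA-45
patches nor Pig), the threshold is a plain item count, and every tuple is
assumed to serialize to a single line of text with no embedded newline.

```java
import java.io.BufferedWriter;
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: turns one tuple into a single line of text and back.
interface LineSerializer<T> {
    String write(T item);
    T read(String line);
}

// Unsorted bag that spills proactively: the thread calling add() writes the
// in-memory buffer to a temp file as soon as the threshold is reached, so no
// external memory manager (and no synchronization with one) is involved.
final class SpillingBag<T> implements Closeable {
    private final int threshold;                 // assumption: simple item count
    private final LineSerializer<T> serializer;
    private final List<T> memory = new ArrayList<>();
    private final List<Path> spillFiles = new ArrayList<>();

    SpillingBag(int threshold, LineSerializer<T> serializer) {
        this.threshold = threshold;
        this.serializer = serializer;
    }

    void add(T item) {
        memory.add(item);
        if (memory.size() >= threshold) spill();
    }

    int spillCount() { return spillFiles.size(); }

    private void spill() {
        try {
            Path file = Files.createTempFile("bag", ".spill");
            try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                for (T item : memory) {
                    out.write(serializer.write(item));
                    out.newLine();
                }
            }
            spillFiles.add(file);
            memory.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads everything back: spilled files first, then what is still in memory.
    List<T> readAll() {
        List<T> result = new ArrayList<>();
        try {
            for (Path file : spillFiles) {
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    result.add(serializer.read(line));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        result.addAll(memory);
        return result;
    }

    @Override
    public void close() {
        try {
            for (Path file : spillFiles) Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        spillFiles.clear();
        memory.clear();
    }
}
```

A sorted or distinct variant could presumably reuse the same skeleton and
differ only in what happens to the buffer just before it is written out.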

Do we need sorted+distinct as well?

If we create a new issue we can work together on that. Hopefully commit it quickly
and then make progress on JENA-44 and JENA-45 in parallel and independently.


> -Stephen
> [1]
> [2] I plan to estimate the size of the bindings/triples/quads by examining the values
> and retrieving their string lengths.  This will prevent costly serialization until we actually
> have to spill.  We could instead get an average tuple size based on say the first 100 items
> added to the bag.
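
The size estimate in footnote [2] could look roughly like the sketch below.
Everything in it is an assumption made for illustration: the class names
SizeEstimator and MemoryThreshold are hypothetical, a tuple is modelled as a
String[] of value strings, and the two constants (2 bytes per character, 16
bytes of per-value overhead) are rough guesses rather than measured figures.

```java
// Estimates the in-memory size of a tuple from the string lengths of its
// values, without serializing anything.
final class SizeEstimator {
    static final int BYTES_PER_CHAR = 2;       // assumption, not measured
    static final int PER_VALUE_OVERHEAD = 16;  // assumption, not measured

    static long estimate(String[] tuple) {
        long bytes = 0;
        for (String value : tuple) {
            if (value != null) {               // unbound slots cost nothing here
                bytes += PER_VALUE_OVERHEAD + (long) value.length() * BYTES_PER_CHAR;
            }
        }
        return bytes;
    }
}

// Tracks the running estimate and reports when a bag should spill.
final class MemoryThreshold {
    private final long limitBytes;
    private long usedBytes = 0;

    MemoryThreshold(long limitBytes) { this.limitBytes = limitBytes; }

    // Records one added tuple; returns true once the limit has been reached.
    boolean record(String[] tuple) {
        usedBytes += SizeEstimator.estimate(tuple);
        return usedBytes >= limitBytes;
    }

    // Called after a spill, when the in-memory buffer has been emptied.
    void reset() { usedBytes = 0; }
}
```

The alternative mentioned in the footnote, averaging over the first 100 items,
would replace the per-tuple call to estimate() with a multiplication by a
cached average.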
> -----Original Message-----
> From: Paolo Castagna [] 
> Sent: Friday, August 12, 2011 9:30 AM
> To:
> Subject: Re: JENA-44, JENA-45 etc - common Binding I/O
> Hi Andy
> first of all, apologies for the late reply on this.
> Andy Seaborne wrote:
>> On 23/06/11 17:07, Paolo Castagna wrote:
>>> Hi Andy,
>>> first of all, thanks for this.
>>> Re: JENA-44... what is blocking JENA-44 going into trunk is just the
>>> lack of a
>>> common way to serialize binding. By the way, we are using a patched
>>> version of
>>> ARQ on some of our servers (with no problem and improvements in terms of
>>> stability, RAM consumption in particular when users submit queries which
>>> need
>>> to sort large resultsets and they timeout).
>>> So, all this is more than welcome from my point of view (i.e. one patch
>>> less
>>> to manage).
>> Have you looked at the DeferredFileQueue / ThresholdPolicy code in
>> JENA-45?  This is another area of commonality.
> ThresholdPolicyCount can be used for JENA-44 as well.
> Maybe the stuff* and org.openjena.riot.* from
> JENA-45 can be committed so that it can be used for JENA-44 as well.
> However, DeferredFileQueue does not currently provide any way to sort the
> items before spilling them to disk. So, we would need something similar
> but a DeferredSortingFileQueue. Do you agree?
> What we do in ExternalBindingSort is to buffer a certain number of bindings
> (by default 4000), we sort them and write them to disk. Then we repeat with
> the next 4000.
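
The buffer, sort, write, repeat loop described above can be sketched as
follows. This is not the ExternalBindingSort code: RunSorter is a hypothetical
name, plain strings stand in for bindings, and the k-way merge in the second
method is the usual way of reading sorted runs back and is an assumption about
what happens after the runs are written.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class RunSorter {
    // Phase 1: buffer up to bufferSize items, sort them, write one run file, repeat.
    static List<Path> writeSortedRuns(Iterator<String> input, int bufferSize) {
        List<Path> runs = new ArrayList<>();
        List<String> buffer = new ArrayList<>(bufferSize);
        while (input.hasNext()) {
            buffer.add(input.next());
            if (buffer.size() == bufferSize || !input.hasNext()) {
                Collections.sort(buffer);
                try {
                    Path run = Files.createTempFile("sort-run", ".txt");
                    Files.write(run, buffer, StandardCharsets.UTF_8);
                    runs.add(run);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                buffer.clear();
            }
        }
        return runs;
    }

    // One open run file plus the line currently at its head.
    private static final class Head implements Comparable<Head> {
        final BufferedReader reader;
        String line;
        Head(BufferedReader reader, String line) { this.reader = reader; this.line = line; }
        @Override public int compareTo(Head other) { return line.compareTo(other.line); }
    }

    // Phase 2 (assumed): k-way merge of the runs using a priority queue.
    // Collects into a list for brevity; a real version would stream the output
    // and close readers on failure.
    static List<String> merge(List<Path> runs) {
        PriorityQueue<Head> heads = new PriorityQueue<>();
        List<String> out = new ArrayList<>();
        try {
            for (Path run : runs) {
                BufferedReader reader = Files.newBufferedReader(run, StandardCharsets.UTF_8);
                String first = reader.readLine();
                if (first != null) heads.add(new Head(reader, first)); else reader.close();
            }
            while (!heads.isEmpty()) {
                Head head = heads.poll();
                out.add(head.line);
                head.line = head.reader.readLine();
                if (head.line != null) heads.add(head); else head.reader.close();
            }
            for (Path run : runs) Files.deleteIfExists(run);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

With a buffer size of 2 and five input items, phase 1 writes three runs (two
full ones and a final run holding the single leftover item).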
>> Any thoughts about
>> DataBag from Pig?  (JENA-44, comment 24/May/11, pt 3 - this might be
>> too much for this round).
> Something similar to SortedDataBag is what's needed for JENA-44.
> DeferredFileQueue from JENA-45 is similar to DefaultAbstractBag.
> The biggest difference is that Pig uses a SpillableMemoryManager instead
> of fixed thresholds.
> We could start committing JENA-45 and JENA-44 as they are (or with minimal
> changes) with fixed and sensible thresholds and configuration parameters.
> Then we could discuss a more general memory manager system which would need
> to control when to spill to disk. But, I don't see this as a blocker for
> JENA-45 nor JENA-44.
> The DataBag hierarchy from Pig is something we can be inspired by (i.e.
> copy ideas) but the code would need to be changed a lot to adapt to our
> needs.
>> There are various settable parameters - what makes a difference?
>> especially writeBufferSize.
> has the following settable parameters:
>   externalSortBufferSize (default value is 4000)
>   externalSortWorkers (default value is 1)
>   externalSortDir (default to what specified by
> writeBufferSize is set to 10MB.
> Maybe we should make that configurable as well.
> Also, 10MB is perhaps too high with an externalSortBufferSize of only 4000 bindings.
> The aim of having all these parameters configurable via ARQ's symbols is to
> allow people to make experiments and find the optimal configuration for their
> systems.
> I seem to remember a spreadsheet with a few experiments but it could be something
> unrelated to these parameters. In any case I can add a sort of micro-benchmark for
> this to the src-dev area as part of JENA-44.
>> I didn't notice how cancellation would stop executors, only clear up
>> afterwards.  What about a volatile flag?
> Right. Once executors start they run to completion. However, we create a new
> executor every externalSortBufferSize (by default 4000) bindings and only if
> the iterator has not been canceled.
> Yes, we can add a flag to stop executors immediately as soon as the iterator
> gets canceled.
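
The volatile flag suggested above might be wired in along these lines. This is
a sketch only: CancellableWorker is a hypothetical name, the loop body is a
placeholder for sorting and writing one buffer, and the real executors in the
JENA-44 patch may be structured quite differently.

```java
// Worker that checks a volatile flag between units of work, so a cancel()
// issued from another thread is seen promptly instead of only after the
// whole task has run to completion.
final class CancellableWorker implements Runnable {
    private volatile boolean cancelled = false;  // written by the cancelling thread
    private volatile int processed = 0;          // written only by the worker
    private final int total;

    CancellableWorker(int total) { this.total = total; }

    void cancel() { cancelled = true; }

    int processed() { return processed; }

    @Override
    public void run() {
        for (int i = 0; i < total; i++) {
            if (cancelled) return;               // stop as soon as the flag is set
            // Placeholder for the real work on item i.
            processed = i + 1;
        }
    }
}
```

The flag only helps at the points where it is checked, so the granularity of
the loop decides how quickly a running worker actually stops.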
> Paolo
>>>> VARS ?x ?y .
>>>> Set the variables in force for subsequent rows,
>>>> until the next VARS directive.
>>>> We need VARS because it's not always possible to determine all
>>>> the possible variables before starting to write out bindings.
>>> This is not completely clear to me. An example of when it's not possible
>>> to determine all the possible variables before starting to write out
>>> binding
>>> will probably convince me and help me to clarify.
>> Suppose you have an Iterator<Binding> from a LeftJoin or a Union.  One
>> way is to statically determine the variables, the other is to be relaxed
>> and output based on the Bindings seen.  Static analysis requires the
>> info to be passed from query execution into, for example, the heart of
>> The first might have ?x, ?z, the second ?x, ?y, ?z from an OPTIONAL. The
>> separation of the code from the static analysis
>> If you set it once at the start, that also works.
>> And you can concat streams.
>>     Andy
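
The "relaxed" approach, where output is driven by the bindings actually seen,
can be sketched like this. Only the VARS line follows the syntax quoted above;
the row syntax (values separated by spaces and terminated by " .") is an
assumption made for this example, as are the hypothetical class name
VarsWriter and the use of Map<String, String> in place of Binding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class VarsWriter {
    // Emits a VARS directive whenever the variable set of the next binding
    // differs from the one currently in force, then one row per binding.
    static String write(List<Map<String, String>> bindings) {
        StringBuilder out = new StringBuilder();
        List<String> inForce = null;             // no VARS directive seen yet
        for (Map<String, String> binding : bindings) {
            // Sorted variable names give a deterministic column order.
            List<String> vars = new ArrayList<>(new TreeSet<>(binding.keySet()));
            if (!vars.equals(inForce)) {
                out.append("VARS");
                for (String v : vars) out.append(" ?").append(v);
                out.append(" .\n");
                inForce = vars;
            }
            // Assumed row syntax: values in VARS order, then a terminating dot.
            for (String v : vars) out.append(binding.get(v)).append(' ');
            out.append(".\n");
        }
        return out.toString();
    }
}
```

For bindings over ?x ?z followed by one over ?x ?y ?z (as with an OPTIONAL),
the writer emits a second VARS line only at the point where the extra variable
first appears, so no static analysis of the query is needed.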
