incubator-jena-dev mailing list archives

From "Andy Seaborne (JIRA)" <>
Subject [jira] [Commented] (JENA-99) Spill to disk data bags
Date Tue, 16 Aug 2011 13:41:27 GMT


Andy Seaborne commented on JENA-99:

1/ Not having the parallel sort-merge in JENA-44 

a) What happens if there is no disk?  (I haven't had the chance to work through all the code
yet - for JENA-44 and JENA-45, coping when running on an in-memory dataset is important for
reduced support costs).

d) This adds functionality without affecting anything else so I have applied the patch and
we can discuss the code in-place. All the clearing up related to it looks good to have anyway.

A/ Looking at BindingComparator.compareBindingsSyntactic:

Currently it builds a list which is the concatenation of the variable lists of the two bindings.  If
any variables are in common, it's going to have duplicates.  I think this needs to be a unique

Maybe we can avoid the sorting of variables by defining a comparison that uses length and
only worries about variable order when two bindings are found to have the same length but
different variable lists.

The only use is when a SortCondition isn't sufficiently separating on two Bindings - we can
put in place an arbitrary choice providing it's a well-defined, stable condition.

In other words, the order space is sectioned by the number of variables in each binding, with
fewer variables comparing before more variables.

NodeUtils.compareRDFTerms copes with passing in nulls but we'd have to test in the compare
loop for any nulls and drop to full comparison.
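
A minimal sketch of the size-first ordering suggested above, not the code in the patch. It assumes a hypothetical stand-in for a binding (a SortedMap from variable name to term, both plain strings, with natural key ordering and no null terms) rather than ARQ's Binding; the class name SizeFirstComparator and the helper binding(...) are made up for illustration. Variable names are only inspected when two bindings have the same size but different variable lists.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical stand-in for a binding: variable name -> term, both plain strings.
// This is NOT ARQ's Binding; a SortedMap keeps the variable names in a fixed order.
final class SizeFirstComparator implements Comparator<SortedMap<String, String>> {

    @Override
    public int compare(SortedMap<String, String> a, SortedMap<String, String> b) {
        // 1. Section the order space by size: fewer variables sort first.
        int bySize = Integer.compare(a.size(), b.size());
        if (bySize != 0) return bySize;

        // 2. Same size: look at variable names only if the variable lists differ.
        if (!a.keySet().equals(b.keySet())) {
            Iterator<String> va = a.keySet().iterator();
            Iterator<String> vb = b.keySet().iterator();
            while (va.hasNext()) {
                int byVar = va.next().compareTo(vb.next());
                if (byVar != 0) return byVar;
            }
        }

        // 3. Same variables: compare the terms, variable by variable, in key order.
        for (String v : a.keySet()) {
            int byTerm = a.get(v).compareTo(b.get(v));
            if (byTerm != 0) return byTerm;
        }
        return 0;
    }

    // Illustration-only helper: builds a binding from name/term pairs.
    static SortedMap<String, String> binding(String... pairs) {
        SortedMap<String, String> m = new TreeMap<>();
        for (int i = 0; i + 1 < pairs.length; i += 2) m.put(pairs[i], pairs[i + 1]);
        return m;
    }

    public static void main(String[] args) {
        SizeFirstComparator cmp = new SizeFirstComparator();
        System.out.println(cmp.compare(binding("z", "1"), binding("a", "1", "b", "2")));
        System.out.println(cmp.compare(binding("a", "1", "b", "2"), binding("a", "1", "c", "0")));
    }
}
```

Because the keys are already held in sorted order, no per-comparison sort and no concatenated variable list is needed, and the result is a stable, well-defined tie-break. Handling unbound (null) terms would need an extra check in the last loop.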

B/ SerializationFactoryFinder does not seem to be used.

C/ I moved SerializationFactory ->

D/ BagFactory: I renamed <Tuple> -> <T>

Just a preference, but there are already "Tuples" elsewhere and I was confused for a while;
I am also used to single capital letters for generic type parameters.

> Spill to disk data bags
> -----------------------
>                 Key: JENA-99
>                 URL:
>             Project: Jena
>          Issue Type: New Feature
>          Components: ARQ
>            Reporter: Stephen Allen
>         Attachments: JENA-99-r1157891.patch
> For certain query operations, ARQ needs to store a large number of tuples temporarily.
Currently these are stored in Java Collections; however, for large result sets the system
can exhaust the available memory.  There is a need for a set of generic data structures that
can hold these tuples and spill to disk if they get too large.
> ==
> The design is inspired by Apache Pig's DataBag [1]:
> A DataBag is a collection of tuples. A DataBag may or may not fit into memory. It proactively
spills to disk when its size exceeds the threshold. When it spills, it takes whatever it has
in memory, opens a spill file, and writes the contents out. This may happen multiple times.
The bag tracks all of the files it's spilled to. The spill behavior is controlled by a ThresholdPolicy
object.  The most basic policy spills based on the number of tuples added.  A more advanced
policy is to estimate the size of all the tuples added to the DataBag and spill when it passes
a byte threshold.
> A DataBag provides an Iterator interface, that allows callers to read through the contents.
The iterators are aware of the data spilling. They have to be able to handle reading from
the spill files. 
> The DataBag interface assumes that all data is written before any is read. That is, a
DataBag cannot be used as a queue. If data is written after data is read, the results are
> DataBags come in several types, default, sorted, and distinct. The type must be chosen
up front; there is no way to convert a bag on the fly. Default data bags do not guarantee
any particular order of retrieval for the tuples and may contain duplicate tuples. Sorted
data bags guarantee that tuples will be retrieved in order, where "in order" is defined either
by the default comparator for the tuple or the comparator provided by the caller when the
bag was created. Sorted bags may contain duplicates. Distinct bags do not guarantee any particular
order of retrieval, but do guarantee that they will not contain duplicate tuples. 
> The DataBags are generic containers, and may store any item that can be serialized and
deserialized.  Each bag accepts a SerializationFactory that handles this task.
> [1]
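
To make the spill behaviour described above concrete, here is a minimal sketch of a default (unsorted, duplicates allowed) bag using the most basic policy, a count threshold. It is not the code in the attached patch and not Pig's DataBag: the class name SpillingBag, its methods, and the use of Java serialization in place of a SerializationFactory are all assumptions made for illustration.

```java
import java.io.*;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch: spills the in-memory items to a new temp file whenever
// the number of items held in memory reaches the threshold.
final class SpillingBag<T extends Serializable> implements Closeable {
    private final int threshold;                       // count-based policy, must be >= 1
    private final List<T> memory = new ArrayList<>();
    private final List<File> spillFiles = new ArrayList<>();

    SpillingBag(int threshold) { this.threshold = threshold; }

    void add(T item) throws IOException {
        memory.add(item);
        if (memory.size() >= threshold) spill();
    }

    int spillFileCount() { return spillFiles.size(); }

    // Write everything held in memory to a fresh spill file: a count, then the items.
    private void spill() throws IOException {
        File f = File.createTempFile("bag", ".spill");
        f.deleteOnExit();
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            out.writeInt(memory.size());
            for (T item : memory) out.writeObject(item);
        }
        spillFiles.add(f);
        memory.clear();
    }

    // Reads the spill files one at a time, then the items still in memory.
    // All writes must happen before this is called.
    Iterator<T> iterator() {
        return new Iterator<T>() {
            private int nextFile = 0;
            private int remainingInFile = 0;
            private ObjectInputStream in;
            private final Iterator<T> inMemory = memory.iterator();

            @Override public boolean hasNext() {
                try {
                    while (remainingInFile == 0 && nextFile < spillFiles.size()) {
                        if (in != null) in.close();
                        in = new ObjectInputStream(new BufferedInputStream(
                                new FileInputStream(spillFiles.get(nextFile++))));
                        remainingInFile = in.readInt();
                    }
                    if (remainingInFile == 0 && in != null) { in.close(); in = null; }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                return remainingInFile > 0 || inMemory.hasNext();
            }

            @SuppressWarnings("unchecked")
            @Override public T next() {
                if (!hasNext()) throw new NoSuchElementException();
                if (remainingInFile == 0) return inMemory.next();
                try {
                    remainingInFile--;
                    return (T) in.readObject();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                } catch (ClassNotFoundException e) {
                    throw new IllegalStateException(e);
                }
            }
        };
    }

    // Remove the spill files; the bag must not be used afterwards.
    @Override public void close() {
        for (File f : spillFiles) f.delete();
        spillFiles.clear();
        memory.clear();
    }

    public static void main(String[] args) throws IOException {
        try (SpillingBag<Integer> bag = new SpillingBag<>(2)) {
            for (int i = 1; i <= 5; i++) bag.add(i);
            for (Iterator<Integer> it = bag.iterator(); it.hasNext(); ) System.out.println(it.next());
        }
    }
}
```

A sorted bag along the same lines would sort each batch before spilling and merge the spill files on read; a byte-size policy would replace the count check in add.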
