Mailing-List: contact jena-dev-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: jena-dev@incubator.apache.org
Received-SPF: pass (athena.apache.org: domain of
 andy.seaborne.apache@gmail.com designates 74.125.82.175 as permitted sender)
Sender: Andy Seaborne <andy.seaborne.apache@gmail.com>
Message-ID: <4E831271.4030604@apache.org>
Date: Wed, 28 Sep 2011 13:26:25 +0100
From: Andy Seaborne <andy@apache.org>
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US;
 rv:1.9.2.21) Gecko/20110831 Thunderbird/3.1.13
MIME-Version: 1.0
To: jena-dev@incubator.apache.org
Subject: Re: Comparators for Triple or Node?
References: <01b001cc7d52$0e442900$2acc7b00$@com>
In-Reply-To: <01b001cc7d52$0e442900$2acc7b00$@com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

On 27/09/11 21:14, Stephen Allen wrote:
> All,
>
> I am working on removing memory limits for CONSTRUCT queries.  Since the
> spec requires that the triples created by the graph template be combined by
> a set union [1] (section 16.2 in the 1.1 WD), it means we need to remove
> duplicates.

Sort of - something has to remove duplicates but if they are being 
inserted into a store that does duplicate suppression then the CONSTRUCT 
need not perfectly.

DISTINCT on the WHERE part would be effective in reducing the amount of 
triples even is there is still the possibility of duplicates.

Other simple techniques like a sliding window of previously generated 
triples might also be a lot easier than perfect duplicate removal.

> As the triples do not need to be ordered (section 16.2.3), it
> seems like DistinctDataBag<Triple>  will fit the bill as a temporary storage
> container/duplicate remover.
>
> The current implementation of DistinctDataBag is implemented as an in-memory
> HashMap that spills sorted data to disk.  When the data is read back using a
> merge-sort, all duplicates will be adjacent, and can thus be easily removed.
>
> Anyway, to make a long story short, I need a Comparator<Triple>  to pass to
> the DistinctDataBag.  Do we have such a thing?  If not, what is the best way
> to implement this?  I have an attempt below, but I need some help on the
> NodeComparator class.

Not directly but there is:

NodeUtils.compareRDFTerms
NodeValue.compareAlways
NodeValue.compare

I think you want NodeValue.compareAlways.


NodeValue.compare backs =, <, > is SPARQL.

This is value-based.

123 > 45
123 < 456

Not everything is comparable. ARQ sorting does impose total order on 
bindings by using NodeValue.compare, then the SPARQL rules for unlike 
objects, then doing a guaranteed-to-have-an-answer 
NodeUtils.compareRDFTerms.

The end result is arbitrary - e.g. bNode labels are compared to order 
bNodes.

So it's value-comparison, then SPARQL type ordering, then a total 
ordering Nodes.

>
> -Stephen
>
> [1] It would be nice if it was a bag union and we had something like
> CONSTRUCT DISTINCT for set.  I'm guessing it is not possible to change the
> spec for backwards compatibility reasons with SPARQL 1.0?
123 < 45
Correct.  And also an RDF graph is a set of triples so it's not just SPARQL.

Other systems (Sesame, Redland) have or had duplicated triple results 
arguing it was then the app to sort it out if it cared (a choice of 
efficiency argument) but these have all gone now, I think.  Users wanted 
no duplicates.

	Andy

>
> ==
>
> public class TripleComparator implements Comparator<Triple>
> {
>      private final NodeComparator nc = new NodeComparator();
>
>      public int compare(Triple o1, Triple o2)
>      {
>          int toReturn = nc.compare(o1.getSubject(), o2.getSubject());
>          if (toReturn == 0)
>          {
>              toReturn = nc.compare(o1.getPredicate(), o2.getPredicate());
>              if (toReturn == 0)
>              {
>                  toReturn = nc.compare(o1.getObject(), o2.getObject());
>              }
>          }
>
>          return toReturn;
>      }
> }
>
>
> public class NodeComparator implements Comparator<Node>
> {
>      public int compare(Node o1, Node o2)
>      {
>          // TODO: What should I be doing here?
>          return
> o1.getIndexingValue().toString().compareTo(o2.getIndexingValue().toString())
> ;
>      }
> }
>
>