Return-Path: X-Original-To: apmail-incubator-jena-dev-archive@minotaur.apache.org Delivered-To: apmail-incubator-jena-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id A397278C6 for ; Wed, 28 Sep 2011 12:26:57 +0000 (UTC) Received: (qmail 37655 invoked by uid 500); 28 Sep 2011 12:26:57 -0000 Delivered-To: apmail-incubator-jena-dev-archive@incubator.apache.org Received: (qmail 37634 invoked by uid 500); 28 Sep 2011 12:26:57 -0000 Mailing-List: contact jena-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: jena-dev@incubator.apache.org Delivered-To: mailing list jena-dev@incubator.apache.org Received: (qmail 37626 invoked by uid 99); 28 Sep 2011 12:26:57 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 28 Sep 2011 12:26:57 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=FREEMAIL_FROM,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of andy.seaborne.apache@gmail.com designates 74.125.82.175 as permitted sender) Received: from [74.125.82.175] (HELO mail-wy0-f175.google.com) (74.125.82.175) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 28 Sep 2011 12:26:49 +0000 Received: by wyh5 with SMTP id 5so8758978wyh.6 for ; Wed, 28 Sep 2011 05:26:28 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=sender:message-id:date:from:user-agent:mime-version:to:subject :references:in-reply-to:content-type:content-transfer-encoding; bh=pyI/PFEEAVWCZ7BCJeUzi3htofWwGg0xXfJywfnx0I4=; b=Qzxs79GBm+72LRozxMJorYeeagvxsa+1W6l8QfxrBc9wQChfeI3awTA8geIm4Dd+Fs a1L0bN6tpppjPy+jqf/kDdVBqjfT4BVUZ8/TNC0QYFrWtrTQfzfem8cjziQm8yNBywSg 3UJfUyH95gADDGIU1pgYpOuqtocsaegsFsivs= Received: by 10.227.60.132 with SMTP id p4mr9034641wbh.40.1317212788112; Wed, 28 Sep 2011 05:26:28 -0700 (PDT) Received: from [192.168.1.33] (82-69-1-248.dsl.in-addr.zen.co.uk. [82.69.1.248]) by mx.google.com with ESMTPS id gg18sm7044150wbb.26.2011.09.28.05.26.26 (version=SSLv3 cipher=OTHER); Wed, 28 Sep 2011 05:26:27 -0700 (PDT) Sender: Andy Seaborne Message-ID: <4E831271.4030604@apache.org> Date: Wed, 28 Sep 2011 13:26:25 +0100 From: Andy Seaborne User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.21) Gecko/20110831 Thunderbird/3.1.13 MIME-Version: 1.0 To: jena-dev@incubator.apache.org Subject: Re: Comparators for Triple or Node? References: <01b001cc7d52$0e442900$2acc7b00$@com> In-Reply-To: <01b001cc7d52$0e442900$2acc7b00$@com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit On 27/09/11 21:14, Stephen Allen wrote: > All, > > I am working on removing memory limits for CONSTRUCT queries. Since the > spec requires that the triples created by the graph template be combined by > a set union [1] (section 16.2 in the 1.1 WD), it means we need to remove > duplicates. Sort of - something has to remove duplicates but if they are being inserted into a store that does duplicate suppression then the CONSTRUCT need not perfectly. DISTINCT on the WHERE part would be effective in reducing the amount of triples even is there is still the possibility of duplicates. Other simple techniques like a sliding window of previously generated triples might also be a lot easier than perfect duplicate removal. > As the triples do not need to be ordered (section 16.2.3), it > seems like DistinctDataBag will fit the bill as a temporary storage > container/duplicate remover. > > The current implementation of DistinctDataBag is implemented as an in-memory > HashMap that spills sorted data to disk. When the data is read back using a > merge-sort, all duplicates will be adjacent, and can thus be easily removed. > > Anyway, to make a long story short, I need a Comparator to pass to > the DistinctDataBag. Do we have such a thing? If not, what is the best way > to implement this? I have an attempt below, but I need some help on the > NodeComparator class. Not directly but there is: NodeUtils.compareRDFTerms NodeValue.compareAlways NodeValue.compare I think you want NodeValue.compareAlways. NodeValue.compare backs =, <, > is SPARQL. This is value-based. 123 > 45 123 < 456 Not everything is comparable. ARQ sorting does impose total order on bindings by using NodeValue.compare, then the SPARQL rules for unlike objects, then doing a guaranteed-to-have-an-answer NodeUtils.compareRDFTerms. The end result is arbitrary - e.g. bNode labels are compared to order bNodes. So it's value-comparison, then SPARQL type ordering, then a total ordering Nodes. > > -Stephen > > [1] It would be nice if it was a bag union and we had something like > CONSTRUCT DISTINCT for set. I'm guessing it is not possible to change the > spec for backwards compatibility reasons with SPARQL 1.0? 123 < 45 Correct. And also an RDF graph is a set of triples so it's not just SPARQL. Other systems (Sesame, Redland) have or had duplicated triple results arguing it was then the app to sort it out if it cared (a choice of efficiency argument) but these have all gone now, I think. Users wanted no duplicates. Andy > > == > > public class TripleComparator implements Comparator > { > private final NodeComparator nc = new NodeComparator(); > > public int compare(Triple o1, Triple o2) > { > int toReturn = nc.compare(o1.getSubject(), o2.getSubject()); > if (toReturn == 0) > { > toReturn = nc.compare(o1.getPredicate(), o2.getPredicate()); > if (toReturn == 0) > { > toReturn = nc.compare(o1.getObject(), o2.getObject()); > } > } > > return toReturn; > } > } > > > public class NodeComparator implements Comparator > { > public int compare(Node o1, Node o2) > { > // TODO: What should I be doing here? > return > o1.getIndexingValue().toString().compareTo(o2.getIndexingValue().toString()) > ; > } > } > >