Mailing-List: contact jena-dev-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: jena-dev@incubator.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Message-ID: <4E664586.4000308@epimorphics.com>
Date: Tue, 06 Sep 2011 17:08:38 +0100
From: Andy Seaborne <andy.seaborne@epimorphics.com>
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US;
 rv:1.9.2.21) Gecko/20110831 Thunderbird/3.1.13
MIME-Version: 1.0
To: jena-dev@incubator.apache.org
Subject: LD-Access (was: On SPARQL queries with ORDER BY + OFFSET + LIMIT)
References: <4E613E03.5060702@googlemail.com>
In-Reply-To: <4E613E03.5060702@googlemail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

On 02/09/11 21:35, Paolo Castagna wrote:
> I saw the "pseudo paging via query analysis of ORDER BY / OFFSET / LIMIT
> access patterns" idea in https://github.com/afs/LD-Access and it's great.

The LD-Access project, which is just a play area at the moment, is a 
collection of ideas for caching SPARQL queries, particularly the ones 
that linked-data-api systems tend to generate.

First - it needs a name.  Err ... "Luna".

The starting point is to have an intercepting SPARQL endpoint for 
another endpoint.  The application uses the cache URL for SPARQL queries.

It's not supposed to be a big project - it might be a servlet SPARQL 
endpoint and/or yet-another query engine implementation.

1/ Same query.

The most obvious is repeating the same query.  It's sometimes surprising 
just how often a query is repeated, across users (e.g. the same starting 
point of an app), but even by the same user.  Having a close cache is 
noticeably faster than going to a remote endpoint.

HTTP caching also catches this, or it would. A GET with query string 
isn't cached by squid apparently.
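
A minimal sketch of the same-query case, assuming the proxy is handed a 
`fetch` callable that talks to the real endpoint (the class and function 
names here are made up for illustration, they are not LD-Access code). 
The key is the endpoint plus the query text with only leading and 
trailing whitespace removed; anything cleverer would need a real parser 
so that string literals are left alone.

```python
import hashlib


def cache_key(endpoint, query):
    """Key on the endpoint plus the query text, trimmed at both ends only."""
    text = f"{endpoint}\n{query.strip()}"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class QueryCache:
    """In-memory cache placed in front of a remote SPARQL endpoint."""

    def __init__(self, fetch):
        self._fetch = fetch      # hypothetical callable(endpoint, query) -> results
        self._store = {}

    def execute(self, endpoint, query):
        key = cache_key(endpoint, query)
        if key not in self._store:
            # Miss: go to the remote endpoint once and remember the answer.
            self._store[key] = self._fetch(endpoint, query)
        return self._store[key]
```

There is no eviction or expiry here; that is a separate decision (see 
the cache validation remarks at the end).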

2/ Convert formats

Given a proxy that is looking at the request, the response can be 
converted between formats, for example from SPARQL XML results to 
SPARQL JSON results.
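
As an illustration of the conversion, here is a rough sketch that turns 
a SELECT result in the SPARQL XML results format into the JSON results 
format using only the Python standard library.  It assumes a well-formed 
SELECT result and ignores ASK results and `<link>` elements; the 
function names are invented for this example.

```python
import json
import xml.etree.ElementTree as ET

SR = "{http://www.w3.org/2005/sparql-results#}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def term_to_json(term):
    """Convert a <uri>, <literal> or <bnode> element to its JSON form."""
    kind = term.tag[len(SR):]
    out = {"type": kind, "value": term.text or ""}
    if kind == "literal":
        if term.get(XML_LANG):
            out["xml:lang"] = term.get(XML_LANG)
        if term.get("datatype"):
            out["datatype"] = term.get("datatype")
    return out


def xml_results_to_json(xml_text):
    """Convert a SELECT result set from SPARQL XML to SPARQL JSON text."""
    root = ET.fromstring(xml_text)
    variables = [v.get("name") for v in root.findall(f"{SR}head/{SR}variable")]
    bindings = []
    for result in root.findall(f"{SR}results/{SR}result"):
        row = {}
        for binding in result.findall(f"{SR}binding"):
            # Each <binding> holds exactly one term element.
            row[binding.get("name")] = term_to_json(binding[0])
        bindings.append(row)
    return json.dumps({"head": {"vars": variables},
                       "results": {"bindings": bindings}})
```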

3/ Paging.

The idiom of a sequence of SELECT / ORDER BY / OFFSET / LIMIT calls with 
changes in OFFSET to get different slices of a result set happens in 
linked data apps (and others).

We've been optimizing these in ARQ using "top N" queries but LD-Access 
can offer facilities at a different granularity.  Catch that query, 
issue the full SELECT / ORDER BY query, cache the results.  Then you can 
slice the results as pages without going back to the server.

One side effect of this is paging without sorting, another is moving 
sorting away from the origin server.

Sorting is expensive but it's needed to guarantee stability of the 
result set being sliced into pages.  So issue the query as a plain 
SELECT and either sort locally (you get to choose the resources 
available) to get the same sorted, pageable results or, if ordering is 
only there for stability, just remove the ORDER BY and replace it with a 
promise to slice from an unchanging result set.
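
The paging idea can be sketched roughly as follows.  This is not how ARQ 
or LD-Access does it: a real implementation would work on the parsed 
query, whereas this toy peels a trailing LIMIT/OFFSET off the query 
string with a regular expression.  `fetch` is an assumed callable that 
runs a query remotely and returns a list of rows.

```python
import re

# Matches one trailing "LIMIT n" or "OFFSET n" at the very end of the query.
_SLICE = re.compile(r"\s+(LIMIT|OFFSET)\s+(\d+)\s*$", re.IGNORECASE)


def split_slice(query):
    """Return (base_query, offset, limit) with trailing LIMIT/OFFSET removed."""
    offset, limit = 0, None
    base = query
    for _ in range(2):               # at most one LIMIT and one OFFSET
        match = _SLICE.search(base)
        if not match:
            break
        if match.group(1).upper() == "LIMIT":
            limit = int(match.group(2))
        else:
            offset = int(match.group(2))
        base = base[:match.start()]
    return base.strip(), offset, limit


class PagingCache:
    """Fetch the full result once, then serve pages by slicing locally."""

    def __init__(self, fetch):
        self._fetch = fetch          # hypothetical callable(query) -> list of rows
        self._full = {}

    def execute(self, query):
        base, offset, limit = split_slice(query)
        if base not in self._full:
            self._full[base] = self._fetch(base)
        rows = self._full[base]
        end = None if limit is None else offset + limit
        return rows[offset:end]
```

Every page after the first is served from the cached list, which is the 
"promise to slice from an unchanging result set".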

4/ Add reliability/predictability.

Defend the app from bad data - always get the entire results back to 
check they will all be available before responding to the client.

Or add query timeouts.

Or fix the format if it isn't quite correct.

5/ Intermittent endpoints.

It's hard to run a public endpoint on the open web.  dbpedia is not 
always up, and if it's up, then it's busy because of other requests.

dbpedia has a (necessary) defensive query execution timeout - it is 
easier to get a query to run late at (American) night than in the 
European afternoon.  Why not issue the queries for resources you want to 
track in 
a batch script and pick up the results during the day?  Doesn't work for 
all situations but it can be useful.

6/ Resource caching.

1-5 are about SPARQL queries, mainly SELECT.  What about caching data 
about resources (all triples with the same subject)?

Break up a DESCRIBE query into pattern and resources, issue the pattern, 
see what resources it will describe and only get ones not cached.  This 
might be a loss as it is a double round trip.
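
A sketch of that two-step DESCRIBE, with both remote calls passed in as 
assumed callables (`select_resources` runs the pattern and returns the 
URIs it matches, `describe_remote` fetches the triples for a list of 
URIs).  None of these names come from LD-Access.

```python
class ResourceCache:
    """Cache descriptions per resource; only fetch the ones not yet seen."""

    def __init__(self, select_resources, describe_remote):
        self._select = select_resources    # hypothetical callable(pattern) -> URIs
        self._describe = describe_remote   # hypothetical callable(uris) -> {uri: triples}
        self._cache = {}

    def describe(self, pattern):
        uris = list(self._select(pattern))                 # round trip 1
        missing = [u for u in uris if u not in self._cache]
        if missing:
            fetched = self._describe(missing)              # round trip 2
            for uri in missing:
                self._cache[uri] = fetched.get(uri, [])
        return {uri: self._cache[uri] for uri in uris}
```

The second round trip disappears once every matched resource is already 
cached, which is where this would start to pay off.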

7/ Not SPARQL at all.

This gets into a very different kind of server.  It caches information 
(and here "cache" may be "publish") as information about things named by 
URI, e.g. all triples with the same subject.

Access is plain GET ?subject=<uri> -- it's a key-value store or document 
store providing SPARQL.  It will scale; it can use any one of the NoSQL 
KeyValue stores out there.
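
To make the access pattern concrete, here is a toy version with a Python 
dict standing in for the key-value store.  The `/data` path and the 
class name are assumptions for the example; only the `?subject=<uri>` 
parameter comes from the description above.

```python
from urllib.parse import parse_qs, urlsplit


class SubjectStore:
    """Triples grouped by subject; the subject URI is the key."""

    def __init__(self):
        self._by_subject = {}

    def put(self, s, p, o):
        self._by_subject.setdefault(s, []).append((s, p, o))

    def handle_get(self, request_target):
        """Return (status, triples) for a target like '/data?subject=<uri>'."""
        params = parse_qs(urlsplit(request_target).query)
        subjects = params.get("subject")
        if not subjects:
            return 400, []           # no subject parameter supplied
        triples = self._by_subject.get(subjects[0])
        if triples is None:
            return 404, []           # nothing known about this URI
        return 200, list(triples)
```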

Add secondary indexes - e.g. a Lucene index.  The index is simply a way 
to ask a question and get a list of URIs.  The URIs are accessed and all 
the RDF sent back to the requester.  How the index gets formed is not 
defined.

Or a geospatial index - get information about all things in a bounding box.
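
A small sketch of the "index returns URIs, then fetch everything about 
those URIs" flow, using a trivial in-memory inverted index in place of 
Lucene.  The names are invented; a bounding-box index would have the 
same shape, with a different question on the way in and the same list of 
URIs on the way out.

```python
import re
from collections import defaultdict


class TextIndex:
    """Toy inverted index: word -> set of URIs."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, uri, text):
        for token in re.findall(r"\w+", text.lower()):
            self._postings[token].add(uri)

    def search(self, word):
        return sorted(self._postings.get(word.lower(), set()))


def answer(index, store, word):
    """Ask the index for URIs, then return every triple stored for each one.

    `store` is assumed to be a dict mapping subject URI -> list of triples.
    """
    return [t for uri in index.search(word) for t in store.get(uri, [])]
```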

You can see this is like various NoSQL-ish things out there, and in the 
spirit of OData/GData -- this is "RData".

Obviously a step towards the linked-data-api here - an open question to 
me is whether there is a set of operations that would help an LDA service.


Cache validation: a lot of data is being published; it changes very 
rarely and timeliness of cache validity does not matter.  Checking could 
be done asynchronously to the request, e.g. every night at some quiet 
time (well, quiet for the cache user) and be gentle on the remote endpoint.

(For Fuseki+TDB, I'd like to get to support for conditional GETs using 
transactions to generate a new eTag.)
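
To show what I mean by that, a toy model (this is not Fuseki or TDB 
code, and the class is made up): every committed write bumps a version 
counter, the ETag is derived from the counter, and a GET carrying a 
matching If-None-Match value gets a 304 with no body.

```python
class VersionedDataset:
    """Dataset whose ETag changes only when a write is committed."""

    def __init__(self):
        self._version = 0
        self._triples = set()

    def commit(self, additions):
        """Apply a write 'transaction' and move to a new version."""
        self._triples |= set(additions)
        self._version += 1

    def etag(self):
        return f'"v{self._version}"'

    def get(self, if_none_match=None):
        """Return (status, etag, body); body is None for 304 Not Modified."""
        tag = self.etag()
        if if_none_match == tag:
            return 304, tag, None
        return 200, tag, sorted(self._triples)
```

A cache can then revalidate cheaply: send the stored ETag, and only 
re-download when the answer is not 304.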

	Andy