incubator-jena-dev mailing list archives

From Paolo Castagna <>
Subject Re: LD-Access
Date Thu, 08 Sep 2011 22:51:29 GMT
Andy Seaborne wrote:
> On 02/09/11 21:35, Paolo Castagna wrote:
>> I saw the "pseudo paging via query analysis of ORDER BY / OFFSET / LIMIT
>> access patterns" idea in  and it's great.
> The LD-Access project, which is just some play area at the moment, is 
> some ideas for caching SPARQL queries, particularly ones that 
> linked-data-api systems tend to generate.
> First - it needs a name.  Err ... "Luna".
> The starting point is have an intercepting SPARQL endpoint for another 
> endpoint.  The application uses the cache URL for SPARQL queries.
> It's not supposed to be a big project - it might be a servlet SPARQL 
> endpoint and/or yet-another query engine implementation.
> 1/ Same query.
> The most obvious is repeating the same query.  It's sometimes surprising 
> just how often a query is repeated, across users (e.g same starting 
> point of an app), but even by the same user.  

Indeed, people running SPARQL endpoints can easily find evidence of that in 
their query logs. Another useful thing to look at is the ratio of write to 
read requests, although this obviously depends on your application.
The smaller the percentage of writes, the more beneficial a cache layer in 
front of your SPARQL endpoint becomes (and the less important cache 
invalidation is). All obvious stuff, but it's good to remember.

> Having a close cache is noticeably faster than going to a remote endpoint.

Yep, in particular with large result sets (where transfer time can have a 
significant impact).

> HTTP caching also catches this, or it would. A GET with query string 
> isn't cached by squid apparently.
> 2/ Convert formats
> Given a proxy that is looking at the request, the format of the response 
> can be converted between formats, so convert from SPARQL XML to SPARQL 
> JSON for example.
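A converter like that is easy to sketch with just the JDK. This is my own toy example (not LD-Access code): it handles only URI and plain literal bindings and does no JSON string escaping, so it shows the shape of the proxy-side conversion rather than being production-ready.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Toy converter from the SPARQL XML results format to the SPARQL JSON
// results format. Handles only URI and plain literal bindings, and
// does no JSON string escaping.
public class ResultFormatConverter {

    public static String xmlToJson(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            StringBuilder json = new StringBuilder("{\"head\":{\"vars\":[");
            NodeList vars = doc.getElementsByTagName("variable");
            for (int i = 0; i < vars.getLength(); i++) {
                if (i > 0) json.append(',');
                json.append('"').append(((Element) vars.item(i)).getAttribute("name")).append('"');
            }
            json.append("]},\"results\":{\"bindings\":[");

            NodeList results = doc.getElementsByTagName("result");
            for (int i = 0; i < results.getLength(); i++) {
                if (i > 0) json.append(',');
                json.append('{');
                NodeList bindings = ((Element) results.item(i)).getElementsByTagName("binding");
                for (int j = 0; j < bindings.getLength(); j++) {
                    Element binding = (Element) bindings.item(j);
                    Element value = firstElementChild(binding); // <uri> or <literal>
                    if (j > 0) json.append(',');
                    json.append('"').append(binding.getAttribute("name"))
                        .append("\":{\"type\":\"").append(value.getTagName())
                        .append("\",\"value\":\"").append(value.getTextContent()).append("\"}");
                }
                json.append('}');
            }
            return json.append("]}}").toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("not a SPARQL XML result document", e);
        }
    }

    private static Element firstElementChild(Element e) {
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) return (Element) n;
        }
        throw new IllegalArgumentException("binding without a value element");
    }
}
```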
> 3/ Paging.
> The idiom of a sequence of SELECT / ORDER BY / OFFSET / LIMIT calls with 
> changes in OFFSET to get different slices of a result set happens in 
> linked data apps (and others).
> We've been optimizing these in ARQ using "top N" queries but LD-Access 
> can offer facilities at a different granularity.  Catch that query, 
> issue the full SELECT / ORDER BY query, cache the results.  Then you can 
> slice the results as pages without going back to the server.
> One side effect of this is paging without sorting, another is moving 
> sorting away from the origin server.
> Sorting is expensive but it's needed to guarantee stability of the 
> result set being sliced into pages.  So issue the query as SELECT and 
> either sort locally (you get to choose the resources available), to get 
> the same sorted pageable results.  Or if ordering is only for stability, 
> just remove the ORDER by and replace with a promise to slice from an 
> unchanging result set.
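To make that concrete, here is a rough sketch (names are mine, not LD-Access code) of how a proxy might recognise the OFFSET/LIMIT idiom, use the query without the slice as the cache key, and then serve pages locally from the cached, already-ordered rows:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical paging helper for a caching proxy: detects the
// SELECT ... ORDER BY ... OFFSET/LIMIT idiom, derives the "full"
// query used as the cache key, and slices cached rows locally.
public class PagingCache {

    private static final Pattern SLICE =
        Pattern.compile("\\s+(OFFSET|LIMIT)\\s+(\\d+)", Pattern.CASE_INSENSITIVE);

    // Strip OFFSET/LIMIT so all pages of the same query share one cache entry.
    public static String fullQuery(String query) {
        return SLICE.matcher(query).replaceAll("").trim();
    }

    public static int offsetOf(String query) { return sliceValue(query, "OFFSET"); }
    public static int limitOf(String query)  { return sliceValue(query, "LIMIT"); }

    private static int sliceValue(String query, String keyword) {
        Matcher m = SLICE.matcher(query);
        while (m.find()) {
            if (m.group(1).equalsIgnoreCase(keyword)) return Integer.parseInt(m.group(2));
        }
        return -1; // slice keyword absent
    }

    // Serve one page from the fully cached, already-ordered result rows.
    public static <T> List<T> page(List<T> cachedRows, int offset, int limit) {
        int from = Math.min(Math.max(offset, 0), cachedRows.size());
        int to = (limit < 0) ? cachedRows.size() : Math.min(from + limit, cachedRows.size());
        return cachedRows.subList(from, to);
    }
}
```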


> 4/ Add reliability/predictability.
> Defend the app from bad data - always get the entire results back to 
> check they will all be available before responding to the client.
> Or add query timeouts.
> Or fix formats if it isn't quite correct.

This would be really good.

One thing about a SPARQL endpoint over HTTP is that you do not know in advance 
how big your response will be, yet you must send the HTTP status code before 
the content. If you want to stream query results back, you send a 200 OK to 
the client and start streaming... if something bad happens at that point, your 
user gets a truncated answer back (which is really bad).

If you serve results from a cache, you know how big they are and problems are 
less likely. You can send back a Content-Length, or even a checksum, and 
users/clients can verify it.
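For example (just a sketch, not anything in LD-Access), the proxy could buffer the whole serialized result set and derive both headers from it:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: fully buffer a (cached) result before answering, so the
// proxy can send Content-Length and a digest header instead of
// streaming a 200 OK that might be truncated halfway through.
public class BufferedResponse {
    public final byte[] body;
    public final int contentLength;
    public final String sha256Hex;

    public BufferedResponse(String serializedResults) {
        this.body = serializedResults.getBytes(StandardCharsets.UTF_8);
        this.contentLength = body.length;
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(body);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            this.sha256Hex = hex.toString(); // clients can verify the body against this
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always present in the JDK
        }
    }
}
```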

> 5/ Intermittent endpoints.
> It's hard to run a public endpoint on the open web.

Indeed. ;-)

Timing out queries helps... and we now have this feature available.

If someone has good ideas on how to estimate the cost of a SPARQL query before 
even running it, we could offer a feature that refuses to run overly complex 
queries. I know, it's not nice for your users... but some queries deserve this 
treatment. This would be really useful for those running public SPARQL endpoints.
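For what it's worth, even a very naive heuristic can catch the worst offenders. This one is entirely made up (real cost estimation needs statistics from the store), but it shows the shape of such a gate:

```java
// Very naive, hypothetical cost heuristic: count triple patterns and
// expensive operators in the query text and refuse anything over a
// threshold. Real cost estimation needs statistics from the store.
public class QueryGate {

    public static int roughCost(String query) {
        String q = query.toUpperCase();
        int cost = countOccurrences(q, " . ") + 1;   // ~ number of triple patterns
        cost += 5 * countOccurrences(q, "OPTIONAL"); // OPTIONALs are expensive
        cost += 5 * countOccurrences(q, "REGEX");    // so are regex filters
        if (!q.contains("LIMIT")) cost += 10;        // unbounded result set
        return cost;
    }

    public static boolean accept(String query, int maxCost) {
        return roughCost(query) <= maxCost;
    }

    private static int countOccurrences(String s, String sub) {
        int n = 0;
        for (int i = s.indexOf(sub); i >= 0; i = s.indexOf(sub, i + 1)) n++;
        return n;
    }
}
```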

> dbpedia is not always up, and if it's up, then it's busy because of other requests.

Here we go: this is why you need replication to improve availability. ;-)

> dbpedia has a (necessary) defensive query execution timeout - it is 
> easier to get a query to run late at (American) night than European 
> afternoon.  Why not issue the queries for resources you want to track in 
> a batch script and pick up the results during the day?  Doesn't work for 
> all situations but it can be useful.
> 6/ Resource caching.
> 1-5 are about SPARQL queries, mainly SELECT.  What about caching data 
> about resources (all triples with the same subject)?
> Break up a DESCRIBE query into pattern and resources, issue the pattern, 
> see what resources it will describe and only get ones not cached.  This 
> might be a loss as it is a double round trip.

It would be good to try and see if it helps... it seems a good idea to me.

> 7/ Not SPARQL at all.
> This gets into a very different kind of server.  It caches information 
> (and here "cache" may be "publish") as information about things named by 
> URI, e.g. all triples with the same subject.
> Access is plain GET ?subject=<uri> -- it's a key-value store or document 
> store providing SPARQL.  It will scale; it can use any one of the NoSQL 
> KeyValue stores out there.

Indeed, it would be good if LD-Access made it easy to plug in different 
key/value stores for the caching, with in-memory and on-disk implementations.
It would also be good to have optional memcached or Redis support.
That way you could have multiple machines in front of your SPARQL endpoints 
doing the caching, with the cached data shared between those machines rather 
than the same thing being cached multiple times on different machines.
Using memcached or Redis from Java is easy.
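As a sketch of that plug-in idea (all names here are mine, nothing from LD-Access), the proxy could cache serialized results behind a small interface. An in-memory implementation is shown; a Redis or memcached backend (e.g. via the Jedis or spymemcached client libraries) would implement the same interface so the cache can be shared across machines:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical pluggable cache for serialized query results, keyed by
// the query string. A Redis or memcached backend would implement the
// same interface so multiple proxy machines can share one cache.
interface ResultCache {
    void put(String query, String serializedResults);
    String get(String query); // null on a miss
}

// The simplest possible backend: a thread-safe in-memory map.
class InMemoryResultCache implements ResultCache {
    private final Map<String, String> entries = new ConcurrentHashMap<>();

    public void put(String query, String serializedResults) {
        entries.put(query, serializedResults);
    }

    public String get(String query) {
        return entries.get(query);
    }
}
```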

> Add secondary indexes - e.g. a Lucene index.  The index is simply a way 
> to ask a question and get a list of URIs.  The URIs are accessed and all 
> the RDF sent back to the requester.  How the index gets formed is not 
> defined.
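As an illustration of that index contract -- "ask a question, get a list of URIs" -- here is a toy inverted index standing in for Lucene (my own sketch, not part of any project):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy secondary index: maps a lower-cased term to the URIs of
// matching resources. A real deployment would use Lucene; this only
// illustrates the "question in, URIs out" contract.
public class UriIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    public void add(String term, String uri) {
        index.computeIfAbsent(term.toLowerCase(Locale.ROOT), k -> new TreeSet<>()).add(uri);
    }

    // Ask a question, get back a list of URIs; the caller then fetches
    // the RDF for each URI and returns it to the requester.
    public List<String> lookup(String term) {
        return new ArrayList<>(index.getOrDefault(term.toLowerCase(Locale.ROOT),
                                                  Collections.emptySet()));
    }
}
```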


> Or a geospatial index - get information about all things in a bounding box.
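The geospatial case has the same shape: a box in, URIs out. A toy linear-scan version (a real index would use an R-tree or geohashes; this is only my illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Toy bounding-box lookup: returns the URIs of things inside a box.
// A real geospatial index would use an R-tree or a geohash scheme
// rather than a linear scan.
public class GeoIndex {
    private static final class Entry {
        final String uri; final double lat; final double lon;
        Entry(String uri, double lat, double lon) {
            this.uri = uri; this.lat = lat; this.lon = lon;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    public void add(String uri, double lat, double lon) {
        entries.add(new Entry(uri, lat, lon));
    }

    public List<String> inBox(double minLat, double minLon, double maxLat, double maxLon) {
        List<String> uris = new ArrayList<>();
        for (Entry e : entries) {
            if (e.lat >= minLat && e.lat <= maxLat && e.lon >= minLon && e.lon <= maxLon) {
                uris.add(e.uri);
            }
        }
        return uris;
    }
}
```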


> You can see this is like various NoSQL-ish things out there, and in the 
> spirit of OData/GData -- this is "RData".
> Obviously a step towards the linked-data-api here - an open question to 
> me is whether there is a set of operations that would help an LDA service.
> Cache validation: a lot of data is being published; it changes very 
> rarely and timeliness of cache validity does not matter.  Checking could 
> be done asynchronously to the request, e.g. every night at some quiet 
> time (well, quiet for the cache user) and be gentle on the remote endpoint.
> (For Fuseki+TDB, I'd like to get to support for conditional GETs using 
> transactions to generate a new eTag.)

+1 for conditional GETs via eTags.
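The handshake itself is simple. A sketch (the version counter here is a stand-in of my own invention for whatever Fuseki+TDB would derive from a transaction):

```java
// Sketch of the conditional-GET handshake: the server derives an ETag
// from some version marker and answers 304 Not Modified when the
// client's If-None-Match matches the current ETag.
public class ConditionalGet {

    public static String etagFor(long dataVersion) {
        return "\"v" + dataVersion + "\""; // ETags are quoted strings in HTTP
    }

    // Returns the HTTP status the cache should send: 304 if the
    // client's copy is still current, else 200 (with a fresh body).
    public static int status(String ifNoneMatch, long currentVersion) {
        return etagFor(currentVersion).equals(ifNoneMatch) ? 304 : 200;
    }
}
```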

When will it be ready for people to use?

I'd love to help, but Scala still scares me! ;-)


>     Andy
