abdera-user mailing list archives

From Chris Berry <chriswbe...@gmail.com>
Subject Re: RFC-5005 best practices
Date Fri, 14 Dec 2007 22:07:05 GMT
I'll add a little bit more to what Bryon has said (we're building  
this "AtomStore"  together ;-)

1) We've mixed together a few "specs". As Bryon said, we use some of
the OpenSearch elements (<startIndex>, <totalResults>,
<itemsPerPage>) and also some elements from the "paging" spec: <link
rel="next"/> (e.g. as in the Abdera extensions)
2) But we've also had to add our own element, <endIndex>, because
<startIndex> and <pageSize> do not yield enough information to
request the next page accurately. E.g. one cannot assume that
startIndex + pageSize will yield endIndex.
3) <totalResults> is optional, as it is expensive to produce
(another DB query) and is relatively useless info (and none of our
Clients wanted it... ;-)
4) "start-index" as a URL param is exclusive (i.e. the DB query is
"index > start-index"). This way Clients can just use <endIndex> from
the previous page (or, better, just use the "next" link)
5) This is not a general cursoring mechanism. If you first send a
query with "?start-index=1&max-results=10" and then send another
query with "?start-index=11&max-results=10", the service cannot
guarantee that the results are equivalent to
"?start-index=1&max-results=20", because insertions and deletions
could have taken place between the two queries.
6) Note that start-index should NOT be interpreted in any way. It is
used to order entries, but you should never assume anything about its
value.
7) We do not provide a <link rel="previous" ..> (as suggested by the  
paging "spec"). We do not have any practical need for this.
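To make points 2) and 4) concrete, here is a minimal sketch of the exclusive start-index query against an in-memory SQLite table (the table and column names are made up for illustration, not our actual schema):

```python
import sqlite3

# In-memory stand-in for the entry store; names are illustrative only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entries (seq INTEGER PRIMARY KEY, id TEXT)")
db.executemany("INSERT INTO entries VALUES (?, ?)",
               [(1, "RED"), (2, "BLUE"), (3, "GREEN"), (4, "YELLOW")])

def page(start_index, max_results):
    # "start-index" is exclusive: seq > start_index, never >=.
    rows = db.execute(
        "SELECT seq, id FROM entries WHERE seq > ? ORDER BY seq LIMIT ?",
        (start_index, max_results)).fetchall()
    # The last seq on the page is what gets reported as <endIndex>.
    end_index = rows[-1][0] if rows else start_index
    return [r[1] for r in rows], end_index

page1, end = page(0, 3)    # RED, BLUE, GREEN; endIndex is 3
page2, _ = page(end, 3)    # endIndex feeds straight into the next request
```

The point is that the <endIndex> of one page is exactly the start-index for the next request; the client never does any off-by-one arithmetic.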

So, in summary, we use:
openSearch:totalResults
        The total number of search results for the query (not  
necessarily all present in the results feed).
openSearch:startIndex
        The index of the first result.
openSearch:itemsPerPage
        The maximum number of items that appear on one page.

foo:endIndex
         The index of the final result. Note: the endIndex returned
by a Feed can be used as the start-index for a subsequent Feed page
request.

<link rel="next" type="application/atom+xml" href="..."/>
     Specifies the URI of the next chunk of this query result set, if
it is chunked. The client MUST use the next link when accessing the
next page of results. This link contains the "start-index" and
"max-results" parameters in the href URL
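Assembling those elements into a feed page looks roughly like this (a sketch with xml.etree; the foo: namespace on endIndex is omitted for brevity, and the parameter names follow this thread's conventions, not a standard):

```python
from xml.etree.ElementTree import Element, SubElement, tostring

OS = "{http://a9.com/-/spec/opensearch/1.1/}"

def feed_page(entries, start_index, max_results, base_url, total=None):
    # entries: list of (sequence_num, entry_id) pairs for this page.
    feed = Element("feed", xmlns="http://www.w3.org/2005/Atom")
    SubElement(feed, OS + "startIndex").text = str(start_index)
    SubElement(feed, OS + "itemsPerPage").text = str(max_results)
    if total is not None:                    # <totalResults> is optional
        SubElement(feed, OS + "totalResults").text = str(total)
    end_index = entries[-1][0]               # highest index on this page
    SubElement(feed, "endIndex").text = str(end_index)
    for _, entry_id in entries:
        SubElement(feed, "entry", id=entry_id)
    if len(entries) == max_results:          # a further page may exist
        SubElement(feed, "link", rel="next", type="application/atom+xml",
                   href="%s?start-index=%d&max-results=%d"
                        % (base_url, end_index, max_results))
    return tostring(feed, encoding="unicode")

page = feed_page([(1, "RED"), (2, "BLUE"), (3, "GREEN")], 1, 3,
                 "http://my.server/data/colors")
```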

BTW: Note: time is definitely not accurate enough here for paging.
Several items may get the same time-stamp (due to DB Date accuracy,
multi-threading, multiple servers, ...). And HTTP wants requests
against "lastModified" to be inclusive instead of exclusive, which is
not really what you're after in this case... You may see something
you've already seen (and that has not actually changed).
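A contrived illustration of that timestamp trap (all values made up; second-level DB precision assumed):

```python
from datetime import datetime

# Hypothetical: three writes collapse to the same second because the
# DB column only stores second-level precision.
stamp = datetime(2007, 12, 14, 22, 7, 5)
entries = [("RED", stamp), ("BLUE", stamp), ("GREEN", stamp)]

# Suppose a client saw RED and BLUE on page one and remembered
# last_seen = stamp.  An inclusive follow-up (modified >= last_seen,
# the If-Modified-Since style HTTP pushes you toward) re-returns
# entries the client already has:
inclusive = [eid for eid, ts in entries if ts >= stamp]
# ...while an exclusive follow-up (modified > last_seen) returns
# nothing and silently skips GREEN:
exclusive = [eid for eid, ts in entries if ts > stamp]
```

Neither choice works, which is why a sequence number beats a timestamp here.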

Cheers,
-- Chris 

On Dec 14, 2007, at 3:30 PM, Bryon Jacob wrote:

> I haven't read RFC-5005, but we're building a data service that  
> sounds somewhat similar to what you're doing, so I'll chime in on  
> pagination...
>
> Along the same lines as what David said, what we've done to address  
> pagination is a solution based on the way Google has integrated  
> OpenSearch into its querying APIs for GData.
>
> implementation-wise, the trick is to add a monotonically increasing  
> sequence number to every entry that we store, which we then use to  
> get very stable pagination when we pull feeds.
>
> here's how it works:
> -  as each entry is written to the store, it gets the next sequence  
> number to be assigned.
> -  if an entry is later updated, its previous sequence number is  
> overwritten with the next number in the sequence.
> -  we've modified our feed urls to accept an optional "start-index"  
> request parameter, which is a lower bound on sequence numbers to  
> return.
> -  when there are more results than the requested page-size, we add  
> a "next" link to the feed we return, which is a feed with the same  
> URI as the feed requested, but with the start-index set to one  
> higher than the highest sequence number on this page.
>
> what this guarantees is:
> -  you will never miss any data because of data changing while you  
> read the feed (very important to us, since, like you, we are using  
> this as a back-end data service to keep data synced across systems)
> -  you will only ever see the same entry occur twice during a feed  
> pull if it was in fact updated during the course of your paginating  
> through the data (in which case, you would want to get it twice to  
> be maximally in sync!)
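The scheme described above can be condensed into a short runnable sketch (an in-memory dict stands in for the DB table; the helper names are made up):

```python
import itertools

# In-memory stand-in for the entry meta-data table.
_next_seq = itertools.count(1)
store = {}                           # entry id -> sequence_num

def post(entry_id):
    store[entry_id] = next(_next_seq)    # new entry gets the next number

def put(entry_id):
    store[entry_id] = next(_next_seq)    # update replaces the old number

def feed(start_index=0, page_size=3):
    rows = sorted((seq, eid) for eid, seq in store.items()
                  if seq >= start_index)[:page_size]
    # "next" start-index is one above the highest seq on a full page.
    next_start = rows[-1][0] + 1 if len(rows) == page_size else None
    return [eid for _, eid in rows], next_start

for color in ("RED", "BLUE", "GREEN", "YELLOW"):
    post(color)
page1, nxt = feed()               # RED, BLUE, GREEN; next start-index 4
put("GREEN"); post("PURPLE")      # GREEN jumps to 5, PURPLE gets 6
page2, _ = feed(start_index=nxt)  # YELLOW, GREEN, PURPLE
```

This reproduces the worked example below: the updated GREEN moves past YELLOW and shows up again on page two.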
>
> here's Google's reference on the subject ==>
> http://code.google.com/apis/gdata/reference.html
>
>
> In case my explanation above is a bit fuzzy, I'll give a simple  
> example of how this all works:
>
> let's say we start off with a totally empty feed store of "colors",  
> and someone POSTs the entries RED, BLUE, GREEN, and YELLOW to the  
> feed - our DB table that stores entry metadata would look like:
>
> id			sequence_num
> RED		1
> BLUE		2
> GREEN		3
> YELLOW	4
>
> and the query we use to get the page of results, in pseudo-SQL, is  
> something like:
>   	SELECT TOP page_size id
> 	FROM meta_data
> 	WHERE "it matches my feed request URI"
> 	AND sequence_num >= start_index
> 	ORDER BY sequence_num
>
> so, if someone came along to pull the feed
> http://my.server/data/colors?page-size=3, they would get (in
> pseudo-feed-xml-ish):
> 	<feed>
> 		<entry id="RED">...</entry>
> 		<entry id="BLUE">...</entry>
> 		<entry id="GREEN">...</entry>
> 		<link rel="next" href="http://my.server/data/colors?page-size=3&start-index=4"/>
> 	</feed>
>
> then, if they followed the link
> http://my.server/data/colors?page-size=3&start-index=4, they would get:
> 	<feed>
> 		<entry id="YELLOW">...</entry>
> 	</feed>
>
> that's it for the simplest case --  if, however, in the time  
> between when the user pulled the first page of the feed and the  
> second, someone had PUT an update to GREEN, and POSTED a new entry  
> PURPLE, the DB table would now look like:
>
> id			sequence_num
> RED		1
> BLUE		2
> GREEN		5					-- GREEN is updated to 5 by the PUT
> YELLOW	4
> PURPLE	6					-- PURPLE is then inserted with sequence_num 6
>
> and when the user follows the link
> http://my.server/data/colors?page-size=3&start-index=4, they would
> instead get:
> 	<feed>
> 		<entry id="YELLOW">...</entry>
> 		<entry id="GREEN">...</entry>
> 		<entry id="PURPLE">...</entry>
> 	</feed>
>
> note that they have now gotten GREEN twice during the "same" feed  
> pull, but that's as it should be, because GREEN changed between  
> page 1 and page 2 of the feed.
>
> it's a nice side effect of this solution that it works just fine no  
> matter how long you, as the client, take to process a page of  
> results, or how long you choose to wait between pages - whenever  
> you request the next page, it is guaranteed to simply be "the next  
> N significant changes after the page I previously pulled", where N  
> is your page size.
>
> one more thing that's maybe worth noting is this - notice I said  
> "significant" changes -- this solution hides from you any changes  
> that were superseded before you got around to them -- which is  
> almost certainly what you want, but it's worth being aware of.  if,  
> after our client had pulled the first page of results back, the  
> following things had happened:
> 	PUT GREEN
> 	POST PURPLE
> 	PUT GREEN (again)
>
> the DB would look like:
>
> id			sequence_num
> RED		1
> BLUE		2
> GREEN		7				-- GREEN is updated to 5 by the first PUT, then to 7 by  
> the SECOND
> YELLOW	4
> PURPLE	6				-- PURPLE is inserted with sequence_num 6
>
> and when the user follows the link
> http://my.server/data/colors?page-size=3&start-index=4, they would
> instead get:
> 	<feed>
> 		<entry id="YELLOW">...</entry>
> 		<entry id="PURPLE">...</entry>
> 		<entry id="GREEN">...</entry>
> 	</feed>
>
> note that on page one, we saw the initial revision of GREEN, and on  
> page 2 we see the THIRD revision of GREEN -- we never saw the  
> second.  again, unless you're doing something pretty unusual with  
> Atom feeds, you probably don't care, because if you HAD gotten the  
> second, you would have just overwritten it with the third - but  
> it's worth being aware of what's really happening.
>
> Hope this helps - if you have any more questions (or critiques!)  
> about this strategy, we'd love to hear them.  thanks!
>
> - Bryon
>
>
>
> On Dec 14, 2007, at 2:40 PM, David Calavera wrote:
>
>> why don't you use the openSearch format?
>>
>> On Dec 14, 2007 8:57 PM, Remy Gendron <remy.gendron@arrova.ca> wrote:
>>
>>> My APP server isn't used in the context of a standard feed  
>>> provider. It
>>> will
>>> be more of a web interface to a backend data server. We are  
>>> leveraging
>>> Atom/APP/REST as a generic data provider interface for our web  
>>> services.
>>>
>>> That's why your suggestion, although pretty good for feeds, is not
>>> applicable here. I really want to chunk large datasets/search  
>>> results.
>>>
>>> I am also willing to live with some infrequent inconsistencies while
>>> scanning the pages following concurrent create/delete ops.
>>>
>>> My question was really about naming conventions when providing  
>>> the page
>>> size
>>> and page index as URL parameters.
>>>
>>> Thanks again,
>>>
>>> - Remy
>>>
>>>
>>> -----Original Message-----
>>> From: James M Snell [mailto:jasnell@gmail.com]
>>> Sent: 14 December 2007 13:57
>>> To: abdera-user@incubator.apache.org
>>> Subject: Re: RFC-5005 best practices
>>>
>>> I've implemented paging a number of times.  The easiest approach has
>>> always been to use page and pagesize.  Doing so, however, has its
>>> disadvantages.  For one, the pages are unstable -- that is, as new
>>> entries are added to the collection, the entries slide through the
>>> pages, making it difficult for a client to completely and
>>> consistently sync up the changes.  An alternative approach would be
>>> to base paging on date ranges, where each page could represent all
>>> entries modified within a given period of time.  Such pages will
>>> generally be much less volatile over time.
>>>
>>> - James
>>>
>>> Remy Gendron wrote:
>>>> Hello all,
>>>>
>>>>
>>>>
>>>> I'm implementing paging in my Abdera server. FeedPagingHelper  
>>>> covers the
>>>> spec…
>>>>
>>>>
>>>>
>>>> But do you recommend any best practices on passing in the  
>>>> parameters?
>>>> (pageSize, pageIndex)
>>>>
>>>>
>>>>
>>>> I haven't seen any recommendations from Abdera… Do you recommend
>>>> Google's GData query extensions?
>>>>
>>>>
>>>>
>>>> Thanks a lot for the great implementation!
>>>>
>>>>
>>>>
>>>> Rémy
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> remy.gendron@arrova.ca <mailto:remy.gendron@arrova.ca>
>>>>
>>>> 418 809-8585
>>>>
>>>> http://www.arrova.ca <http://www.arrova.ca/>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>> -- 
>> David Calavera
>> http://www.thinkincode.net
>

S'all good  ---   chriswberry at gmail dot com



