On 6 August 2013 15:12, Keith Freeman <firstname.lastname@example.org> wrote:
I've seen in several places the advice to use queries like this to page through lots of rows:
select id from mytable where token(id) > token(last_id)
But it's hard to find detailed information about how this works (at least that I can understand -- the description in the Cassandra manual is pretty brief).
One thing I'd like to know is whether new rows are always guaranteed to have token(new_id) > token(ids-of-all-previous-rows)? E.g. if I have one process that adds rows to a table, and another that processes rows from the table, can the "processor" save the id of the last row processed and, when it wakes up, use:
select * from mytable where token(id) > token(last_processed_id)
to process only new rows? Will this always work to get only new rows?
No, unfortunately not. The tokens are generated by the partitioner - they are the hash of the row key. New tokens could be anywhere in the range of tokens so you can't use token ordering to find new rows.
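To see why token order tells you nothing about insertion order, here is a minimal Python sketch. It assumes an MD5-based hash as a stand-in for the partitioner (RandomPartitioner hashes row keys with MD5; other partitioners use different hash functions), and the `toy_token` helper and the `row-N` keys are made up for illustration. It is not how Cassandra computes tokens internally, only a way to show the effect.

```python
import hashlib

def toy_token(row_key: str) -> int:
    # Hypothetical stand-in for token(id): hash the row key and
    # interpret the digest as one large integer.
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")

# Keys listed in the order they were "inserted".
keys = [f"row-{i}" for i in range(1, 21)]

# The order in which a token-range scan would return them.
by_token = sorted(keys, key=toy_token)

print("insertion order:", keys)
print("token order:    ", by_token)
```

The two printed lists contain the same keys in different orders, so a row inserted later can easily have a smaller token than last_processed_id and would be skipped by the query above.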
The query you suggest works to page through all the data in your column family. Rows will be returned regardless of when they were added (as long as they were added before the query started). Finding rows that have been added since a certain time is hard in Cassandra because rows are stored in token order, not insertion order. In general you have to read through all the data and work out from, e.g., a date field whether each row should be treated as new.
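The "read everything and filter on a date field" approach can be sketched as follows. This is an in-memory simulation in Python, not driver code: `fetch_page` only mimics what a query of the form select ... where token(id) > token(last_id) limit N would return, `toy_token` is the same hypothetical MD5-based stand-in for the partitioner, and the `created` field is an assumed timestamp column that the writer would have to populate.

```python
import hashlib
from datetime import datetime, timedelta

def toy_token(row_key):
    # Hypothetical stand-in for token(id); real tokens come from the partitioner.
    return int.from_bytes(hashlib.md5(row_key.encode("utf-8")).digest(), "big")

def fetch_page(table, last_id, page_size):
    # Mimics one paged query: rows in token order, starting strictly
    # after the token of the last row seen (or from the start if None).
    ordered = sorted(table, key=lambda row: toy_token(row["id"]))
    if last_id is not None:
        floor = toy_token(last_id)
        ordered = [row for row in ordered if toy_token(row["id"]) > floor]
    return ordered[:page_size]

def rows_created_after(table, cutoff, page_size=3):
    # Page through the whole table and keep only rows whose
    # (assumed) 'created' field is later than the cutoff.
    found = []
    last_id = None
    while True:
        page = fetch_page(table, last_id, page_size)
        if not page:
            break
        found.extend(row for row in page if row["created"] > cutoff)
        last_id = page[-1]["id"]
    return found

start = datetime(2013, 8, 6, 12, 0)
table = [{"id": f"row-{i}", "created": start + timedelta(minutes=i)}
         for i in range(10)]

cutoff = start + timedelta(minutes=6)
new_rows = rows_created_after(table, cutoff)
new_ids = sorted(row["id"] for row in new_rows)
print(new_ids)
```

Every row is still read on each pass; the date check only decides which rows to act on. The token comparison is used purely to resume paging, never to decide what is new.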