Have you considered creating a second column family which acts as an
index for the original column family? Have the record number as the
column name, and the value as the identifier (primary key) of the
original data, and do a
start='00051235', finish='', limit=100)
columns=<list of returned column values>)
This way, only 100 columns are returned on the first call, and 100
columns (or super columns) on the second. You have two calls instead of
one, but it should be faster because much less data is transferred (and
the latency can be hidden
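The index idea above can be sketched in a few lines of Python. This is only an in-memory simulation, not Cassandra client code: `get_slice`, `data_cf`, `index_row` and `fetch_page` are hypothetical names, and the sketch assumes record numbers are dense, zero-padded, and kept up to date on every insert and delete.

```python
from bisect import bisect_left

def get_slice(row, start="", count=100):
    """Toy stand-in for a slice call. `row` maps column name -> value;
    columns come back in sorted name order, beginning at `start` (inclusive)."""
    names = sorted(row)
    i = bisect_left(names, start) if start else 0
    return [(n, row[n]) for n in names[i:i + count]]

# Hypothetical original column family: primary key -> record.
data_cf = {"key-%04d" % i: {"payload": "record %d" % i} for i in range(1000)}

# Hypothetical index row: zero-padded record number -> primary key.
# Zero padding makes lexicographic order match numeric order.
index_row = {"%08d" % n: key for n, key in enumerate(sorted(data_cf))}

def fetch_page(page, page_size=100):
    start = "%08d" % (page * page_size)
    # Call 1: read only `page_size` index columns, starting at the record number.
    keys = [key for _, key in get_slice(index_row, start=start, count=page_size)]
    # Call 2: fetch the original data for those keys.
    return [data_cf[k] for k in keys]

page = fetch_page(3, page_size=100)
```

Each call moves only `page_size` columns regardless of how deep the page is. The cost is keeping the index row consistent with the original data, which this sketch does not attempt.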
Yes, I read the PDF to the finish. Twice. As I wrote, my application is
not accessed by users; it's accessed by other applications that can access
So when some application wants to get page 51235 (so skip
is 5123500, limit is 100) then I have to:
1) GetSlice(from: "", to: "", limit: 5123500)
2) Read only the last column name.
3) GetSlice(from: point2value, to: "", limit: 100)
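The three steps above can be simulated in Python to see how many columns cross the wire. This is a sketch under stated assumptions: `CountingRow` and `page_by_skip` are hypothetical names, the row is an in-memory sorted list rather than a Cassandra row, and the slice start is treated as inclusive.

```python
from bisect import bisect_left

class CountingRow:
    """Toy stand-in for one wide row; counts columns returned to the caller."""
    def __init__(self, names):
        self.names = sorted(names)
        self.transferred = 0

    def get_slice(self, start="", count=100):
        # An empty start means "from the first column"; start is inclusive.
        i = bisect_left(self.names, start) if start else 0
        result = self.names[i:i + count]
        self.transferred += len(result)
        return result

def page_by_skip(row, skip, limit):
    if skip == 0:
        return row.get_slice("", limit)
    # 1) Slice from the beginning with the limit set to `skip`.
    skipped = row.get_slice("", skip)
    if not skipped:
        return []
    # 2) Only the last column name is needed; the rest is discarded.
    last = skipped[-1]
    # 3) Slice again starting at that name. Because start is inclusive in
    #    this sketch, the page begins with the last skipped column.
    return row.get_slice(last, limit)

row = CountingRow("%08d" % i for i in range(10000))
page = page_by_skip(row, skip=5000, limit=100)
```

In this run `row.transferred` ends at 5100 for a page of 100 columns, so the overhead grows linearly with `skip`.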
The problem is in 1) - Cassandra has to read 5123500 columns,
serialize them, send them using the Thrift protocol and deserialize them.
Finally, I throw 5123499 of the columns away. It doesn't seem to be very
So I'm looking for another solution for this scenario. I know the right
way to paginate in Cassandra and I use it when I can...
So if this kind of pagination cannot be added to the standard Cassandra
Thrift API, then I should create a separate Thrift API that will handle my
scenario (and avoid high network traffic). Am I right?
2010/9/5 Mike Peters <email@example.com>
Did you read the PDF Stu sent over, start to finish? There are several
different approaches
With Cassandra, what we found works best for
* Keep a separate 'total_records' count and increment/decrement it on
  every insert/delete
* When getting slices, pass 'last seen' as the 'from' and keep the 'to'
  empty. Pass the number of records you want to show per page in the
  'count'.
* Avoid letting the user skip to page X; use Next/Prev/First/Last only
  (the same way GMail does it)
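A minimal Python sketch of the list above, using an in-memory row instead of a real cluster. `PagedRow` and `next_page` are hypothetical names. One choice here is not from the thread: because the slice start is inclusive in this sketch, `next_page` asks for one extra column and drops the 'last seen' one so that pages do not overlap.

```python
from bisect import bisect_left

class PagedRow:
    """Toy stand-in for one wide row plus a hand-maintained record count."""
    def __init__(self):
        self.names = []          # sorted column names
        self.total_records = 0   # updated on every insert/delete

    def insert(self, name):
        i = bisect_left(self.names, name)
        if i == len(self.names) or self.names[i] != name:
            self.names.insert(i, name)
            self.total_records += 1

    def delete(self, name):
        i = bisect_left(self.names, name)
        if i < len(self.names) and self.names[i] == name:
            del self.names[i]
            self.total_records -= 1

    def get_slice(self, start="", count=100):
        # An empty start means "from the first column"; start is inclusive.
        i = bisect_left(self.names, start) if start else 0
        return self.names[i:i + count]

def next_page(row, last_seen, per_page):
    if not last_seen:
        return row.get_slice("", per_page)
    # Fetch one extra column and drop `last_seen` itself if it is still there.
    cols = row.get_slice(last_seen, per_page + 1)
    if cols and cols[0] == last_seen:
        cols = cols[1:]
    return cols[:per_page]

row = PagedRow()
for i in range(250):
    row.insert("%05d" % i)
row.delete("00007")

first = next_page(row, "", 100)
second = next_page(row, first[-1], 100)
total_pages = -(-row.total_records // 100)  # ceiling division
```

Every page costs one slice of at most `per_page + 1` columns, and the counter gives the page total for a "Last" link without scanning the row.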
Michal Augustín wrote:
I know that "Prev/Next" is a good solution for web applications. But
when I want to access data from another application or when I want to
access pages randomly...
I don't know the internal structure of memtables etc., so I don't
know if columns in a row are indexable. If not, then I just want to
transfer my workaround to the server (to avoid huge network
2010/9/5 Stu Hood <firstname.lastname@example.org>
Cassandra supports the recommended approach from: http://www.percona.com/ppc2009/PPC2009_mysql_pagination.pdf
large numbers of items, skip + limit is extremely inefficient.
From: "Michal Augustín" <email@example.com>
September 5, 2010 5:39am
Subject: skip + limit support in GetSlice
Probably this is a feature request. Simply, I would like to have support
for (skip + limit) in the GetSlice Thrift method. Is this feature on the road
Now, I have to perform a GetSlice call that starts at "" with "limit"
set to the "skip" value. Then I read the last column name returned and
subsequently perform the final GetSlice call - I use the last column
name as "start" and set "limit" to the "limit" value.
This workaround is not very efficient when I need to skip a lot of
columns (so "skip" is high) - then a lot of data must be transferred via
network. So I think that support for Skip in GetSlice would be very
useful (to avoid high network traffic).
implementation could be very straightforward (same as the workaround)
maybe it could be more efficient - I think that whole row (so all
must fit into memory so if we have all columns in