we did some testing and found that range queries are much quicker
than querying data regularly. I am guessing that a range query
request seeks much more efficiently on disk.

This is where the idea of sorting our tokens comes in. We have a
batch request of, say, 1000 items, and instead of doing a multiget
from Cassandra, which involves a lot of random I/O seeks, we would
like a way to seek over the whole range. It doesn't actually matter
if the range is slightly bigger than the number of items we want to
retrieve, as the time we lose filtering out unneeded items in code
is less than the time a multiget for 1000 items takes in the first
place.
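To make the idea concrete, here is a minimal sketch of what we mean
(plain Python, not the Astyanax or Cassandra API; `token_of` and
`scan_range` are hypothetical stand-ins for the partitioner and a
range read): one sequential range scan plus client-side filtering
instead of a per-key multiget.

```python
# Illustrative sketch only -- not real Cassandra/Astyanax calls.
# `token_of(key)` stands in for the partitioner's key -> token mapping,
# `scan_range(start, end)` stands in for a single sequential range read.
def batch_read(keys, token_of, scan_range):
    """Read `keys` via one range scan instead of a per-key multiget."""
    wanted = set(keys)
    tokens = sorted(token_of(k) for k in keys)
    # One sequential scan over [min_token, max_token]; the range may
    # contain rows we did not ask for -- the premise is that dropping
    # them in code is cheaper than 1000 random single-key reads.
    rows = scan_range(tokens[0], tokens[-1])
    return [(k, v) for k, v in rows if k in wanted]
```

The win depends entirely on how dense the wanted keys are within the
scanned range; a very sparse batch would make the scan read far more
rows than the multiget would.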
Is there a way of basing token ranges on a certain value in our
schema? Say every row has values A and B. While A is just a random
identifier and we can't really rely on what it will be, all our
queries operate in such a way that B is the same value for all
items in the query. If the token values were still random, but
generated based on the B value, then all items with the same B
would be close together in the token range, and therefore optimized
for range queries rather than individual gets. That could possibly
speed up read performance.
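Roughly what we have in mind, as a sketch (this is an assumption of
mine, not how Cassandra's partitioner actually works; `token_for`
and `range_for_b` are made up for illustration): derive the token's
high bits from B, so all items sharing a B value fall in one
contiguous token range, while A still spreads them within it.

```python
import hashlib

# Sketch only -- NOT Cassandra's real partitioner. Top 32 bits of the
# token come from B (so equal-B items cluster), bottom 32 bits from A
# (so items still spread out within the B range).
def token_for(a, b):
    hb = int.from_bytes(hashlib.md5(b.encode()).digest()[:4], "big")
    ha = int.from_bytes(hashlib.md5(a.encode()).digest()[:4], "big")
    return (hb << 32) | ha

def range_for_b(b):
    """Token range covering every item with this B value."""
    hb = int.from_bytes(hashlib.md5(b.encode()).digest()[:4], "big")
    return (hb << 32, ((hb + 1) << 32) - 1)
```

If I understand Cassandra's data model right, the usual way to get
this clustering effect is to make B (part of) the row key and store
the A values as columns within that row, so one row read replaces
the multiget, but I'd be happy to be corrected on that.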
On 21/10/13 16:58, Edward Capriolo wrote:
I am not sure what you are working on will have an effect. You
cannot actually control the way the operating system seeks data on
disk; the I/O scheduling is done outside Cassandra. You can try to
write the code in an optimistic way, taking physical hardware into
account, but then you have to consider that there are n concurrent
requests on the I/O system.
On Friday, October 18, 2013, Viktor Jevdokimov <Viktor.Jevdokimov@adform.com> wrote:
> Read latency depends on many factors, don't forget "physics".
> If it meets your requirements, it is good.
> -----Original Message-----
> From: Artur Kronenberg [mailto:email@example.com]
> Sent: Friday, October 18, 2013 1:03 PM
> To: firstname.lastname@example.org
> Subject: Re: Sorting keys for batch reads to minimize seeks
> Thanks for your reply. Our latency currently is 23.618 ms.
> However, I simply read that off one node just now while it wasn't
> under a load test. I will be able to get a better number after
> the next test run.
> What is a good value for read latency?
> On 18/10/13 08:31, Viktor Jevdokimov wrote:
>> The only thing you may win - avoiding unnecessary network hops -
>> is to request sorted keys (by token) from the appropriate replica
>> with ConsistencyLevel.ONE and "dynamic_snitch: false". That only
>> helps if:
>> - the nodes have the same load
>> - the replica is not in a GC pause, as GC pauses are much higher
>> than internode communication.
>> For a multiple-key request, C* will do multiple single-key
>> reads, except for range scan requests, where only the starting
>> key and a batch size are used in the request.
>> Consider a multiple-key request slow by design; try to model
>> your data for low-latency single-key reads.
>> So, what latencies do you want to achieve?
>> Best regards / Pagarbiai
>> Viktor Jevdokimov
>> Senior Developer
>> Email: Viktor.Jevdokimov@adform.com
>> Phone: +370 5 212 3063
>> Fax: +370 5 261 0453
>> J. Jasinskio 16C,
>> LT-03163 Vilnius,
>> -----Original Message-----
>> From: Artur Kronenberg [mailto:email@example.com]
>> Sent: Thursday, October 17, 2013 7:40 PM
>> To: firstname.lastname@example.org
>> Subject: Sorting keys for batch reads to minimize seeks
>> I am looking to somehow increase read performance on Cassandra.
>> We are still playing with configurations, but I was wondering
>> whether there are solutions in software that might help us speed
>> up our reads.
>> E.g. one idea, not sure how sane it is, was to sort read batches
>> by row key before submitting them to Cassandra. The idea is that
>> sorted row keys should be closer together on the physical disk,
>> and therefore this may minimize the number of random seeks we
>> have to do when querying, say, 1000 entries from Cassandra. Does
>> that make any sense?
>> Is there anything else we can do in software to improve
>> performance, like specific batch sizes for reads? We are using
>> the Astyanax library to access Cassandra.