cassandra-user mailing list archives

From Jack Krupansky <jack.krupan...@gmail.com>
Subject Re: Timeout error in fetching million rows as results using clustering keys
Date Wed, 18 Mar 2015 12:19:47 GMT
Cassandra can certainly handle millions and even billions of rows, but...
it is a very clear anti-pattern to design a single query to return more
than a relatively small number of rows except through paging. How small?
Low hundreds is probably a reasonable limit. It is also an anti-pattern to
filter or analyze a large number of rows in a single query. That is why
there are so many crazy restrictions, including the requirement to use ALLOW
FILTERING: they reinforce that Cassandra is designed for short, performant
queries, not large-scale retrieval of rows. As a general rule, the use of
ALLOW FILTERING is an anti-pattern and a yellow flag that you are doing
something wrong.

As a minor point, check your partition key. You should try to "bucket"
rows that tend to be accessed together into the same partition, so that
they have locality and can be fetched together.

Rather than using a raw x and y coordinate range, consider indexing by a
"chunk" number and then you can query by chunk number for direct access to
the partition and row key, without the need for inequality filtering.
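
A minimal sketch of the chunk idea in Java, offered as an illustration
rather than as part of the original advice. It assumes fixed-size square
tiles; the class name ChunkIndex, the tile size of 1000, and the "column:row"
label format are all hypothetical. It maps a point to a chunk label and
lists the labels that cover a rectangle, so each one could be looked up by
equality instead of by a coordinate range.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: maps coordinates onto fixed-size square tiles ("chunks").
final class ChunkIndex {
    // Tile edge length, assumed to be in the same units as x and y.
    private final double chunkSize;

    ChunkIndex(double chunkSize) {
        if (!(chunkSize > 0)) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
    }

    // Index of the tile column (or row) that contains the coordinate.
    long cell(double coordinate) {
        return (long) Math.floor(coordinate / chunkSize);
    }

    // Chunk label for a single point, in the assumed "column:row" format.
    String chunkOf(double x, double y) {
        return cell(x) + ":" + cell(y);
    }

    // Labels of every chunk that overlaps the given rectangle.
    List<String> chunksCovering(double minX, double minY, double maxX, double maxY) {
        List<String> chunks = new ArrayList<>();
        for (long cx = cell(minX); cx <= cell(maxX); cx++) {
            for (long cy = cell(minY); cy <= cell(maxY); cy++) {
                chunks.add(cx + ":" + cy);
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        ChunkIndex index = new ChunkIndex(1000.0); // assumed tile size
        System.out.println(index.chunkOf(2500.0, 700.0));
        System.out.println(index.chunksCovering(900.0, 0.0, 2100.0, 1500.0));
    }
}
```

If the table stored such a label in a hypothetical chunk column that is part
of the partition key, each label returned by chunksCovering could be fetched
with an equality predicate on the partition, one small query per chunk.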


-- Jack Krupansky

On Wed, Mar 18, 2015 at 3:22 AM, Mehak Mehta <memehta@cs.stonybrook.edu>
wrote:

> Hi Jens,
>
> I have tried a fetch size of 10000, but it is still not giving any results.
> My expectation was that Cassandra can handle a million rows easily.
>
> Is there any mistake in the way I am defining the keys or querying them?
>
> Thanks
> Mehak
>
> On Wed, Mar 18, 2015 at 3:02 AM, Jens Rantil <jens.rantil@tink.se> wrote:
>
>> Hi,
>>
>> Try setting fetchsize before querying. Assuming you don't set it too
>> high, and you don't have too many tombstones, that should do it.
>>
>> Cheers,
>> Jens
>>
>> –
>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>
>>
>> On Wed, Mar 18, 2015 at 2:58 AM, Mehak Mehta <memehta@cs.stonybrook.edu>
>> wrote:
>>
>>> Hi,
>>>
>>> I have a requirement to fetch a million rows as the result of my query,
>>> which is giving timeout errors.
>>> I am fetching results by selecting on clustering columns, so why are the
>>> queries taking so long? I can change the timeout settings, but I need
>>> the data to be fetched faster as per my requirement.
>>>
>>> My table definition is:
>>> CREATE TABLE images.results (
>>>     uuid uuid, analysis_execution_id varchar, analysis_execution_uuid uuid,
>>>     x double, y double, loc varchar, w double, h double,
>>>     normalized varchar, type varchar, filehost varchar, filename varchar,
>>>     image_uuid uuid, image_uri varchar, image_caseid varchar,
>>>     image_mpp_x double, image_mpp_y double,
>>>     image_width double, image_height double,
>>>     objective double, cancer_type varchar, Area float,
>>>     submit_date timestamp, points list<double>,
>>>     PRIMARY KEY ((image_caseid), Area, uuid)
>>> );
>>>
>>> Here each row is uniquely identified by its uuid. But since my data is
>>> generally queried by image_caseid, I have made that the partition key.
>>> I am currently using the DataStax Java API to fetch the results, but the
>>> query takes a long time and results in timeout errors:
>>>
>>> Exception in thread "main" com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response))
>>>     at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
>>>     at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
>>>     at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205)
>>>     at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
>>>     at QueryDB.queryArea(TestQuery.java:59)
>>>     at TestQuery.main(TestQuery.java:35)
>>> Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response))
>>>     at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108)
>>>     at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>     at java.lang.Thread.run(Thread.java:744)
>>>
>>> I also get an error when I run the same query on the console, even with
>>> a limit of 2000 rows:
>>>
>>> cqlsh:images> select count(*) from results where image_caseid='TCGA-HN-A2NL-01Z-00-DX1' and Area<100 and Area>20 limit 2000;
>>> errors={}, last_host=127.0.0.1
>>>
>>> Thanks and Regards,
>>> Mehak
>>
>
