cassandra-user mailing list archives

From Ali Akhtar <ali.rac...@gmail.com>
Subject Re: Timeout error in fetching million rows as results using clustering keys
Date Wed, 18 Mar 2015 09:53:18 GMT
Yeah, it may be that the process is being limited by swap. This page:

https://gist.github.com/aliakhtar/3649e412787034156cbb#file-cassandra-install-sh-L42

Lines 42-48 list a few settings you could try for increasing or reducing the
memory limits (assuming you're on Linux).

Also, are you using an SSD? If so, make sure the I/O scheduler is noop or
deadline.
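(For reference, the heap settings in question live in cassandra-env.sh. The
values below are only illustrative for a machine with ~4 GB of RAM, not the
settings from the gist; the device name sda is likewise just an example:)

```shell
# cassandra-env.sh -- illustrative values for a ~4 GB machine (assumption,
# not copied from the gist linked above):
MAX_HEAP_SIZE="1G"      # cap the JVM heap so the OS page cache keeps some RAM
HEAP_NEWSIZE="256M"     # young generation; commonly ~1/4 of MAX_HEAP_SIZE

# Check the current I/O scheduler (the active one is shown in [brackets]),
# then switch it, e.g. to noop for an SSD:
cat /sys/block/sda/queue/scheduler
echo noop | sudo tee /sys/block/sda/queue/scheduler
```

Note the scheduler change above is not persistent across reboots; that usually
goes on the kernel command line (`elevator=noop`) or in a udev rule.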

On Wed, Mar 18, 2015 at 2:48 PM, Mehak Mehta <memehta@cs.stonybrook.edu>
wrote:

> Currently the Cassandra java process is taking 1% of cpu (8% total is being
> used) and 14.3% of memory (out of 4G total).
> As you can see there is not much load from other processes.
>
> Should I try changing the default memory parameters in the Cassandra
> settings?
>
> On Wed, Mar 18, 2015 at 5:33 AM, Ali Akhtar <ali.rac200@gmail.com> wrote:
>
>> What's your memory / CPU usage at? And how much ram + cpu do you have on
>> this server?
>>
>>
>>
>> On Wed, Mar 18, 2015 at 2:31 PM, Mehak Mehta <memehta@cs.stonybrook.edu>
>> wrote:
>>
>>> Currently there is only a single node, which I am calling directly, with
>>> around 150000 rows. The full data will be around billions of rows per node.
>>> The code works only for fetch sizes of 100/200. Also, each consecutive
>>> fetch is taking around 5-10 secs.
>>>
>>> I have a parallel script which is inserting the data while I am reading
>>> it. When I stopped the script it worked for 500/1000 but not more than
>>> that.
>>>
>>>
>>>
>>> On Wed, Mar 18, 2015 at 5:08 AM, Ali Akhtar <ali.rac200@gmail.com>
>>> wrote:
>>>
>>>>  If even 500-1000 isn't working, then your cassandra node might not be
>>>> up.
>>>>
>>>> 1) Try running nodetool status from shell on your cassandra server,
>>>> make sure the nodes are up.
>>>>
>>>> 2) Are you calling this on the same server where cassandra is running?
>>>> It's trying to connect to localhost. If you're running it on a different
>>>> server, try passing in the direct ip of your cassandra server.
>>>>
>>>> On Wed, Mar 18, 2015 at 2:05 PM, Mehak Mehta <memehta@cs.stonybrook.edu>
>>>> wrote:
>>>>
>>>>> Data won't change much but queries will be different.
>>>>> I am not working on the rendering tool myself, so I don't know many
>>>>> details about it.
>>>>>
>>>>> Also, as you suggested, I tried to fetch data in sizes of 500 or 1000
>>>>> with the java driver's auto pagination.
>>>>> It fails when the number of records is high (around 100000) with the
>>>>> following error:
>>>>>
>>>>> Exception in thread "main"
>>>>> com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s)
>>>>> tried for query failed (tried: localhost/127.0.0.1:9042
>>>>> (com.datastax.driver.core.exceptions.DriverException: Timed out waiting
>>>>> for server response))
>>>>>
>>>>>
>>>>> On Wed, Mar 18, 2015 at 4:47 AM, Ali Akhtar <ali.rac200@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> How often does the data change?
>>>>>>
>>>>>> I would still recommend caching of some kind, but without knowing
>>>>>> more details (how often the data is changing, what you're doing with
>>>>>> the 1m rows after getting them, etc) I can't recommend a solution.
>>>>>>
>>>>>> I did see your other thread. I would also vote for elasticsearch /
>>>>>> solr; they are more suited for the kind of analytics you seem to be
>>>>>> doing. Cassandra is more for storing data; it isn't all that great for
>>>>>> complex queries / analytics.
>>>>>>
>>>>>> If you want to stick to cassandra, you might have better luck if you
>>>>>> made your range columns part of the primary key, so something like
>>>>>> PRIMARY KEY(caseId, x, y)
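(As a sketch of that suggestion, using the column names from the table
definition quoted later in this thread; which range columns to cluster on,
and in what order, is an assumption, and non-key columns are elided:)

```sql
-- Hypothetical reworking of images.results so the 2d range columns are
-- clustering keys; non-key columns elided for brevity.
CREATE TABLE images.results (
    image_caseid varchar,
    x double,
    y double,
    uuid uuid,          -- kept as the last clustering column for uniqueness
    PRIMARY KEY ((image_caseid), x, y, uuid)
);

-- This allows a range slice on the first clustering column, e.g.:
-- SELECT * FROM images.results
-- WHERE image_caseid = 'TCGA-HN-A2NL-01Z-00-DX1' AND x > 20 AND x < 100;
```

Note that CQL only permits range restrictions on a contiguous prefix of the
clustering columns, so a simultaneous range on both x and y would still need
application-side filtering or a search index (solr / elasticsearch).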
>>>>>>
>>>>>> On Wed, Mar 18, 2015 at 1:41 PM, Mehak Mehta <memehta@cs.stonybrook.edu>
>>>>>> wrote:
>>>>>>
>>>>>>> The rendering tool renders a portion of a very large image. It may
>>>>>>> fetch different data each time from billions of rows.
>>>>>>> So I don't think I can cache such large results, since the same
>>>>>>> results will rarely be fetched again.
>>>>>>>
>>>>>>> Also, do you know how I can do 2d range queries using Cassandra? Some
>>>>>>> other users suggested using Solr.
>>>>>>> But is there any way I can achieve that without using any other
>>>>>>> technology?
>>>>>>>
>>>>>>> On Wed, Mar 18, 2015 at 4:33 AM, Ali Akhtar <ali.rac200@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Sorry, meant to say "that way when you have to render, you can
>>>>>>>> just display the latest cache."
>>>>>>>>
>>>>>>>> On Wed, Mar 18, 2015 at 1:30 PM, Ali Akhtar <ali.rac200@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I would probably do this in a background thread and cache the
>>>>>>>>> results, that way when you have to render, you can just cache the
>>>>>>>>> latest results.
>>>>>>>>>
>>>>>>>>> I don't know why Cassandra can't seem to be able to fetch large
>>>>>>>>> batch sizes, I've also run into these timeouts but reducing the
>>>>>>>>> batch size to 2k seemed to work for me.
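(Why a smaller fetch size helps: the driver's automatic paging turns one
logical query into a sequence of page-sized round trips, and each round trip
must individually finish inside the server's read timeout. A standalone sketch
of that arithmetic follows; the row count is the figure from earlier in the
thread, the fetch sizes are just examples, and no driver or live cluster is
involved. With the 2.x java driver the actual knob would be
Statement.setFetchSize(int), or QueryOptions.setFetchSize for a default.)

```java
// Standalone sketch: how driver-side paging slices one logical query into
// page-sized round trips. Row count and fetch sizes are illustrative; no
// DataStax driver or live cluster is involved.
public class PagingSketch {

    // Number of server round trips needed to stream totalRows at a given
    // fetch size (each round trip must finish within the read timeout).
    static long roundTrips(long totalRows, int fetchSize) {
        return (totalRows + fetchSize - 1) / fetchSize; // ceiling division
    }

    public static void main(String[] args) {
        long totalRows = 150_000; // figure mentioned earlier in the thread
        for (int fetchSize : new int[] {100, 2_000, 10_000}) {
            System.out.println("fetchSize=" + fetchSize
                    + " -> roundTrips=" + roundTrips(totalRows, fetchSize));
        }
    }
}
```

A larger fetch size means fewer round trips but more work per page, so a page
that scans many rows (or tombstones) can blow past the server-side
read_request_timeout_in_ms even though a smaller page completes comfortably.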
>>>>>>>>>
>>>>>>>>> On Wed, Mar 18, 2015 at 1:24 PM, Mehak Mehta <
>>>>>>>>> memehta@cs.stonybrook.edu> wrote:
>>>>>>>>>
>>>>>>>>>> We have a UI interface which needs this data for rendering.
>>>>>>>>>> So the efficiency of pulling this data matters a lot. It should be
>>>>>>>>>> fetched within a minute.
>>>>>>>>>> Is there a way to achieve such efficiency?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 18, 2015 at 4:06 AM, Ali Akhtar <ali.rac200@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Perhaps just fetch them in batches of 1000 or 2000? For 1m rows,
>>>>>>>>>>> it seems like the difference would only be a few minutes. Do you
>>>>>>>>>>> have to do this all the time, or only once in a while?
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 18, 2015 at 12:34 PM, Mehak Mehta <
>>>>>>>>>>> memehta@cs.stonybrook.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, it works for 1000 but not more than that.
>>>>>>>>>>>> How can I fetch all rows using this efficiently?
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Mar 18, 2015 at 3:29 AM, Ali Akhtar <
>>>>>>>>>>>> ali.rac200@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Have you tried a smaller fetch size, such as 5k - 2k ?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 12:22 PM, Mehak Mehta <
>>>>>>>>>>>>> memehta@cs.stonybrook.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Jens,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have tried with a fetch size of 10000; still it's not giving
>>>>>>>>>>>>>> any results.
>>>>>>>>>>>>>> My expectations were that Cassandra can handle a million rows
>>>>>>>>>>>>>> easily.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there any mistake in the way I am defining the keys or
>>>>>>>>>>>>>> querying them?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Mehak
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 3:02 AM, Jens Rantil <
>>>>>>>>>>>>>> jens.rantil@tink.se> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Try setting fetchsize before querying. Assuming you don't
>>>>>>>>>>>>>>> set it too high, and you don't have too many tombstones, that
>>>>>>>>>>>>>>> should do it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Jens
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> –
>>>>>>>>>>>>>>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 2:58 AM, Mehak Mehta <
>>>>>>>>>>>>>>> memehta@cs.stonybrook.edu> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have a requirement to fetch a million rows as the result of
>>>>>>>>>>>>>>>> my query, which is giving timeout errors.
>>>>>>>>>>>>>>>> I am fetching results by selecting clustering columns, so why
>>>>>>>>>>>>>>>> are the queries taking so long? I can change the timeout
>>>>>>>>>>>>>>>> settings, but I need the data to be fetched faster as per my
>>>>>>>>>>>>>>>> requirement.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My table definition is:
>>>>>>>>>>>>>>>> *CREATE TABLE images.results (uuid uuid,
>>>>>>>>>>>>>>>> analysis_execution_id varchar, analysis_execution_uuid uuid,
>>>>>>>>>>>>>>>> x double, y double, loc varchar, w double, h double,
>>>>>>>>>>>>>>>> normalized varchar, type varchar, filehost varchar,
>>>>>>>>>>>>>>>> filename varchar, image_uuid uuid, image_uri varchar,
>>>>>>>>>>>>>>>> image_caseid varchar, image_mpp_x double, image_mpp_y double,
>>>>>>>>>>>>>>>> image_width double, image_height double, objective double,
>>>>>>>>>>>>>>>> cancer_type varchar, Area float, submit_date timestamp,
>>>>>>>>>>>>>>>> points list<double>,
>>>>>>>>>>>>>>>> PRIMARY KEY ((image_caseid), Area, uuid));*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here each row is uniquely identified on the basis of a unique
>>>>>>>>>>>>>>>> uuid. But since my data is generally queried based upon
>>>>>>>>>>>>>>>> *image_caseid*, I have made it the partition key.
>>>>>>>>>>>>>>>> I am currently using the Java Datastax api to fetch the
>>>>>>>>>>>>>>>> results. But the query is taking a lot of time, resulting in
>>>>>>>>>>>>>>>> timeout errors:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Exception in thread "main"
>>>>>>>>>>>>>>>> com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s)
>>>>>>>>>>>>>>>> tried for query failed (tried: localhost/127.0.0.1:9042
>>>>>>>>>>>>>>>> (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for
>>>>>>>>>>>>>>>> server response))
>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>> com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>> com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>> com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205)
>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>> com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
>>>>>>>>>>>>>>>>  at QueryDB.queryArea(TestQuery.java:59)
>>>>>>>>>>>>>>>>  at TestQuery.main(TestQuery.java:35)
>>>>>>>>>>>>>>>> Caused by:
>>>>>>>>>>>>>>>> com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s)
>>>>>>>>>>>>>>>> tried for query failed (tried: localhost/127.0.0.1:9042
>>>>>>>>>>>>>>>> (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for
>>>>>>>>>>>>>>>> server response))
>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>> com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108)
>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>> com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179)
>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>>>>>>>>>>  at
>>>>>>>>>>>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>>>>>>>>>>  at java.lang.Thread.run(Thread.java:744)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also, when I try the same query on the console, even while
>>>>>>>>>>>>>>>> using a limit of 2000 rows:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> cqlsh:images> select count(*) from results where
>>>>>>>>>>>>>>>> image_caseid='TCGA-HN-A2NL-01Z-00-DX1' and Area<100 and
>>>>>>>>>>>>>>>> Area>20 limit 2000;
>>>>>>>>>>>>>>>> errors={}, last_host=127.0.0.1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>>>>>>> Mehak
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
