cassandra-commits mailing list archives

From "Stefania (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-10938) test_bulk_round_trip_blogposts is failing occasionally
Date Mon, 04 Jan 2016 12:20:39 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15080872#comment-15080872 ]

Stefania edited comment on CASSANDRA-10938 at 1/4/16 12:19 PM:
---------------------------------------------------------------

The flight recorder file attached, _recording_127.0.0.1.jfr_, provides the best information
to understand the problem: about 15 shared pool worker threads are busy copying the {{NonBlockingHashMap}}
that we use to store the query states in {{ServerConnection}}. This consumes 99% of the CPU
on the machine (note that I lowered the priority of the process when I recorded that file).

We store one entry per stream id and never clean this map, but that is not the issue. When
inserting data with cassandra-stress we use up to 33k stream ids, whilst with COPY FROM the
python driver is careful to reuse stream ids and we only use around 300 of them. So the map
should be resized far less, and yet the problem occurs with COPY FROM (approximately once
every twenty times) and never with cassandra-stress. The difference between the two is probably
that COPY FROM has more concurrent requests, hence a higher concurrency level on the map.
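The reuse strategy described above can be sketched as a small id pool; this is a hypothetical illustration (class and method names are mine, not the python driver's): freed stream ids are handed out again, so the number of distinct ids stays near the number of in-flight requests rather than growing toward the 32k limit.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of stream id reuse: a request borrows an id and
// returns it when the response arrives, so concurrent requests recycle
// a small pool instead of marching through the whole stream id range.
public class StreamIdPool {
    private final Deque<Integer> free = new ArrayDeque<>();
    private int highWaterMark = 0;  // count of distinct ids ever handed out

    public synchronized int borrow() {
        return free.isEmpty() ? highWaterMark++ : free.pop();
    }

    public synchronized void release(int id) {
        free.push(id);
    }

    public synchronized int distinctIdsUsed() {
        return highWaterMark;
    }

    public static void main(String[] args) {
        StreamIdPool pool = new StreamIdPool();
        int a = pool.borrow();   // 0
        int b = pool.borrow();   // 1
        pool.release(a);
        pool.borrow();           // reuses a instead of allocating 2
        System.out.println(pool.distinctIdsUsed()); // prints 2
    }
}
```

With ~300 requests in flight at a time, such a pool never hands out more than ~300 distinct ids, which matches the behaviour observed with COPY FROM.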

Of all the hot threads in the flight recorder file, only one is doing a {{putIfAbsent}} whilst
the others are simply accessing a value via a {{get}}. However, the map is designed so
that all threads help with the copy, and that is what's happening here. I suspect a bug that
prevents threads from making progress and keeps them spinning.
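The "helping" behaviour can be illustrated with a heavily simplified, hypothetical sketch (this is not the real {{NonBlockingHashMap}} algorithm, which re-hashes entries and uses CAS on individual slots): while a resize is pending, any thread touching the map, readers included, copies a small batch of slots before doing its own operation.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified illustration of cooperative resizing: while a migration is
// in progress, every accessor (even a plain get) copies a few slots from
// the old table to the new one. All names here are illustrative.
public class HelpingTable {
    Object[] oldTab;        // table being migrated (null when no resize)
    Object[] newTab;
    final AtomicInteger next = new AtomicInteger(); // next slot to copy

    public HelpingTable(int size) { this.newTab = new Object[size]; }

    public void put(int slot, Object v) { newTab[slot] = v; }

    public void startResize() {
        oldTab = newTab;
        newTab = new Object[oldTab.length * 2];
        next.set(0);
    }

    // Every accessor calls this; it migrates up to `batch` slots.
    void helpCopy(int batch) {
        Object[] src = oldTab;
        if (src == null) return;                      // no resize pending
        int i;
        while (batch-- > 0 && (i = next.getAndIncrement()) < src.length) {
            newTab[i] = src[i];                       // real code re-hashes here
        }
        if (next.get() >= src.length) oldTab = null;  // migration finished
    }

    public Object get(int slot) {
        helpCopy(4);                                  // readers help too
        Object[] src = oldTab;
        return (src != null && slot < src.length && src[slot] != null)
                ? src[slot] : newTab[slot];
    }
}
```

This is why roughly 15 threads doing nothing but {{get}} can all end up burning CPU on a single resize: the design deliberately drafts them into the copy, and a bug in the migration's completion check would leave them spinning there.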

We are currently using the latest available version of {{NonBlockingHashMap}}, version 1.0.6,
from [this repository|https://github.com/boundary/high-scale-lib].

We have a number of options:

- Fix {{NonBlockingHashMap}}
- Replace it
- Instantiate it with an initial size to prevent resizing (4K fixes this specific case). 
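The third option amounts to a one-line change; here is a sketch using {{ConcurrentHashMap}} as a self-contained stand-in ({{NonBlockingHashMap}} offers an equivalent initial-size constructor), with 4096 being the figure that fixed this specific case:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of pre-sizing the per-connection query state map. Sizing it to
// cover the full range of stream ids in use means the table never needs
// to grow, so no thread is ever drafted into helping copy it.
public class PresizedStates {
    static final ConcurrentHashMap<Integer, Object> queryStates =
            new ConcurrentHashMap<>(4096);

    public static void main(String[] args) {
        // One entry per stream id, as in ServerConnection.
        for (int streamId = 0; streamId < 300; streamId++) {
            queryStates.putIfAbsent(streamId, new Object());
        }
        System.out.println(queryStates.size()); // prints 300
    }
}
```

This works around the symptom without addressing the suspected bug, so it is only a stopgap if we keep {{NonBlockingHashMap}}.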




> test_bulk_round_trip_blogposts is failing occasionally
> ------------------------------------------------------
>
>                 Key: CASSANDRA-10938
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10938
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x
>
>         Attachments: 6452.nps, 6452.png, 7300.nps, 7300a.png, 7300b.png, node1_debug.log,
node2_debug.log, node3_debug.log, recording_127.0.0.1.jfr
>
>
> We get timeouts occasionally that cause the number of records to be incorrect:
> http://cassci.datastax.com/job/trunk_dtest/858/testReport/cqlsh_tests.cqlsh_copy_tests/CqlshCopyTest/test_bulk_round_trip_blogposts/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
