cassandra-commits mailing list archives

From "Aaron Morton (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-4223) Non Unique Streaming session ID's
Date Mon, 07 May 2012 21:46:49 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270037#comment-13270037
] 

Aaron Morton commented on CASSANDRA-4223:
-----------------------------------------

Did some more testing. On the stormondemand.com machines it looks like nanoTime() is only updated
about every 10ms (successive values in the output below differ by 10,000,062 ns). On EC2 nodes
nanoTime() was always unique. The test script is attached, example output is below...

{code}
root@db6:~/aaron# java -classpath ./ NanoTest 100
nanoTime 460519702418991 occurred 5176 times
nanoTime 460519712419053 occurred 12090 times
nanoTime 460519722419115 occurred 22602 times
nanoTime 460519732419177 occurred 36154 times
nanoTime 460519742419239 occurred 36089 times
nanoTime 460519752419301 occurred 36866 times
nanoTime 460519762419363 occurred 36997 times
nanoTime 460519772419425 occurred 36763 times
nanoTime 460519782419487 occurred 36910 times
nanoTime 460519792419549 occurred 35481 times
Ran for 100 milliseconds, got 295128 duplicates and 11 uniques.
{code}
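
The attached script is not reproduced in this message. As a rough sketch only, a probe that produces output of a similar shape could look like the following. The class name NanoTimeProbe, the method names, and the way repeats are counted are assumptions made for illustration, and are not taken from the attached NanoTest.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the attached test script, which is not shown here.
// Class name, method names and output wording are assumptions.
final class NanoTimeProbe {

    // Records one observed value, incrementing its occurrence count.
    static void record(Map<Long, Integer> counts, long value) {
        Integer seen = counts.get(value);
        counts.put(value, seen == null ? 1 : seen + 1);
    }

    // Calls System.nanoTime() in a tight loop for roughly runMillis of wall-clock time.
    static Map<Long, Integer> sampleFor(long runMillis) {
        Map<Long, Integer> counts = new LinkedHashMap<Long, Integer>();
        long deadline = System.currentTimeMillis() + runMillis;
        while (System.currentTimeMillis() < deadline) {
            record(counts, System.nanoTime());
        }
        return counts;
    }

    // Number of calls that returned a value already seen earlier in the run.
    static long repeatedCalls(Map<Long, Integer> counts) {
        long repeats = 0;
        for (int n : counts.values()) {
            repeats += n - 1;
        }
        return repeats;
    }

    public static void main(String[] args) {
        long runMillis = args.length > 0 ? Long.parseLong(args[0]) : 100L;
        Map<Long, Integer> counts = sampleFor(runMillis);
        for (Map.Entry<Long, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                System.out.println("nanoTime " + e.getKey() + " occurred " + e.getValue() + " times");
            }
        }
        System.out.println("Ran for " + runMillis + " ms: " + counts.size()
                + " distinct values, " + repeatedCalls(counts) + " repeated calls.");
    }
}
```

On a host where the clock advances on every call, this should print only the summary line, with zero repeated calls.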

If it takes 10ms to get a unique value, calling nanoTime() repeatedly until it changes is out
of the question. Will put my thinking hat back on.
                
> Non Unique Streaming session ID's
> ---------------------------------
>
>                 Key: CASSANDRA-4223
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4223
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.9
>         Environment: Ubuntu 10.04.2 LTS
> java version "1.6.0_24"
> Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
> Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
> "Bare metal" servers from https://www.stormondemand.com/servers/baremetal.html 
> The servers run on a custom hypervisor.
>  
>            Reporter: Aaron Morton
>            Assignee: Aaron Morton
>              Labels: datastax_qa
>         Attachments: fmm streaming bug.txt
>
>
> I have observed repair processes failing due to duplicate Streaming session IDs. In
this installation it is preventing rebalance from completing. I believe it has also prevented
repair from completing in the past.
> The attached streaming-logs.txt file contains log messages and an explanation of what
was happening during a repair operation. It contains the evidence for the duplicate session IDs.
> The duplicate session IDs were generated on the repairing node and sent to the streaming
node. The streaming source replaced the first session with the second, which resulted in both
sessions failing when the first FILE_COMPLETE message was received.
> The errors were:
> {code:java}
> DEBUG [MiscStage:1] 2012-05-03 21:40:33,997 StreamReplyVerbHandler.java (line 47) Received
StreamReply StreamReply(sessionId=26132848816442266, file='/var/lib/cassandra/data/FMM_Studio/PartsData-hc-1-Data.db',
action=FILE_FINISHED)
> ERROR [MiscStage:1] 2012-05-03 21:40:34,027 AbstractCassandraDaemon.java (line 139) Fatal
exception in thread Thread[MiscStage:1,5,main]
> java.lang.IllegalStateException: target reports current file is /var/lib/cassandra/data/FMM_Studio/PartsData-hc-1-Data.db
but is null
>         at org.apache.cassandra.streaming.StreamOutSession.validateCurrentFile(StreamOutSession.java:195)
>         at org.apache.cassandra.streaming.StreamReplyVerbHandler.doVerb(StreamReplyVerbHandler.java:58)
>         at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>         at java.lang.Thread.run(Unknown Source)
> {code}
> and
> {code:java}
> DEBUG [MiscStage:2] 2012-05-03 21:40:36,497 StreamReplyVerbHandler.java (line 47) Received
StreamReply StreamReply(sessionId=26132848816442266, file='/var/lib/cassandra/data/OpsCenter/rollups7200-hc-3-Data.db',
action=FILE_FINISHED)
> ERROR [MiscStage:2] 2012-05-03 21:40:36,497 AbstractCassandraDaemon.java (line 139) Fatal
exception in thread Thread[MiscStage:2,5,main]
> java.lang.IllegalStateException: target reports current file is /var/lib/cassandra/data/OpsCenter/rollups7200-hc-3-Data.db
but is null
>         at org.apache.cassandra.streaming.StreamOutSession.validateCurrentFile(StreamOutSession.java:195)
>         at org.apache.cassandra.streaming.StreamReplyVerbHandler.doVerb(StreamReplyVerbHandler.java:58)
>         at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>         at java.lang.Thread.run(Unknown Source)
> {code}
> I think this is because System.nanoTime() is used for the session ID when creating the
StreamInSession objects (driven from StorageService.requestRanges()).
> From the documentation (http://docs.oracle.com/javase/6/docs/api/java/lang/System.html#nanoTime()):
> {quote}
> This method provides nanosecond precision, but not necessarily nanosecond accuracy. No
guarantees are made about how frequently values change. 
> {quote}
> Also some info here on clocks and timers https://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks
> The hypervisor may be at fault here. But it seems like we cannot rely on successive calls
to nanoTime() to return different values. 
> To avoid message/interface changes on the StreamHeader it would be good to keep the session
ID a long. The simplest approach may be to make successive calls to nanoTime until the result
changes. We could fail if a certain number of milliseconds have passed. 
> Hashing the file names and ranges is also a possibility, but more involved. 
> (We may also want to drop latency times that are 0 nanoseconds.)
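
For illustration only, one way to keep the session ID a long without waiting for nanoTime() to change is to remember the last ID issued and fall back to last + 1 when the clock has not advanced. This is a sketch under stated assumptions, not the fix adopted for this issue: SessionIdGenerator is a hypothetical class that does not exist in Cassandra, and it only guarantees uniqueness within a single JVM process for the lifetime of that process.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not part of Cassandra. Issues strictly increasing long IDs
// even when the supplied clock value repeats or goes backwards.
final class SessionIdGenerator {

    private final AtomicLong last = new AtomicLong(Long.MIN_VALUE);

    // Convenience overload that reads the clock itself.
    long next() {
        return next(System.nanoTime());
    }

    // Returns 'now' if it is greater than the last ID issued, otherwise last + 1.
    // The compare-and-set loop keeps this safe when several threads ask at once.
    // Overflow at Long.MAX_VALUE is ignored in this sketch.
    long next(long now) {
        while (true) {
            long prev = last.get();
            long candidate = now > prev ? now : prev + 1;
            if (last.compareAndSet(prev, candidate)) {
                return candidate;
            }
        }
    }
}
```

With a clock that is stuck at 100, three calls would yield 100, 101 and 102, and the next call after the clock jumps to 500 would yield 500. Since nanoTime() has an arbitrary origin, IDs could still repeat across process restarts or between different nodes.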
