cassandra-user mailing list archives

From Erik Forsberg <forsb...@opera.com>
Subject EOFException in bulkloader, then IllegalStateException
Date Mon, 27 Jan 2014 11:56:57 GMT
Hi!

I'm bulkloading from Hadoop to Cassandra. I'm currently in the process of moving both
Hadoop and Cassandra to new hardware, and while test-running a bulkload I see the
following error (how the job is set up is sketched right after the trace):

Exception in thread "Streaming to /2001:4c28:1:413:0:1:1:12:1" java.lang.RuntimeException: java.io.EOFException
	at com.google.common.base.Throwables.propagate(Throwables.java:155)
	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.EOFException
	at java.io.DataInputStream.readInt(DataInputStream.java:375)
	at org.apache.cassandra.streaming.FileStreamTask.receiveReply(FileStreamTask.java:193)
	at org.apache.cassandra.streaming.FileStreamTask.stream(FileStreamTask.java:180)
	at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91)
	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
	... 3 more
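
For reference, the map tasks write through org.apache.cassandra.hadoop.BulkOutputFormat,
and the job is configured roughly like the sketch below. This is illustrative, not my exact
code: the keyspace/column family names come from the paths in the traces, while the
partitioner, contact address and rpc port are placeholders.

import org.apache.cassandra.hadoop.BulkOutputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class BulkLoadJob {
    public static Job configure(Configuration conf) throws Exception {
        // Target keyspace / column family (names taken from the traces).
        ConfigHelper.setOutputColumnFamily(conf, "iceland_test", "Data_hourly");
        ConfigHelper.setOutputPartitioner(conf, "org.apache.cassandra.dht.Murmur3Partitioner");
        // One contact node for discovering the ring; each mapper then streams
        // the sstables it built directly to the owning Cassandra nodes.
        ConfigHelper.setOutputInitialAddress(conf, "2001:4c28:1:413:0:1:1:12");
        ConfigHelper.setOutputRpcPort(conf, "9160");

        Job job = new Job(conf, "bulkload-to-cassandra");
        // BulkRecordWriter (the class in the stack above) builds sstables
        // locally and streams them to the cluster in close().
        job.setOutputFormatClass(BulkOutputFormat.class);
        return job;
    }
}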

I see no exceptions related to this on the destination node 
(2001:4c28:1:413:0:1:1:12:1).
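
From the stack it looks like FileStreamTask.receiveReply is just blocking on a readInt
for the stream reply, and DataInputStream.readInt throws EOFException as soon as it hits
end-of-stream, i.e. the destination closed the socket without ever sending a reply. A
minimal illustration of that behaviour (my own sketch, not Cassandra code):

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;

public class ReadIntEof {
    public static void main(String[] args) throws Exception {
        // A reply stream that was closed before any bytes arrived: readInt()
        // needs 4 bytes, finds end-of-stream instead, and throws EOFException,
        // which FileStreamTask then wraps in a RuntimeException.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(new byte[0]));
        try {
            in.readInt();
        } catch (EOFException e) {
            System.out.println("EOF before any reply header was read: " + e);
        }
    }
}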

This makes the whole map task fail with:

2014-01-27 10:46:50,878 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
as:forsberg (auth:SIMPLE) cause:java.io.IOException: Too many hosts failed: [/2001:4c28:1:413:0:1:1:12]
2014-01-27 10:46:50,878 WARN org.apache.hadoop.mapred.Child: Error running child
java.io.IOException: Too many hosts failed: [/2001:4c28:1:413:0:1:1:12]
	at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:244)
	at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:209)
	at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:540)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:650)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
	at org.apache.hadoop.mapred.Child.main(Child.java:260)
2014-01-27 10:46:50,880 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task

The failed task was on hadoop worker node hdp01-12-4.

However, hadoop later retries this map task on a different hadoop worker node (hdp01-10-2),
and that retry succeeds.

So that's weird, but I could live with it. Now, however, comes the real trouble: the Hadoop
job does not finish, because one task running on hdp01-12-4 is stuck with this:

Exception in thread "Streaming to /2001:4c28:1:413:0:1:1:12:1" java.lang.IllegalStateException:
target reports current file is /opera/log2/hadoop/mapred/local/taskTracker/forsberg/jobcache/job_201401161243_0288/attempt_201401161243_0288_m_000473_0/work/tmp/iceland_test/Data_hourly/iceland_test-Data_hourly-ib-1-Data.db
but is /opera/log6/hadoop/mapred/local/taskTracker/forsberg/jobcache/job_201401161243_0288/attempt_201401161243_0288_m_000000_0/work/tmp/iceland_test/Data_hourly/iceland_test-Data_hourly-ib-1-Data.db
	at org.apache.cassandra.streaming.StreamOutSession.validateCurrentFile(StreamOutSession.java:154)
	at org.apache.cassandra.streaming.StreamReplyVerbHandler.doVerb(StreamReplyVerbHandler.java:45)
	at org.apache.cassandra.streaming.FileStreamTask.receiveReply(FileStreamTask.java:199)
	at org.apache.cassandra.streaming.FileStreamTask.stream(FileStreamTask.java:180)
	at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91)
	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:662)

This just sits there forever, or at least until the hadoop task timeout kicks in.

So two questions here:

1) Any clues on what might cause the first EOFException? It appears for *some* of my
bulkloads - not all, but frequently enough to be a problem; roughly every 10th bulkload
I run seems to hit it.

2) I have a feeling the second problem could be related to https://issues.apache.org/jira/browse/CASSANDRA-4223,
but with the extra quirk that in the bulkload case we have *multiple Java processes* creating
streaming sessions on the same host, so streaming session IDs are not unique.

I'm thinking 2) happens because the EOFException made the streaming session in 1) sit around
on the target node without being closed.
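
To spell out what I mean about the session IDs: as I read it (and as CASSANDRA-4223
describes), the session id is just a counter local to each JVM, and the receiving node
keys incoming streams by (source host, session id). With several mapper JVMs on the same
tasktracker host all streaming to the same Cassandra node, two of them can easily hand out
the same id. Roughly (my own sketch of the collision, not the actual Cassandra code):

import java.util.concurrent.atomic.AtomicLong;

public class SessionIdCollision {
    // Stand-in for the per-JVM session id counter each bulk-loading mapper has.
    static class LoaderJvm {
        private final AtomicLong sessionIdCounter = new AtomicLong(0);
        long newSessionId() { return sessionIdCounter.incrementAndGet(); }
    }

    public static void main(String[] args) {
        // Two mapper JVMs on the same tasktracker host, each with its own counter.
        LoaderJvm mapperA = new LoaderJvm();
        LoaderJvm mapperB = new LoaderJvm();

        long idA = mapperA.newSessionId(); // 1
        long idB = mapperB.newSessionId(); // also 1

        // The target keys sessions by (source host, session id), so these two
        // streams collide, and a stale session left open by the earlier
        // EOFException can then be validated against the other mapper's files,
        // which would explain the "target reports current file is X but is Y" error.
        System.out.println("same (host, sessionId) pair: " + (idA == idB)); // true
    }
}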

This is on Cassandra 1.2.1. I know that's pretty old, but I would like to avoid upgrading
until I have made this migration from old to new hardware. Upgrading to 1.2.13 might be an
option.

Any hints welcome.

Thanks,
\EF