hive-user mailing list archives

From "Aggarwal, Vaibhav" <vagg...@amazon.com>
Subject RE: org.apache.hadoop.fs.ChecksumException: Checksum error:
Date Sat, 20 Aug 2011 00:57:52 GMT
This is a really curious case.

How many replicas of each block do you have?
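
If you are not sure, running fsck on the table's warehouse directory (path taken from
your stack trace) should report the replication factor and locations of each block:

     hadoop fsck /user/hive/warehouse/att_log -files -blocks -locations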

Are you able to copy the data directly using the HDFS client?
You could try the hadoop fs -copyToLocal command and see if it copies the data from
HDFS correctly.
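
For example, using the partition path from your stack trace (adjust to your table):

     hadoop fs -copyToLocal /user/hive/warehouse/att_log/collect_time=1313592519963/load.dat /tmp/load.dat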

That would help you verify that the issue really is at the HDFS layer (though it does
look like that from the stack trace).

Which file format are you using?
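
If you are not sure, describing the table (att_log, going by the warehouse path in
your stack trace) will show the InputFormat and SerDe that Hive is using:

     hive -e 'DESCRIBE EXTENDED att_log;'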

Thanks
Vaibhav

-----Original Message-----
From: W S Chung [mailto:qp.wschung@gmail.com] 
Sent: Friday, August 19, 2011 3:26 PM
To: user@hive.apache.org
Subject: org.apache.hadoop.fs.ChecksumException: Checksum error:

For some reason, my question sent two days ago never showed up on the list, even though
I can find it on Google. I apologize if you have seen this question before.

After loading around 2 GB or so of data in a few files into Hive, the "select count(*)
from table" query keeps failing. The JobTracker UI gives the following error:

     org.apache.hadoop.fs.ChecksumException: Checksum error:
/blk_8155249261522439492:of:/user/hive/warehouse/att_log/collect_time=1313592519963/load.dat
at 51794944
       at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
       at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
       at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
       at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
       at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1660)
       at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2257)
       at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2307)
       at java.io.DataInputStream.read(DataInputStream.java:83)
       at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
       at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
       at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
       at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:66)
       at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:32)
       at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:67)
       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
       at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
       at org.apache.hadoop.mapred.Child.main(Child.java:159)

fsck reports that there are corrupted blocks. I have tried dropping the table and
reloading a few times. As far as I can see, the behavior is somewhat different every
time, in terms of how many blocks are corrupted and how many files I have loaded before
the corrupted blocks appear. Sometimes the corrupted blocks show up right after the data
is loaded, and sometimes only after the "select count(*)" query is run. I have tried
setting "io.skip.checksum.errors" to true, but it has no effect at all.
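
For reference, I set it along these lines in the Hive session before re-running the
query (table name att_log, per the warehouse path above):

     set io.skip.checksum.errors=true;
     select count(*) from att_log;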

I know that a checksum error is usually an indication of a hardware problem. But we are
running Hive on an NFS cluster and have ECC memory.
Our system admin here is not willing to believe that our high-quality hardware has so
many issues. I did try installing a simpler single-node Hive on another machine, and the
problem does not appear in that install after the data is loaded. Can someone give me
some pointers on what else to try? Thanks.
