hbase-user mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: hbase-0.89/trunk: org.apache.hadoop.fs.ChecksumException: Checksum error
Date Wed, 22 Sep 2010 09:38:02 GMT
So the client code looks good; it's hard to say what exactly is going on.

BTW I opened this JIRA:
https://issues.apache.org/jira/browse/HBASE-3029

To address the confusing exception in this case.
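
For anyone following along, the client-side pattern discussed in the quoted messages below (treat a scanner as dead after any error, and reopen only after a scanner timeout) can be sketched roughly as follows. This is a minimal, language-neutral sketch in Python, not HBase client code: `ScannerTimeout`, `resilient_scan`, `open_scanner`, `after_row` and `max_reopens` are all hypothetical names, and `open_scanner(after_row)` is assumed to return an iterator over rows whose keys sort strictly after `after_row`.

```python
class ScannerTimeout(Exception):
    """Hypothetical stand-in for a scanner lease timeout."""


def resilient_scan(open_scanner, after_row=None, max_reopens=3):
    """Yield (row_key, value) pairs, reopening the scanner only on timeout.

    open_scanner(after_row) is an assumed factory: it returns an iterator
    over rows with keys strictly greater than after_row (None = from start).
    """
    last_row = after_row
    reopens = 0
    while True:
        scanner = open_scanner(last_row)
        try:
            for row_key, value in scanner:
                last_row = row_key  # remember progress for a possible reopen
                yield row_key, value
            return  # scanner exhausted normally
        except ScannerTimeout:
            # Only a timeout is treated as recoverable: open a fresh scanner
            # positioned after the last row that was delivered.
            reopens += 1
            if reopens > max_reopens:
                raise
        finally:
            # Any other exception (for example a checksum error) propagates
            # to the caller; the failed scanner is closed and never reused.
            close = getattr(scanner, "close", None)
            if close is not None:
                close()
```

The point of the design is that next() is never called again on a scanner that has already thrown; whether resuming after the last delivered row is safe depends on the application.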

It's hard to say why you get that exception under load... some systems
have been known to give weird flaky faults under load.  Compiling the
Linux kernel used to be a simple stress test for RAM problems.
If you have time you could try memtest86 to see if the memory has
issues, since that is a common source of errors.

-ryan

On Wed, Sep 22, 2010 at 2:29 AM, Andrey Stepachev <octo47@gmail.com> wrote:
> One more note. This database was on 0.20.6 before. Then
> I started 0.89 over it.
> (But the table with the wrong checksum was created in 0.89 HBase.)
>
> 2010/9/22 Andrey Stepachev <octo47@gmail.com>:
>> 2010/9/22 Ryan Rawson <ryanobjc@gmail.com>:
>>> why are you using such expensive disks?  raid + hdfs = lower
>>> performance than non-raid.
>>
>> It was a database server before we migrated to HBase. It was designed
>> for PostgreSQL. Now, with compression and the nature of HBase, our database
>> is 12GB instead of 180GB in pg.
>> So this server was not designed for HBase.
>> In production (0.20.6) we have much lighter servers (3) with simple dual
>> SATA drives.
>>
>>>
>>> how's your ram?  hows your network switches?  NICs?  etc etc.
>>> anything along the data path can introduce errors.
>>
>> No. Everything is on one machine. 17GB RAM (5GB for HBase).
>>
>>>
>>> in this case we did the right thing and threw exceptions, but looks
>>> like your client continues to call next() despite getting
>>> exceptions... can you check your client code to verify this?
>>
>> Hm, I'll check. But I use only a simple wrapper around ResultScanner:
>> http://pastebin.org/1074628. It should bail out on any exception (except
>> ScannerTimeoutException).
>>
>>>
>>> On Wed, Sep 22, 2010 at 2:14 AM, Andrey Stepachev <octo47@gmail.com> wrote:
>>>> HP ProLiant, RAID 10 with 4 SAS 15k drives, Smart Array 6i, 2 CPU/4 core.
>>>>
>>>> 2010/9/22 Ryan Rawson <ryanobjc@gmail.com>:
>>>>> generally checksum errors are due to hardware faults of one kind or another.
>>>>>
>>>>> what is your hardware like?
>>>>>
>>>>> On Wed, Sep 22, 2010 at 2:08 AM, Andrey Stepachev <octo47@gmail.com> wrote:
>>>>>> But why is it bad? Split/compaction? I made my own RetryResultIterator
>>>>>> which reopens the scanner on timeout. But what is the best way to reopen a scanner?
>>>>>> Can you point me to where I can find all these exceptions? Or maybe
>>>>>> there is already some sort of recoverable iterator?
>>>>>>
>>>>>> 2010/9/22 Ryan Rawson <ryanobjc@gmail.com>:
>>>>>>> ah ok i think i get it... basically at this point your scanner is bad
>>>>>>> and iterating on it again won't work.  the scanner should probably
>>>>>>> self close itself so you get tons of additional exceptions but instead
>>>>>>> we don't.
>>>>>>>
>>>>>>> there is probably a better fix for this, i'll ponder
>>>>>>>
>>>>>>> On Wed, Sep 22, 2010 at 1:57 AM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>>>>>>>> very strange... looks like a bad block ended up in your scanner and
>>>>>>>> subsequent nexts were failing due to that short read.
>>>>>>>>
>>>>>>>> did you have to kill the regionserver or did things recover and
>>>>>>>> continue normally?
>>>>>>>>
>>>>>>>> -ryan
>>>>>>>>
>>>>>>>> On Wed, Sep 22, 2010 at 1:37 AM, Andrey Stepachev <octo47@gmail.com> wrote:
>>>>>>>>> Hi All.
>>>>>>>>>
>>>>>>>>> I get org.apache.hadoop.fs.ChecksumException for a table under heavy
>>>>>>>>> write load in standalone mode.
>>>>>>>>> The table tmp.bsn.main was created at 2010-09-22 10:42:28,860, and then
>>>>>>>>> 5 threads write data to it.
>>>>>>>>> At some point the exception is thrown.
>>>>>>>>>
>>>>>>>>> Andrey.
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
