hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: Rolling out Hadoop/HBase updates
Date Mon, 05 Jul 2010 17:33:47 GMT
On Sun, Jul 4, 2010 at 10:36 AM, Dan Harvey <dan.harvey@mendeley.com> wrote:
> Just looked into hdfs630, and it looks like it was added in
> cdh2 0.20.1+169.89; we're currently on 0.20.1+169.68. So would it help
> prevent some of these issues if we updated to that so we have the patch?
>

For sure Dan.  HDFS-630 will help at a minimum.
St.Ack


> Thanks,
>
> On 4 July 2010 18:12, Dan Harvey <dan.harvey@mendeley.com> wrote:
>
>> Hey,
>>
>> We're using stock CDH2 without any patches, so I'm not sure if we have
>> hdfs630 or not. For HBase we're currently on 0.20.3, and will be testing
>> and moving to 0.20.5 soon.
>>
>> What I did with this rollout of just config changes was take one region
>> server down at a time and restart the datanode on the same server. From
>> what I gather, I should have shut down all the region servers before
>> restarting any of the datanodes?
>>
>> I guess if I split it into different parts it would be :-
>>
>> - HBase Rolling update for point/config releases is supported
>>   - Update masters first
>>   - Then update region servers in turn
>>
>> - HDFS Data nodes don't support rolling updates? (Maybe better in the hdfs
>> list I guess)
>>   - Take down HBase
>>   - Take down datanodes
>>   - Update all the datanodes code/configs
>>   - Start datanodes
>>   - Start HBase
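
[The steps above can be sketched as a script. This is a hypothetical dry-run outline, not anything from the thread: the NODES list and the init-script/start-stop paths are placeholders loosely based on a stock CDH2/HBase 0.20 layout, so adjust them for your installation.]

```shell
#!/bin/sh
# Dry-run sketch of the full-stop update sequence above (hypothetical paths).
NODES="node1 node2 node3"    # placeholder hostnames
PLAN=""

run() {
  # Record and echo each command instead of executing it;
  # drop this wrapper to run the sequence for real.
  PLAN="$PLAN $*;"
  echo "would run: $*"
}

# 1. Take down HBase first, so region servers never lose HDFS underneath them.
run /usr/lib/hbase/bin/stop-hbase.sh

# 2. Stop the datanodes, update code/configs on each, then start them again.
for n in $NODES; do run ssh "$n" "/etc/init.d/hadoop-0.20-datanode stop"; done
for n in $NODES; do run ssh "$n" "/etc/init.d/hadoop-0.20-datanode start"; done

# 3. Bring HBase back only once HDFS is up again.
run /usr/lib/hbase/bin/start-hbase.sh
```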
>>
>> Would you be able to let me know which of these I've got right/wrong?
>>
>> Thanks,
>>
>> On 29 June 2010 15:50, Michael Segel <michael_segel@hotmail.com> wrote:
>>
>>>
>>> Dan,
>>>
>>> I don't think you can do that because your 'new/updated' node will clash
>>> with the rest of the cloud.
>>> (We're talking code and not just cloud tuning parameters.) [Read different
>>> jars...]
>>>
>>> If you're going to push an update out, then it has to be an 'all or
>>> nothing' push.
>>>
>>> Since we're using Cloudera's release, moving from CDH2 to CDH3 means a
>>> full backup, taking the cloud down, removing the software completely, and
>>> then installing the new CDH3. Outside of that major switch, if we were
>>> going from one sub-release to another, it would be just a $> yum update
>>> hadoop-0.20 call on each node.
>>> Again, you have to take the cloud down to do that.
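
[The per-node update Mike describes could look like the loop below — a sketch only, run with the cluster already down. The `update_node` name and the hostnames are made up, and the `echo` is left in so nothing actually runs until you remove it.]

```shell
#!/bin/sh
# Sketch of the per-sub-release update, with the whole cluster already stopped.
update_node() {
  # `echo` makes this a dry run; remove it to execute over ssh for real.
  echo ssh "$1" "yum -y update hadoop-0.20"
}

for node in node1 node2 node3; do   # placeholder hostnames
  update_node "$node"
done
```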
>>>
>>> So the bottom line... if you're going to do upgrades, you'll need to plan
>>> for some down time.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> > From: dan.harvey@mendeley.com
>>> > Date: Tue, 29 Jun 2010 14:43:26 +0100
>>> > Subject: Rolling out Hadoop/HBase updates
>>> > To: user@hbase.apache.org
>>> >
>>> > Hey,
>>> >
>>> > I've been thinking about how we do our configuration and code updates
>>> > for Hadoop and HBase, and was wondering what others do and what the
>>> > best practice is to avoid errors with HBase.
>>> >
>>> > Currently we do a rolling update where we restart the services on one
>>> > node at a time: shutting down the region server, then restarting the
>>> > datanode and task trackers depending on what we are updating and what
>>> > has changed. But with this I have occasionally found errors with the
>>> > HBase cluster afterwards due to a corrupt META table, which I think
>>> > could have been caused by restarting the datanode, or maybe by not
>>> > waiting long enough for the cluster to sort out losing a region
>>> > server before moving on to the next.
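
[One way to avoid "not waiting long enough" is to gate each step of a rolling restart on HDFS health. The sketch below uses the Hadoop 0.20 CLI (`hadoop dfsadmin`, `hadoop fsck`); the polling loop and function names are assumptions, not anything from the thread.]

```shell
#!/bin/sh
# Hypothetical gate between rolling-restart steps: wait until HDFS reports
# no under-replicated blocks before touching the next node.

# Parse the under-replicated block count out of `hadoop fsck /` output.
under_replicated() {
  awk -F: '/Under-replicated blocks/ { print $2 + 0 }'
}

wait_for_hdfs() {
  hadoop dfsadmin -safemode wait              # block until out of safe mode
  while :; do
    n=$(hadoop fsck / 2>/dev/null | under_replicated)
    [ "${n:-0}" -eq 0 ] && break              # safe to move on to the next node
    echo "still $n under-replicated blocks; waiting..."
    sleep 10
  done
}
```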
>>> >
>>> > The most recent error upon restarting a node was :-
>>> >
>>> > 2010-06-29 10:46:44,970 ERROR
>>> > org.apache.hadoop.hbase.regionserver.HRegionServer: Error closing
>>> > files,3822b1ea8ae015f3ec932cafaa282dd211d768ad,1275145898366
>>> > java.io.IOException: Filesystem closed
>>> >         at
>>> org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:230)
>>> >
>>> > 2010-06-29 10:46:44,970 FATAL
>>> > org.apache.hadoop.hbase.regionserver.HRegionServer: Shutting down
>>> > HRegionServer: file system not available
>>> > java.io.IOException: File system is not available
>>> >         at
>>> >
>>> org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:129)
>>> >
>>> >
>>> > Followed by this for every region being served :-
>>> >
>>> > 2010-06-29 10:46:44,996 ERROR
>>> > org.apache.hadoop.hbase.regionserver.HRegionServer: Error closing
>>> > documents,082595c0-6d01-11df-936c-0026b95e484c,1275676410202
>>> > java.io.IOException: Filesystem closed
>>> >         at
>>> org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:230)
>>> >
>>> >
>>> > After updating all the nodes all the region server shut down after a
>>> > few minutes reporting the following :-
>>> >
>>> > 2010-06-29 11:21:59,508 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>> > Recovery for block blk_-1437671530216085093_2565663 bad datanode[0]
>>> > 10.0.11.4:50010
>>> >
>>> > 2010-06-29 11:22:09,481 FATAL org.apache.hadoop.hbase.regionserver.HLog:
>>> > Could not append. Requesting close of hlog
>>> > java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
>>> >         at
>>> >
>>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)
>>> >
>>> >
>>> > 2010-06-29 11:22:09,482 FATAL
>>> > org.apache.hadoop.hbase.regionserver.LogRoller: Log rolling failed with
>>> > ioe:
>>> > java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
>>> >         at
>>> >
>>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)
>>> >
>>> > 2010-06-29 11:22:10,344 ERROR
>>> > org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to close log
>>> in
>>> > abort
>>> > java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
>>> >         at
>>> >
>>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)
>>> >
>>> >
>>> > This was fixed by restarting the master and starting the region servers
>>> > again, but it would be nice to know how to roll out changes more cleanly.
>>> >
>>> > How do other people here roll out updates to HBase / Hadoop? What order
>>> do
>>> > you restart services in and how long do you wait before moving to the
>>> next
>>> > node?
>>> >
>>> > Just so you know we currently have 5 nodes and are getting another 10 to
>>> add
>>> > soon.
>>> >
>>> > Thanks,
>>> >
>>> > --
>>> > Dan Harvey | Datamining Engineer
>>> > www.mendeley.com/profiles/dan-harvey
>>> >
>>> > Mendeley Limited | London, UK | www.mendeley.com
>>> > Registered in England and Wales | Company Number 6419015
>>>
>>>
>>
>>
>>
>>
>
>
>
>
