hbase-user mailing list archives

From Patrick Schless <patrick.schl...@gmail.com>
Subject Re: HDFS Restart with Replication
Date Tue, 06 Aug 2013 16:28:43 GMT
Hi J-D,

Thanks for the help.

I tried your suggestion ("hbase-daemon.sh stop master"), and this leaves
all the region servers running. This seems the same as the problematic case
I was in when I was stopping only the HMaster, and not the region servers,
and then bouncing HDFS. It seems like I want to make sure everything
(HMaster & region servers) is stopped before stopping HDFS.
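The full-stop ordering being described can be sketched as a shell script. This is a dry run that only echoes each command; the `$HBASE_HOME`/`$HADOOP_HOME` defaults and the `sbin/` layout are assumptions, not taken from this thread:

```shell
#!/usr/bin/env bash
# Dry-run sketch of the "stop everything before HDFS" ordering.
# Install paths are assumptions; swap run()'s body for "$@" to execute.
set -euo pipefail

run() { echo "would run: $*"; }

# 1. Stop HBase first: stop-hbase.sh asks the master to shut down the whole
#    cluster, region servers included, so nothing writes to HDFS afterwards.
run "${HBASE_HOME:-/usr/lib/hbase}/bin/stop-hbase.sh"

# 2. Only once HBase is fully down, stop HDFS.
run "${HADOOP_HOME:-/usr/lib/hadoop}/sbin/stop-dfs.sh"

# 3. Bring HDFS back and wait for it to leave safe mode.
run "${HADOOP_HOME:-/usr/lib/hadoop}/sbin/start-dfs.sh"
run hdfs dfsadmin -safemode wait

# 4. Restart HBase on top of a healthy HDFS.
run "${HBASE_HOME:-/usr/lib/hbase}/bin/start-hbase.sh"
```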

What's the problem with stopping the region servers before/after the
master? It does seem to work (no missing blocks, in my tests), but I don't
want to do it in prod if there's some risk of corruption.

Thanks,
Patrick


On Fri, Aug 2, 2013 at 5:33 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:

> Ah, then doing "bin/hbase-daemon.sh stop master" on the master node is
> the equivalent, but don't stop the region servers themselves as the
> master will take care of it. Doing a stop on the master and the region
> servers will screw things up.
>
> J-D
>
> On Fri, Aug 2, 2013 at 3:28 PM, Patrick Schless
> <patrick.schless@gmail.com> wrote:
> > Doesn't stop-hbase.sh (and its ilk) require the server to be able to
> > manage the clients (using unpassworded SSH keys, for instance)? I don't
> > have that set up (for security reasons). I use capistrano for all these
> > sorts of coordination tasks.
> >
> >
> > On Fri, Aug 2, 2013 at 12:07 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >
> >> Doing a bin/stop-hbase.sh is the way to go, then on the Hadoop side
> >> you do stop-all.sh. I think your ordering is correct but I'm not sure
> >> you are using the right commands.
> >>
> >> J-D
> >>
> >> On Fri, Aug 2, 2013 at 8:27 AM, Patrick Schless
> >> <patrick.schless@gmail.com> wrote:
> >> > Ah, I bet the issue is that I'm stopping the HMaster, but not the
> >> > Region Servers, then restarting HDFS. What's the correct order of
> >> > operations for bouncing everything?
> >> >
> >> >
> >> > On Thu, Aug 1, 2013 at 5:21 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >> >
> >> >> Can you follow the life of one of those blocks through the Namenode
> >> >> and datanode logs? I'd suggest you start by doing an fsck on one of
> >> >> those files with the option that gives the block locations first.
> >> >>
> >> >> By the way why do you have split logs? Are region servers dying every
> >> >> time you try out something?
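The fsck J-D mentions is `hdfs fsck <path> -files -blocks -locations`; a small helper like this one pulls the block IDs out of its output so each can be grepped in the NameNode and datanode logs (the sample output line below is illustrative, not captured from this cluster):

```shell
# Extract block IDs from `hdfs fsck <path> -files -blocks -locations` output
# so each one can be traced through the NameNode/datanode logs.
extract_block_ids() {
  grep -o 'blk_-\{0,1\}[0-9]\{1,\}' | sort -u
}

# Illustrative fsck output line (not from this cluster):
sample='0. BP-1:blk_-2036986832155369224_1001 len=1048576 repl=3 [10.0.0.5:50010]'
printf '%s\n' "$sample" | extract_block_ids

# Next step on the NameNode/datanode hosts would be, e.g.:
#   grep 'blk_-2036986832155369224' /var/log/hadoop-hdfs/*.log
```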
> >> >>
> >> >> On Thu, Aug 1, 2013 at 3:16 PM, Patrick Schless
> >> >> <patrick.schless@gmail.com> wrote:
> >> >> > Yup, 14 datanodes, all check back in. However, all of the corrupt
> >> >> > files seem to be splitlogs from data05. This is true even though
> >> >> > I've done several restarts (each restart adding a few missing
> >> >> > blocks). There's nothing special about data05, and it seems to be
> >> >> > in the cluster, the same as anyone else.
> >> >> >
> >> >> >
> >> >> > On Thu, Aug 1, 2013 at 5:04 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >> >> >
> >> >> >> I can't think of a way how your missing blocks would be related
> >> >> >> to HBase replication, there's something else going on. Are all
> >> >> >> the datanodes checking back in?
> >> >> >>
> >> >> >> J-D
> >> >> >>
> >> >> >> On Thu, Aug 1, 2013 at 2:17 PM, Patrick Schless
> >> >> >> <patrick.schless@gmail.com> wrote:
> >> >> >> > I'm running:
> >> >> >> > CDH4.1.2
> >> >> >> > HBase 0.92.1
> >> >> >> > Hadoop 2.0.0
> >> >> >> >
> >> >> >> > Is there an issue with restarting a standby cluster with
> >> >> >> > replication running? I am doing the following on the standby
> >> >> >> > cluster:
> >> >> >> >
> >> >> >> > - stop hmaster
> >> >> >> > - stop name_node
> >> >> >> > - start name_node
> >> >> >> > - start hmaster
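Concretely, those four steps come out to something like the following as CDH-style service commands (the service names are an assumption, and this is a dry run that only echoes; note that the region servers are never stopped anywhere in this sequence):

```shell
# Dry-run sketch of the standby restart steps listed above.
# CDH-style service names are an assumption; swap run()'s body for "$@"
# to execute for real.
set -eu
run() { echo "would run: $*"; }

run sudo service hbase-master stop            # stop hmaster
run sudo service hadoop-hdfs-namenode stop    # stop name_node
run sudo service hadoop-hdfs-namenode start   # start name_node
run sudo service hbase-master start           # start hmaster
# The region servers keep running throughout.
```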
> >> >> >> >
> >> >> >> > When the name node comes back up, it's reliably missing blocks.
> >> >> >> > I started with 0 missing blocks, and have run through this
> >> >> >> > scenario a few times, and am up to 46 missing blocks, all from
> >> >> >> > the table that is the standby for our production table (in a
> >> >> >> > different datacenter). The missing blocks are all from the same
> >> >> >> > table, and look like:
> >> >> >> >
> >> >> >> > blk_-2036986832155369224 /hbase/splitlog/data01.sea01.staging.tdb.com,60020,1372703317824_hdfs%3A%2F%2Fname-node.sea01.staging.tdb.com%3A8020%2Fhbase%2F.logs%2Fdata05.sea01.staging.tdb.com%2C60020%2C1373557074890-splitting%2Fdata05.sea01.staging.tdb.com%252C60020%252C1373557074890.1374960698485/tempodb-data/c9cdd64af0bfed70da154c219c69d62d/recovered.edits/0000000001366319450.temp
> >> >> >> >
> >> >> >> > Do I have to stop replication before restarting the standby?
> >> >> >> >
> >> >> >> > Thanks,
> >> >> >> > Patrick
> >> >> >>
> >> >>
> >>
>
