hbase-user mailing list archives

From Nathaniel Cook <nathani...@qualtrics.com>
Subject Re: HBase Replication problems
Date Tue, 14 Dec 2010 18:34:34 GMT
Sounds good. I'll do some digging around.

On Tue, Dec 14, 2010 at 11:31 AM, Jean-Daniel Cryans
<jdcryans@apache.org> wrote:
> Good!
>
> I'm not sure why it's not working for you with two ensembles... here
> it works between two clusters that are in two different datacenters
> using different ZK ensembles. You could try inserting debug statements
> in the code and see where the mix-up happens.
>
> Thx,
>
> J-D
>
> On Tue, Dec 14, 2010 at 10:26 AM, Nathaniel Cook
> <nathanielc@qualtrics.com> wrote:
>> So, I got it working :)
>>
>> Because of these strange connection/configuration issues, I decided to
>> just serve both clusters from one ZK quorum. I set the
>> zookeeper.znode.parent to hbase_bk, then set up the replication again,
>> and it is all working. It is even keeping up with some initial load
>> testing. Thanks.
>>
>> I think we should still look into why it couldn't talk to two
>> different ZK quorums, but this works for now.
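>> For anyone following along, here is a sketch of what the slave
>> cluster's hbase-site.xml ends up with in this single-ensemble setup.
>> The quorum hostnames are illustrative, and I'm assuming the parent
>> znode is written as /hbase_bk:

```xml
<!-- Slave cluster hbase-site.xml (sketch): share one ZK ensemble with
     the master cluster, keeping the clusters apart via separate znode
     parents. Hostnames and the /hbase_bk parent are assumptions. -->
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>ds1,ds2,ds3</value>
</property>
<property>
  <name>zookeeper.znode.parent</name>
  <value>/hbase_bk</value>
</property>
<property>
  <name>hbase.replication</name>
  <value>true</value>
</property>
```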
>>
>>
>> On Mon, Dec 13, 2010 at 5:38 PM, Nathaniel Cook
>> <nathanielc@qualtrics.com> wrote:
>>> Yes correct IP address.
>>>
>>> On Mon, Dec 13, 2010 at 5:24 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>> Just to be clear, does ping show the right IP address too? That's the
>>>> real concern here.
>>>>
>>>> Thx
>>>>
>>>> J-D
>>>>
>>>> On Mon, Dec 13, 2010 at 4:16 PM, Nathaniel Cook
>>>> <nathanielc@qualtrics.com> wrote:
>>>>> The hostnames are resolving fine. I can ping bk1-4 from ds1-4 and vice versa.
>>>>>
>>>>> On Mon, Dec 13, 2010 at 5:11 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>>>> It sounds like your master cluster resolves bk1-4 as ds1-4. Could
>>>>>> you check that by doing a ping on those hostnames from those machines?
>>>>>> Otherwise, I can't see what the error could be at the moment...
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Mon, Dec 13, 2010 at 3:55 PM, Nathaniel Cook
>>>>>> <nathanielc@qualtrics.com> wrote:
>>>>>>> Running the 'ls /hbase/rs' command through zkcli on the master I get:
>>>>>>>
>>>>>>> [ds2.internal,60020,1292278767510, ds3.internal,60020,1292278776930,
>>>>>>> ds1.internal,60020,1292278759087, ds4.internal,60020,1292278792724]
>>>>>>>
>>>>>>> On my slave cluster I get:
>>>>>>>
>>>>>>> [bk1.internal,60020,1292278881467, bk3.internal,60020,1292278895189,
>>>>>>> bk2.internal,60020,1292278888034, bk4.internal,60020,1292278905096]
>>>>>>>
>>>>>>> But as I mentioned, the peer it chooses is ds4 from the master cluster.
>>>>>>>
>>>>>>> Could it be that for some reason the Configuration passed to
>>>>>>> ZooKeeperWrapper.createInstance for the slave cluster isn't honored
>>>>>>> and is defaulting to the local connection settings? I am running a
>>>>>>> QuorumPeer on the same machine as the RegionServers for these test
>>>>>>> clusters. Could it be finding the zoo.cfg file on that machine that
>>>>>>> points to the local quorum?
>>>>>>>
>>>>>>> To test this I wrote a quick JRuby script:
>>>>>>> #------------------------------------------------------
>>>>>>> include Java
>>>>>>> import org.apache.hadoop.hbase.HBaseConfiguration
>>>>>>> import org.apache.hadoop.hbase.HConstants
>>>>>>> import org.apache.hadoop.conf.Configuration
>>>>>>> import org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper
>>>>>>>
>>>>>>> # Expects one argument of the form quorum:port:znode_parent
>>>>>>> parts1 = ARGV[0].split(":")
>>>>>>>
>>>>>>> # Build a config that points at the given ensemble
>>>>>>> c1 = HBaseConfiguration.create()
>>>>>>> c1.set(HConstants::ZOOKEEPER_QUORUM, parts1[0])
>>>>>>> c1.set("hbase.zookeeper.property.clientPort", parts1[1])
>>>>>>> c1.set(HConstants::ZOOKEEPER_ZNODE_PARENT, parts1[2])
>>>>>>>
>>>>>>> # Connect with that config and write a test znode under the parent
>>>>>>> zkw = ZooKeeperWrapper.createInstance(c1, "ZK")
>>>>>>> zkw.writeZNode(parts1[2], "test", "")
>>>>>>> #------------------------------------------------------
>>>>>>>
>>>>>>> I ran it from the master cluster and gave it the address of the
>>>>>>> slave quorum with this command:
>>>>>>>
>>>>>>> hbase org.jruby.Main testZK.rb bk1,bk2,bk3:2181:/hbase
>>>>>>>
>>>>>>> The slave ZK quorum didn't have the '/hbase/test' node, but the
>>>>>>> master ZK quorum did. The script didn't honor the specified
>>>>>>> configuration.
>>>>>>> Any thoughts?
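>>>>>>> As an aside, the "quorum:port:znode_parent" spec that script (and
>>>>>>> add_peer.rb) takes apart can be sketched in plain Ruby; this
>>>>>>> parse_cluster_key helper is purely illustrative, not anything in
>>>>>>> HBase:

```ruby
# Plain-Ruby sketch (not HBase code) of parsing a cluster key like
# "bk1,bk2,bk3:2181:/hbase". The limit of 3 on split keeps any later
# ':' characters inside the znode parent path intact.
def parse_cluster_key(key)
  quorum, port, parent = key.split(":", 3)
  { hosts: quorum.split(","), port: Integer(port), parent: parent }
end

key = parse_cluster_key("bk1,bk2,bk3:2181:/hbase")
# key[:hosts]  => ["bk1", "bk2", "bk3"]
# key[:port]   => 2181
# key[:parent] => "/hbase"
```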
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Dec 13, 2010 at 4:04 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>>>>>> Interesting... the fact that it says that it's connecting to
>>>>>>>> bk1,bk2,bk3 means that it's looking at the right ZooKeeper ensemble.
>>>>>>>> What it does next is read all the znodes in /hbase/rs/ (which is
>>>>>>>> the list of live region servers) and choose a subset of them.
>>>>>>>>
>>>>>>>> Using the zkcli utility, could you check the values of those znodes
>>>>>>>> and see if they make sense? You can run it like this:
>>>>>>>>
>>>>>>>> bin/hbase zkcli
>>>>>>>>
>>>>>>>> And it will run against the ensemble that cluster is using.
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Mon, Dec 13, 2010 at 2:03 PM, Nathaniel Cook
>>>>>>>> <nathanielc@qualtrics.com> wrote:
>>>>>>>>> When the master cluster chooses a peer, it is supposed to choose
>>>>>>>>> one from the slave cluster, correct?
>>>>>>>>>
>>>>>>>>> This is what I am seeing in the master cluster logs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Added new peer cluster bk1,bk2,bk3,2181,/hbase
>>>>>>>>> Getting 1 rs from peer cluster # test
>>>>>>>>> Choosing peer 192.168.1.170:60020
>>>>>>>>>
>>>>>>>>> But 192.168.1.170 is an address in the master cluster. I think this
>>>>>>>>> may be related to the problem I had while running the add_peer.rb
>>>>>>>>> script. When I ran that script it would only talk to the ZK quorum
>>>>>>>>> running on that machine and would not talk to the slave ZK quorum.
>>>>>>>>> Could it be that when it is trying to choose a peer, instead of
>>>>>>>>> going to the slave ZK quorum running on a different machine, it is
>>>>>>>>> talking only to the ZK quorum running on its localhost?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Dec 13, 2010 at 2:51 PM, Nathaniel Cook
>>>>>>>>> <nathanielc@qualtrics.com> wrote:
>>>>>>>>>> Thanks for looking into this with me.
>>>>>>>>>>
>>>>>>>>>> OK, so on the master region servers I am getting the two
>>>>>>>>>> statements 'Replicating x' and 'Replicated in total: y'.
>>>>>>>>>>
>>>>>>>>>> Nothing on the slave cluster.
>>>>>>>>>>
>>>>>>>>>> On Mon, Dec 13, 2010 at 12:28 PM, Jean-Daniel Cryans
>>>>>>>>>> <jdcryans@apache.org> wrote:
>>>>>>>>>>> Hi Nathaniel,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for trying out replication; let's make it work for you.
>>>>>>>>>>>
>>>>>>>>>>> So on the master side there are two lines that are important for
>>>>>>>>>>> making sure that replication works. First it has to say:
>>>>>>>>>>>
>>>>>>>>>>> Replicating x
>>>>>>>>>>>
>>>>>>>>>>> Where x is the number of edits it's going to ship, and then:
>>>>>>>>>>>
>>>>>>>>>>> Replicated in total: y
>>>>>>>>>>>
>>>>>>>>>>> Where y is the total number it replicated. Seeing the second line
>>>>>>>>>>> means that replication was successful, at least from the master's
>>>>>>>>>>> point of view.
>>>>>>>>>>>
>>>>>>>>>>> On the slave, one node should have:
>>>>>>>>>>>
>>>>>>>>>>> Total replicated: z
>>>>>>>>>>>
>>>>>>>>>>> And z is the number of edits that that region server applied on
>>>>>>>>>>> its cluster. It could be on any region server, since the sink for
>>>>>>>>>>> replication is chosen at random.
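>>>>>>>>>>> For illustration, that random sink choice could be sketched in
>>>>>>>>>>> plain Ruby like this (toy code, not the HBase source; the 10%
>>>>>>>>>>> ratio is an assumption for the example):

```ruby
# Toy sketch (plain Ruby, not HBase code): pick a random subset of the
# slave's live region servers (the /hbase/rs children) to use as
# replication sinks. The 10% ratio is an assumption for illustration.
def choose_sinks(region_servers, ratio = 0.1)
  count = [(region_servers.size * ratio).ceil, 1].max  # keep at least one
  region_servers.sample(count)
end

slave_rs = ["bk1.internal,60020,1292278881467",
            "bk2.internal,60020,1292278888034",
            "bk3.internal,60020,1292278895189",
            "bk4.internal,60020,1292278905096"]
sinks = choose_sinks(slave_rs)
# => an array with one server chosen at random from slave_rs
```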
>>>>>>>>>>>
>>>>>>>>>>> Do you see those? Any exceptions around those logs apart from EOFs?
>>>>>>>>>>>
>>>>>>>>>>> Thx,
>>>>>>>>>>>
>>>>>>>>>>> J-D
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Dec 13, 2010 at 10:52 AM, Nathaniel Cook
>>>>>>>>>>> <nathanielc@qualtrics.com> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I am trying to set up replication for my HBase clusters. I have
>>>>>>>>>>>> two small clusters for testing, each with 4 machines. The setup
>>>>>>>>>>>> for the two clusters is identical. Each machine runs a DataNode
>>>>>>>>>>>> and an HRegionServer. Three of the machines run a ZK peer and
>>>>>>>>>>>> one machine runs the HMaster and NameNode. The master cluster
>>>>>>>>>>>> machines have hostnames (ds1, ds2, ...) and the slave cluster
>>>>>>>>>>>> has (bk1, bk2, ...). I set the replication scope to 1 for my
>>>>>>>>>>>> test table column families and set the hbase.replication
>>>>>>>>>>>> property to true for both clusters. Next I ran the add_peer.rb
>>>>>>>>>>>> script with the following command on the ds1 machine:
>>>>>>>>>>>>
>>>>>>>>>>>> hbase org.jruby.Main /usr/lib/hbase/bin/replication/add_peer.rb
>>>>>>>>>>>> ds1:2181:/hbase bk1:2181:/hbase
>>>>>>>>>>>>
>>>>>>>>>>>> After the script finished, ZK for the master cluster had the
>>>>>>>>>>>> replication znode with children peers, master, and state. The
>>>>>>>>>>>> slave ZK didn't have a replication znode. I fixed that problem
>>>>>>>>>>>> by rerunning the script on the bk1 machine and commenting out
>>>>>>>>>>>> the code that writes to the master ZK. Now the slave ZK has the
>>>>>>>>>>>> /hbase/replication/master znode with data (ds1:2181:/hbase).
>>>>>>>>>>>> Everything looked to be configured correctly. I restarted the
>>>>>>>>>>>> clusters. The logs of the master regionservers stated:
>>>>>>>>>>>>
>>>>>>>>>>>> This cluster (ds1:2181:/hbase) is a master for replication,
>>>>>>>>>>>> compared with (ds1:2181:/hbase)
>>>>>>>>>>>>
>>>>>>>>>>>> The logs on the slave cluster stated:
>>>>>>>>>>>>
>>>>>>>>>>>> This cluster (bk1:2181:/hbase) is a slave for replication,
>>>>>>>>>>>> compared with (ds1:2181:/hbase)
>>>>>>>>>>>>
>>>>>>>>>>>> Using the hbase shell I put a row into the test table.
>>>>>>>>>>>>
>>>>>>>>>>>> The regionserver for that table had a log statement like:
>>>>>>>>>>>>
>>>>>>>>>>>> Going to report log #192.168.1.166%3A60020.1291757445179 for
>>>>>>>>>>>> position 15828 in
>>>>>>>>>>>> hdfs://ds1:9000/hbase/.logs/ds1.internal,60020,1291757445059/192.168.1.166%3A60020.1291757445179
>>>>>>>>>>>>
>>>>>>>>>>>> (192.168.1.166 is ds1)
>>>>>>>>>>>>
>>>>>>>>>>>> I waited, and even after several minutes the row still did not
>>>>>>>>>>>> appear in the slave cluster table.
>>>>>>>>>>>>
>>>>>>>>>>>> Any help with what the problem might be is greatly appreciated.
>>>>>>>>>>>>
>>>>>>>>>>>> Both clusters are using CDH3b3. The HBase version is exactly
>>>>>>>>>>>> 0.89.20100924+28.
>>>>>>>>>>>>
>>>>>>>>>>>> -Nathaniel Cook
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> -Nathaniel Cook
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> -Nathaniel Cook
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> -Nathaniel Cook
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> -Nathaniel Cook
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> -Nathaniel Cook
>>>
>>
>>
>>
>> --
>> -Nathaniel Cook
>>
>



-- 
-Nathaniel Cook
