hbase-dev mailing list archives

From Bryan Beaudreault <bbeaudrea...@hubspot.com>
Subject Re: A better way to migrate the whole cluster?
Date Fri, 15 Aug 2014 17:38:42 GMT
I agree it would be nice if this were provided by HBase, but it's already
possible to work directly with the HFiles.  All you need is a custom Hadoop
job.  A good starting point is
https://github.com/mozilla-metrics/akela/blob/master/src/main/java/com/mozilla/hadoop/Backup.java
which you can adapt to your needs. We've used our own modification of this
job many times for our own cluster migrations.  The idea is that it's
incremental: as HFiles get compacted, deleted, etc., you can just run it
again and move progressively smaller amounts of data.
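For reference, the same incremental idea can be sketched with plain
`hadoop distcp` instead of a custom job (the cluster names and paths below
are placeholders, not real endpoints; commands are echoed as a dry run):

```shell
#!/bin/sh
# Incremental copy of HBase data files at the HDFS level.
# -update skips files that already exist at the destination with the same
# size/checksum, so re-running it moves only new or changed HFiles.
SRC="hdfs://src-cluster:8020/hbase/table_name"   # placeholder source path
DST="hdfs://dst-cluster:8020/hbase/table_name"   # placeholder destination

# Echoed as a dry run; drop the leading 'echo' to launch the actual MR job.
echo hadoop distcp -update -m 50 "$SRC" "$DST"
```

Bumping `-m` (mapper count) is how you get more parallelism than CopyTable,
at the cost of more I/O pressure on the source cluster.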

Working at the HDFS level should be faster, as you can use more mappers.
You will still be taxing the I/O of the source cluster, but not adding load
to the actual RegionServer processes (IPC queues, memory, etc.).

If you upgrade to CDH5 (or the equivalent HDFS version), you can use HDFS
snapshots to minimize the need to re-run the above Backup job (since you
are already using replication to keep data up to date).
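A rough sketch of that HDFS-snapshot workflow, assuming placeholder paths
and an HDFS new enough to support snapshots (the `distcp -diff` delta copy
additionally needs Hadoop 2.7+ and the baseline snapshot present and
unmodified on the destination); commands are echoed as a dry run:

```shell
#!/bin/sh
# Placeholder cluster paths, not real endpoints.
SRC_DIR="hdfs://src-cluster:8020/hbase"
DST_DIR="hdfs://dst-cluster:8020/hbase"

# One-time setup: allow snapshots, take a baseline, do the full copy.
echo hdfs dfsadmin -allowSnapshot "$SRC_DIR"
echo hdfs dfs -createSnapshot "$SRC_DIR" s0
echo hadoop distcp -update "$SRC_DIR/.snapshot/s0" "$DST_DIR"

# Later runs: take a new snapshot and copy only the s0 -> s1 delta
# (requires s0 to also exist, untouched, on the destination).
echo hdfs dfs -createSnapshot "$SRC_DIR" s1
echo hadoop distcp -update -diff s0 s1 "$SRC_DIR" "$DST_DIR"
```

The snapshot gives you a consistent view of the HFiles while the copy runs,
so compactions on the live cluster can't pull files out from under distcp.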


On Fri, Aug 15, 2014 at 1:11 PM, Esteban Gutierrez <esteban@cloudera.com>
wrote:

> 1.8TB in a day is not terribly slow if that number comes from the CopyTable
> counters and you are moving data across data centers over public networks;
> that works out to about 20MB/sec. Also, CopyTable won't compress anything on
> the wire, so the network overhead can be significant. If you use anything
> like snappy for block compression and/or fast_diff for block encoding on the
> HFiles, then taking snapshots and exporting them with the ExportSnapshot
> tool should be the way to go.
>
> cheers,
> esteban.
>
>
>
> --
> Cloudera, Inc.
>
>
>
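To make the snapshot route Esteban suggests concrete, here's a sketch with
placeholder names (snapshot name and destination URL are made up; on 0.94.x
snapshots also need hbase.snapshot.enabled=true); commands are echoed as a
dry run:

```shell
#!/bin/sh
# Placeholder table/snapshot names and destination, not real endpoints.
TABLE="table_name"
SNAP="${TABLE}_snap"
DST="hdfs://dst-cluster:8020/hbase"

# Take the snapshot from the HBase shell (metadata only, no data copied):
echo "snapshot '$TABLE', '$SNAP'"   # run this line inside: hbase shell

# Ship the referenced HFiles with a MapReduce job. Compressed/encoded
# blocks are copied as-is, so nothing is re-encoded on the wire.
echo hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot "$SNAP" -copy-to "$DST" -mappers 16
```

Once exported, the snapshot can be cloned into a live table on the
destination cluster, and replication covers the writes that arrive after
the snapshot was taken.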
> On Thu, Aug 14, 2014 at 11:24 PM, tobe <tobeg3oogle@gmail.com> wrote:
>
> > Thanks, @lars.
> >
> > We're using HBase 0.94.11 and followed the instructions to run `./bin/hbase
> > org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=hbase://cluster_name
> > table_name`. We have a namespace service that resolves the ZooKeeper quorum
> > from "hbase://cluster_name". The job ran on a shared YARN cluster.
> >
> > The performance is affected by many factors, but we haven't found the
> > cause. It would be great to hear your suggestions.
> >
> >
> > On Fri, Aug 15, 2014 at 1:34 PM, lars hofhansl <larsh@apache.org> wrote:
> >
> > > What version of HBase? How are you running CopyTable? A day for 1.8T is
> > > not what we would expect.
> > > You can definitely take a snapshot and then export the snapshot to
> > another
> > > cluster, which will move the actual files; but CopyTable should not be
> so
> > > slow.
> > >
> > >
> > > -- Lars
> > >
> > >
> > >
> > > ________________________________
> > >  From: tobe <tobeg3oogle@gmail.com>
> > > To: "user@hbase.apache.org" <user@hbase.apache.org>
> > > Cc: dev@hbase.apache.org
> > > Sent: Thursday, August 14, 2014 8:18 PM
> > > Subject: A better way to migrate the whole cluster?
> > >
> > >
> > > Sometimes our users want to upgrade their servers or move to a new
> > > datacenter, and then we have to migrate the data out of HBase. Currently
> > > we enable replication from the old cluster to the new cluster, and run
> > > CopyTable to move the older data.
> > >
> > > It's a little inefficient. It takes more than one day to migrate 1.8T of
> > > data, and more time to verify. Is there a better way to do this, such as
> > > snapshots or working purely with HDFS files?
> > >
> > > And what's the best practice, in your experience?
> > >
> >
>
