Subject: Re: issues copying data from one table to another
From: Michael Segel
To: user@hbase.apache.org
Date: Sat, 18 Aug 2012 06:14:38 -0500
Can you disable the table?
How much free disk space do you have?
Is this a production cluster? Can you upgrade to CDH3u5?
Are you running a capacity scheduler or fair scheduler?

Just out of curiosity, what would happen if you could disable the table, alter the table's max file size, and then attempted to merge regions? Note: I've never tried this and don't know if it's possible; just thinking outside the box...

Outside of that, the safest way to do this would be to export the table. You'll get 2,800 mappers, so if you are using a scheduler, you can just put the job into a queue that limits the number of concurrent mappers.

When you import the data into your new table, you can run on an even more restrictive queue so that you have less of an impact on your system. The downside is that it's going to take a bit longer to run. Again, it's probably the safest way to do this...

HTH,
-Mike

On Aug 17, 2012, at 2:17 PM, Norbert Burger wrote:

> Hi folks -- we're running CDH3u3 (0.90.4). I'm trying to export data
> from an existing table that has far too many regions (2,600+ for only
> 8 regionservers) into one with a more reasonable region count for
> this cluster (256). Overall data volume is approx. 3 TB.
>
> I thought initially that I'd use the bulkload/importtsv approach, but
> it turns out this table's schema has column qualifiers made from
> timestamps, so it's impossible for me to specify a list of target
> columns for importtsv. From what I can tell, the TSV interchange
> format requires your data to have the same column qualifiers
> throughout.
>
> I took a look at CopyTable and Export/Import, which both appear to
> wrap the HBase client API (emitting Puts from a mapper).
> But I'm seeing significant performance problems with this approach,
> to the point that I'm not sure it's feasible. Export appears to work
> OK, but when I try importing the data back from HDFS, the rest of our
> cluster drags to a halt -- client writes (even those not associated
> with the Import) start timing out. FWIW, Import already disables
> autoFlush (via TableOutputFormat).
>
> From [1], one option I could try would be to disable the WAL. Are
> there other techniques I should try? Has anyone implemented a
> bulkloader which doesn't use the TSV format?
>
> Norbert
>
> [1] http://hbase.apache.org/book/perf.writing.html
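
[For the archives: the export/import route Mike suggests looks roughly like this on CDH3/0.90. This is a sketch, not a tested recipe -- the table names, HDFS path, and the queue name `throttled` are placeholders; `mapred.job.queue.name` assumes the capacity scheduler (fair-scheduler setups would set `mapred.fairscheduler.pool` instead).]

```shell
#!/usr/bin/env sh
# Sketch: dump a table to HDFS with Export, then replay it into a
# pre-split target table with Import, both throttled via a scheduler
# queue so the jobs don't swamp the cluster. Names are placeholders.

TABLE=oldtable                        # source table (one mapper per region)
NEWTABLE=newtable                     # pre-created target with fewer regions
EXPORT_DIR=/user/backup/oldtable_export
QUEUE=throttled                       # assumed capacity-scheduler queue

# Step 1: export the table to sequence files on HDFS.
EXPORT_CMD="hbase org.apache.hadoop.hbase.mapreduce.Export \
  -D mapred.job.queue.name=$QUEUE \
  $TABLE $EXPORT_DIR"

# Step 2: replay the dump into the new table (can use an even more
# restrictive queue, per the advice above).
IMPORT_CMD="hbase org.apache.hadoop.hbase.mapreduce.Import \
  -D mapred.job.queue.name=$QUEUE \
  $NEWTABLE $EXPORT_DIR"

echo "$EXPORT_CMD"
echo "$IMPORT_CMD"
# eval "$EXPORT_CMD" && eval "$IMPORT_CMD"   # uncomment on a real cluster
```

Since Export writes HBase Result objects as sequence files rather than TSV, it sidesteps the importtsv limitation with timestamp-derived column qualifiers.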