hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: issues copying data from one table to another
Date Sat, 18 Aug 2012 11:14:38 GMT
Can you disable the table? 
How much free disk space do you have? 

Is this a production cluster?
Can you upgrade to CDH3u5?

Are you running a capacity scheduler or fair scheduler?

Just out of curiosity, what would happen if you could disable the table, alter the table's
max file size, and then attempt to merge regions?  Note: I've never tried this and don't know
if it's possible -- just thinking outside of the box...
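For what it's worth, the shell steps would look roughly like this (an untested sketch -- the alter syntax and the offline Merge utility are as I remember them for 0.90-era HBase, and the table/region names are placeholders, so verify before trying on anything important):

```shell
# In the hbase shell: take the table offline and raise its max region size (bytes)
disable 'mytable'
alter 'mytable', {METHOD => 'table_att', MAX_FILESIZE => '10737418240'}

# The offline Merge utility merges two adjacent regions at a time;
# the full region names can be found in the .META. table
hbase org.apache.hadoop.hbase.util.Merge mytable <region1-name> <region2-name>
```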

Outside of that... the safest way to do this would be to export the table. You'll get 2800
mappers, so if you are using a scheduler, you can put this into a queue that limits the number
of concurrent mappers. 

When you import the data into your new table, you can run on an even more restrictive queue
so that you have less of an impact on your system.  The downside is that it's going to take
a bit longer to run. Again, it's probably the safest way to do this....
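Concretely, the export/import pair would look something like this (a sketch -- the pool name "throttled" is an assumption for a fair-scheduler setup; under the capacity scheduler you'd set mapred.job.queue.name instead):

```shell
# Export the source table to HDFS -- one mapper per region (~2800 here)
hbase org.apache.hadoop.hbase.mapreduce.Export 'oldtable' /hbase-export/oldtable

# Import into the pre-split target table, pinned to a restricted pool
# so concurrent mappers are capped and client traffic is less affected
hbase org.apache.hadoop.hbase.mapreduce.Import \
  -Dmapred.fairscheduler.pool=throttled \
  'newtable' /hbase-export/oldtable
```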



On Aug 17, 2012, at 2:17 PM, Norbert Burger <norbert.burger@gmail.com> wrote:

> Hi folks -- we're running CDH3u3 (0.90.4).  I'm trying export data
> from an existing table that has far too many regions (2600+ for only 8
> regionservers) into one with a more reasonable region count for this
> cluster (256).  Overall data volume is approx. 3 TB.
> I thought initially that I'd use the bulkload/importtsv approach, but
> it turns out this table's schema has column qualifiers made from
> timestamps, so it's impossible for me to specify a list of target
> columns for importtsv.  From what I can tell, the TSV interchange
> format requires your data to have the same colquals throughout.
> I took a look at CopyTable and Export/Import, which both appear to
> wrap the Hbase client API (emitting Puts from a mapper).  But I'm
> seeing significant performance problems with this approach, to the
> point that I'm not sure it's feasible.  Export appears to work OK, but
> when I try importing the data back from HDFS, the rest of our cluster
> drags to a halt -- client writes (even those not associated with the
> Import) start timing out.  Fwiw, import already disables autoFlush
> (via TableOutputFormat).
> From [1], one option I could try would be to disable the WAL.  Are there
> other techniques I should try?  Has anyone implemented a
> bulkloader which doesn't use the TSV format?
> Norbert
> [1] http://hbase.apache.org/book/perf.writing.html
