hbase-user mailing list archives

From Marcos Ortiz <mlor...@uci.cu>
Subject Re: importing a large table
Date Fri, 30 Mar 2012 14:11:12 GMT


On 03/30/2012 04:54 AM, Rita wrote:
> Thanks for the responses. I am using 0.90.4-cdh3. I exported the table
> using the HBase exporter. Yes, the previous table still exists but on a
> different cluster. My region servers are large, close to 12GB in size.
What is the total number of your regions?
>
> I want to understand HFiles better. Do we export the table as a series of
> HFiles and then import them back in?
Yes. The simplest way to do this is with the TableOutputFormat, but if 
you use the HFileOutputFormat instead, the process will be more 
efficient, because this feature (bulk loads) uses less CPU and 
network. With a MapReduce job, you prepare your data using the 
HFileOutputFormat (Hadoop's TotalOrderPartitioner class is used to 
partition the map output into disjoint ranges of the key space, 
corresponding to the key ranges of the regions in the table).
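
For example, here is a minimal sketch of the preparation job against the 
0.90.x API. The mapper, its tab-separated input, the table name "mytable" 
and the column family "f" are placeholders, not taken from your setup:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.KeyValue;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class BulkLoadPrepare {

    // Placeholder mapper: assumes tab-separated "rowkey<TAB>value" input and
    // writes into column family "f", qualifier "q".
    static class LineMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
      protected void map(LongWritable key, Text line, Context ctx)
          throws IOException, InterruptedException {
        String[] parts = line.toString().split("\t", 2);
        byte[] row = Bytes.toBytes(parts[0]);
        KeyValue kv = new KeyValue(row, Bytes.toBytes("f"), Bytes.toBytes("q"),
            Bytes.toBytes(parts[1]));
        ctx.write(new ImmutableBytesWritable(row), kv);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "bulk-load-prepare");
      job.setJarByClass(BulkLoadPrepare.class);
      job.setMapperClass(LineMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(KeyValue.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));

      // Wires in HFileOutputFormat and the TotalOrderPartitioner so that
      // reducer output ranges line up with the table's existing regions.
      HTable table = new HTable(conf, "mytable");
      HFileOutputFormat.configureIncrementalLoad(job, table);

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }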


> What is the difference between that and the
> regular MR export job?
The main difference from a regular MR job is the output: instead of 
using the classic output formats like TextOutputFormat, 
MultipleOutputFormat, SequenceFileOutputFormat, etc., you use the 
HFileOutputFormat, which writes HBase's native data file format (HFile).
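
Once that job finishes, the generated HFiles are moved into the table with 
the bulk-load step. A rough sketch, again with the output path and table 
name as placeholders:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

  public class CompleteBulkLoad {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      // Moves the HFiles produced by the preparation job into the regions
      // of the target table.
      LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
      loader.doBulkLoad(new Path(args[0]), new HTable(conf, "mytable"));
    }
  }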
>   The idea sounds good because it sounds simple on the
> surface :-)

>
>
>
>
> On Fri, Mar 30, 2012 at 12:08 AM, Stack<stack@duboce.net>  wrote:
>
>> On Thu, Mar 29, 2012 at 7:57 PM, Rita<rmorgan466@gmail.com>  wrote:
>>> Hello,
>>>
>>> I am importing a 40+ billion row table which I exported several months ago.
>>> The data size is close to 18TB on hdfs (3x replication).
>>>
>> Does the table from back then still exist?  Or do you remember what
>> the key spread was like?  Could you precreate the old table?
>>
>>> My problem is when I try to import it with mapreduce it takes a few days --
>>> which is ok -- however when the job fails for whatever reason, I have to
>>> restart everything. Is it possible to import the table in chunks like,
>>> import 1/3, 2/3, and then finally 3/3 of the table?
>>>
>> Yeah.  Funny how the plug gets pulled on the rack when the three-day
>> job is 95% done.
>>
>>> Btw, the job creates close to 150k mapper tasks, that's a problem waiting to
>>> happen :-)
>>>
>> Are you running 0.92?  If not, you should be, and go for bigger regions.  10G?
>>
>> St.Ack
>>
>
>

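Also, regarding Stack's point above about precreating the old table: if you 
know the old key spread, you can create the new table with those split 
points up front, so the import doesn't spend time splitting regions as it 
goes. A rough sketch (table name, family and split keys are placeholders):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.util.Bytes;

  public class PrecreateTable {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);
      HTableDescriptor desc = new HTableDescriptor("mytable");
      desc.addFamily(new HColumnDescriptor("f"));
      // Split points taken from the old table's region boundaries.
      byte[][] splits = new byte[][] {
          Bytes.toBytes("row-0333"), Bytes.toBytes("row-0666") };
      admin.createTable(desc, splits);
    }
  }
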
-- 
Marcos Luis Ortíz Valmaseda (@marcosluis2186)
  Data Engineer at UCI
  http://marcosluis2186.posterous.com


