hbase-user mailing list archives

From ashish singhi <ashish.sin...@huawei.com>
Subject RE: Pattern for Bulk Loading to Remote HBase Cluster
Date Thu, 09 Mar 2017 05:12:53 GMT

Did you try pointing the importtsv output path at the remote HDFS?
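
For example, something along these lines (just a sketch; the namenode address and the output directory are placeholders, not values from your setup):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoteBulkOutput {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "bulk load prep");
    // A fully qualified URI overrides fs.defaultFS, so the HFiles are
    // written directly onto the remote cluster's filesystem.
    // "remote-nn:8020" and "/tmp/hfiles" are placeholders.
    FileOutputFormat.setOutputPath(job,
        new Path("hdfs://remote-nn:8020/tmp/hfiles"));
    // ... the rest of the job setup (mapper, HFileOutputFormat2, submit) ...
  }
}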


-----Original Message-----
From: Ben Roling [mailto:ben.roling@gmail.com] 
Sent: 09 March 2017 03:22
To: user@hbase.apache.org
Subject: Pattern for Bulk Loading to Remote HBase Cluster

My organization is looking at making some changes that would introduce HBase bulk loads that
write into a remote cluster.  Today our bulk loads write to a local HBase.  By local, I mean
the home directory of the user preparing and executing the bulk load is on the same HDFS filesystem
as the HBase cluster.  In the remote cluster case, the HBase being loaded to will be on a
different HDFS filesystem.

The thing I am wondering about is the best pattern for the job preparing the bulk load to
determine where to write its HFiles.
Typical examples write the HFiles somewhere in the user's home directory.
When HBase is local, that works perfectly well.  With remote HBase, it can work, but it results
in writing the files twice: once by the preparation job and a second time by the RegionServer,
which reacts to the bulk load by copying the HFiles into the filesystem it is running on.

Ideally the preparation job would have some mechanism to know where to write the files such
that they are initially written on the same filesystem as HBase itself.  This way the bulk
load can simply move them into the HBase storage directory, as happens when bulk loading
to a local cluster.
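
For concreteness, the move I am describing is the completebulkload step, which looks roughly like this with the 1.x-era client API (the table name is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class CompleteBulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName name = TableName.valueOf("mytable");  // placeholder table name
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(name);
         RegionLocator locator = conn.getRegionLocator(name);
         Admin admin = conn.getAdmin()) {
      // args[0] is the directory of prepared HFiles.  When it is on the
      // same filesystem as hbase.rootdir this completes as a rename; when
      // it is not, the data has to be copied over first.
      new LoadIncrementalHFiles(conf)
          .doBulkLoad(new Path(args[0]), admin, table, locator);
    }
  }
}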

I've considered a pattern where the bulk load preparation job reads the hbase.rootdir property
and pulls the filesystem off of that.  Then, it sticks the output in some directory (e.g.
/tmp) on that same filesystem.
I'm inclined to think that hbase.rootdir should only be considered a server-side property
and as such I shouldn't expect it to be present in client configuration.  Under that assumption,
this isn't really a workable strategy.
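
Concretely, that pattern would amount to something like this sketch, which only works if hbase.rootdir actually shows up in the client configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class StagingFromRootDir {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Assumes hbase.rootdir made it into the client config; this returns
    // null otherwise, which is the weakness of the whole approach.
    Path rootDir = new Path(conf.get("hbase.rootdir"));
    FileSystem hbaseFs = rootDir.getFileSystem(conf);
    // Stage the job output on the same filesystem HBase lives on:
    Path staging = new Path(hbaseFs.getUri().toString(),
        "/tmp/bulkload-" + System.currentTimeMillis());
    System.out.println("staging dir: " + staging);
  }
}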

It feels like HBase should have a mechanism for sharing a staging directory with clients doing
bulk loads.  Doing some searching, I ran across "hbase.bulkload.staging.dir", but my impression
is that its intent does not exactly align with mine.  I've read about it here [1].  It seems
the idea is that users prepare HFiles in their own directory, then SecureBulkLoad moves them
to "hbase.bulkload.staging.dir".  A move like that isn't really a move when dealing with a
remote HBase cluster.  Instead it is a copy.  A question would be: why doesn't the job just
write the files to "hbase.bulkload.staging.dir" initially and skip the extra step of moving them?
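
To illustrate why that cross-cluster case degrades to a copy: a FileSystem rename cannot cross filesystem boundaries, so the "move" has to fall back to copying the bytes, roughly like this (namenode addresses are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CrossClusterMove {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path src = new Path("hdfs://local-nn:8020/user/me/hfiles");
    Path dst = new Path("hdfs://remote-nn:8020/hbase-staging/hfiles");
    FileSystem srcFs = src.getFileSystem(conf);
    FileSystem dstFs = dst.getFileSystem(conf);
    // rename(src, dst) cannot succeed across two different filesystems,
    // so the only option is a full copy of the data:
    FileUtil.copy(srcFs, src, dstFs, dst, false /* deleteSource */, conf);
  }
}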

I've been inclined to invent my own application-specific Hadoop property to communicate
an HBase-local staging directory to my bulk load preparation jobs.  I don't feel perfectly
good about that idea though.  I'm curious to hear experiences or opinions from others.  Should
I have my bulk load prep jobs look at "hbase.rootdir" or "hbase.bulkload.staging.dir" and
make sure those get propagated to client configuration?  Is there some other mechanism that
already exists for clients to discover an HBase-local directory to write the files?
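
If I go the application-specific route, the client side would be trivial, something like this, where "myapp.bulkload.staging.dir" is a property name I would be inventing and an admin would have to set in the client-side configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class AppStagingDir {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Invented, application-specific key with a fallback default:
    Path staging = new Path(
        conf.get("myapp.bulkload.staging.dir", "/tmp/bulkload"));
    System.out.println("staging dir: " + staging);
  }
}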

[1] http://hbase.apache.org/book.html#hbase.secure.bulkload