hbase-user mailing list archives

From Ben Roling <ben.rol...@gmail.com>
Subject Re: Pattern for Bulk Loading to Remote HBase Cluster
Date Fri, 10 Mar 2017 20:34:30 GMT
> If I understand your question, you are asking how to completebulkload
> files which are on cluster1 into cluster2 without copying them to
> cluster2. The answer is: with the existing code it's not possible.

No, this isn't quite my question.  I understand completebulkload cannot
load files on cluster1 without copying them.  My question is how the job
producing the files can (or should) know to write them to cluster2 instead
of cluster1.  If the job produces them on cluster2, then completebulkload
can move them instead of copying them.

The question boils down to how cluster environmental knowledge gets
shared between client and server.  In the case of a bulk load, you have
some client (user) code that prepares StoreFiles and invokes
completebulkload with pointers to those files, and you have the
RegionServers that take those files and move or copy them into the HBase
storage directory on the HDFS filesystem HBase is configured to run on top
of.

Ideally the client and server should each have the minimum possible
knowledge about one another.  For example, from the client perspective, the
minimum knowledge is to know the ZooKeeper quorum addresses.  With this,
the client can connect to and interact with HBase.
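
To make that concrete, here is a minimal sketch of a client that knows
nothing about the cluster beyond the quorum (the hostnames and table name
are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;

public class MinimalClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // The only environmental knowledge the client strictly needs:
    conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com");
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("my_table"))) {
      // ... read from and write to the table as usual ...
    }
  }
}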

In the case of a bulk load, some filesystem knowledge is required.  In most
cases, folks operate in single cluster environments where HBase and other
things share the same HDFS filesystem and as such you don't even really
have to think about it.  Really though, it is best to remember that there
are two filesystems:

1) the client's filesystem -- the filesystem to which the client writes
files.  Often the client writes files to its user directory on the
default filesystem where its jobs run.

2) the server's filesystem -- where HBase stores its files

The client and server filesystems need not be the same.  In the remote
HBase bulk load scenario I describe, clearly they are not.

What I would like is a mechanism for the server to inform the client of
the server's filesystem.  With this knowledge, a client doing bulk loads
can choose to (or perhaps automatically) write files directly to the
server's filesystem.

I know I can manually impart knowledge of the server's filesystem onto
the client by looking it up and configuring it into the client, but I
would prefer for this knowledge to flow from the server to the client.
That way the client isn't making its own assumption about what the
server's filesystem is, and it gives maximum flexibility: for example,
server administrators could move the server to a different filesystem
without breaking clients.
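
For completeness, the manual version looks something like the following
(the filesystem URI is hypothetical; it is exactly the piece of knowledge
I would rather not hard-wire into the client):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ManualBulkOutput {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // The server's filesystem, looked up out of band and hard-wired here:
    FileSystem serverFs = FileSystem.get(URI.create("hdfs://cluster2"), conf);
    // Qualify the output path against the server's filesystem so that
    // completebulkload can move the files instead of copying them:
    Path bulkOutput = serverFs.makeQualified(new Path("/tmp/bulk-output"));
    System.out.println("Write StoreFiles to " + bulkOutput);
  }
}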

On Fri, Mar 10, 2017 at 12:08 AM ashish singhi <ashish.singhi@huawei.com>
wrote:

> If I understand your question, you are asking how to completebulkload
> files which are on cluster1 into cluster2 without copying them to
> cluster2. The answer is: with the existing code it's not possible.
>
> Bq. How do I choose hdfs://storefile-outputdir in a way that does not
> perform an extra copy operation when completebulkload is invoked, without
> assuming knowledge of HBase server implementation details?
>
> You can configure the output dir to the remote cluster's active NameNode
> IP, so that the output of importtsv is written there, and then use
> completebulkload in the remote cluster, specifying this output dir path
> as its argument.
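>
> For example, something like this (the NameNode host and port are made
> up):
>
> bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
> -Dimporttsv.columns=a,b,c
> -Dimporttsv.bulk.output=hdfs://active-nn.cluster2:8020/tmp/storefile-outputdir
> <tablename> <hdfs-data-inputdir>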
>
> Bq. In essence, how does my client application know that it should write to
> hdfs://cluster2 even though the application is running in a context where
> fs.defaultFs is hdfs://cluster1?
>
> If you are talking about importtsv, then it reads the URI from the path
> and connects to the respective NN. If you use the nameservice name in the
> path instead of the active NN IP, then you may have to write your own
> code, something similar to importtsv, where you construct a remote
> cluster configuration object and use it to write the output there. You
> can refer to HBASE-13153 to understand it much better.
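>
> A rough sketch of that idea (the nameservice and path are made up, and
> it assumes the client configuration also carries the HA nameservice
> settings for cluster2):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hbase.HBaseConfiguration;
>
> Configuration remoteConf = HBaseConfiguration.create();
> remoteConf.set("fs.defaultFS", "hdfs://cluster2"); // remote nameservice
> FileSystem remoteFs = FileSystem.get(remoteConf);
> Path output = remoteFs.makeQualified(new Path("/tmp/storefile-outputdir"));
> // hand 'output' to the job, e.g. FileOutputFormat.setOutputPath(job, output)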
>
> -----Original Message-----
> From: Ben Roling [mailto:ben.roling@gmail.com]
> Sent: 09 March 2017 19:53
> To: user@hbase.apache.org
> Subject: Re: Pattern for Bulk Loading to Remote HBase Cluster
>
> I'm not sure you understand my question.  Or perhaps I just don't quite
> understand yours?
>
> I'm not using importtsv.  If I were, and I were using the form that
> prepares StoreFiles for completebulkload, then my question would be: how
> do I (generically, as an application acting as an HBase client and using
> importtsv to load data) choose the path to which I write the StoreFiles?
>
> The following is an example of importtsv from the documentation:
>
> bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
> -Dimporttsv.columns=a,b,c
> -Dimporttsv.bulk.output=hdfs://storefile-outputdir <tablename>
> <hdfs-data-inputdir>
>
> How do I choose hdfs://storefile-outputdir in a way that does not perform
> an extra copy operation when completebulkload is invoked, without assuming
> knowledge of HBase server implementation details?
>
> In essence, how does my client application know that it should write to
> hdfs://cluster2 even though the application is running in a context where
> fs.defaultFs is hdfs://cluster1?
>
> How does the HBase installation share this information with client
> applications?
>
> I know I can just go dig into the hbase-site.xml on a RegionServer and
> figure this out (such as by looking at "hbase.rootdir" there), but my
> question is how to do it from the perspective of a generic HBase client
> application?
>
> On Wed, Mar 8, 2017 at 11:13 PM ashish singhi <ashish.singhi@huawei.com>
> wrote:
>
> > Hi,
> >
> > Did you try giving the importtsv output path to the remote HDFS?
> >
> > Regards,
> > Ashish
> >
> > -----Original Message-----
> > From: Ben Roling [mailto:ben.roling@gmail.com]
> > Sent: 09 March 2017 03:22
> > To: user@hbase.apache.org
> > Subject: Pattern for Bulk Loading to Remote HBase Cluster
> >
> > My organization is looking at making some changes that would introduce
> > HBase bulk loads that write into a remote cluster.  Today our bulk
> > loads write to a local HBase.  By local, I mean the home directory of
> > the user preparing and executing the bulk load is on the same HDFS
> > filesystem as the HBase cluster.  In the remote cluster case, the
> > HBase being loaded to will be on a different HDFS filesystem.
> >
> > The thing I am wondering about is the best pattern for determining
> > where the job preparing the bulk load should write the HFiles.
> > Typical examples write the HFiles somewhere in the user's home
> > directory.
> > When HBase is local, that works perfectly well.  With remote HBase, it
> > can work, but results in writing the files twice: once from the
> > preparation job and a second time by the RegionServer when it reacts
> > to the bulk load by copying the HFiles into the filesystem it is running
> > on.
> >
> > Ideally the preparation job would have some mechanism to know where to
> > write the files such that they are initially written on the same
> > filesystem as HBase itself.  This way the bulk load can simply move
> > them into the HBase storage directory, as happens when bulk loading to
> > a local cluster.
> >
> > I've considered a pattern where the bulk load preparation job reads
> > the hbase.rootdir property and pulls the filesystem off of that.
> > Then, it sticks the output in some directory (e.g. /tmp) on that same
> > filesystem.
> > I'm inclined to think that hbase.rootdir should only be considered a
> > server-side property and as such I shouldn't expect it to be present
> > in client configuration.  Under that assumption, this isn't really a
> > workable strategy.
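> >
> > For concreteness, that pattern would look something like this
> > (untested sketch; it only works if hbase.rootdir happens to be
> > present in the client-side configuration):
> >
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.hadoop.hbase.HBaseConfiguration;
> >
> > Configuration conf = HBaseConfiguration.create();
> > Path rootDir = new Path(conf.get("hbase.rootdir"));
> > FileSystem hbaseFs = rootDir.getFileSystem(conf);
> > Path output = hbaseFs.makeQualified(new Path("/tmp/bulk-staging"));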
> >
> > It feels like HBase should have a mechanism for sharing a staging
> > directory with clients doing bulk loads.  Doing some searching, I ran
> > across "hbase.bulkload.staging.dir", but my impression is that its
> > intent does not exactly align with mine.  I've read about it here [1].
> > It seems the idea is that users prepare HFiles in their own directory,
> > then SecureBulkLoad moves them to "hbase.bulkload.staging.dir".  A
> > move like that isn't really a move when dealing with a remote HBase
> > cluster.  Instead it is a copy.  A question would be why doesn't the
> > job just write the files to "hbase.bulkload.staging.dir" initially and
> > skip the extra step of moving them?
> >
> > I've been inclined to invent my own application-specific Hadoop
> > property to communicate an HBase-local staging directory to my bulk
> > load preparation jobs.  I don't feel entirely good about that idea
> > though.  I'm curious to hear experiences or opinions from others.
> > Should I have my bulk load prep jobs look at "hbase.rootdir" or
> > "hbase.bulkload.staging.dir" and make sure those get propagated to
> > client configuration?  Is there some other mechanism that already
> > exists for clients to discover an HBase-local directory to write the
> > files?
> >
> > [1] http://hbase.apache.org/book.html#hbase.secure.bulkload
> >
>
