hbase-user mailing list archives

From Daniel Jeliński <djelins...@gmail.com>
Subject Re: HBase as a file repository
Date Fri, 31 Mar 2017 04:25:40 GMT
Thank you Ted for your response.

I have read that part of the HBase book. It never explained why objects over
10MB are no good, and did not suggest an alternative storage medium for

I have also read this:
And yet I'm trying to put 36TB on a machine. I certainly hope that the
number of region servers is the only real limiter to this.

I tried putting files larger than 1MB on HDFS, which has a streaming API.
The datanodes started complaining about having too many blocks; they seem
to tolerate up to 500k blocks, which means that the average block size has
to be around 72MB (36TB / 500k) to fully utilize the cluster and avoid
complaining datanodes.
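
The 72MB figure is just capacity divided by the tolerated block count. A minimal sketch of that arithmetic, assuming decimal units (1TB = 10^12 bytes) and treating 500k as the per-datanode limit observed above; the class and method names are made up for illustration:

```java
// Sketch: average block size needed to stay under a datanode's block-count limit.
final class BlockSizeEstimate {
    /** Smallest average block size (bytes) that keeps the block count <= maxBlocks. */
    static long requiredAverageBlockBytes(long capacityBytes, long maxBlocks) {
        // Ceiling division, so rounding never pushes the count over the limit.
        return (capacityBytes + maxBlocks - 1) / maxBlocks;
    }

    public static void main(String[] args) {
        long capacityBytes = 36_000_000_000_000L; // 36TB per machine (decimal units, assumed)
        long maxBlocks = 500_000L;                // block count the datanodes seemed to tolerate
        long bytes = requiredAverageBlockBytes(capacityBytes, maxBlocks);
        System.out.println(bytes / 1_000_000 + " MB"); // prints "72 MB"
    }
}
```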

On the surface it seems that I should conclude that HBase/HDFS is no good
as a NAS replacement and move on. But then, the HBase API currently seems to
be the only thing getting in my way.

I checked the async HBase projects, but apparently they focus on running
requests in the background rather than returning results earlier. Searching
Google for "HBase streaming" returns only references to Spark.
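
To make that distinction concrete, here is a minimal sketch in plain Java that uses no HBase API at all. CountingSource is a hypothetical stand-in for a stored cell, and both methods report how many bytes had to be transferred before the caller could act on any of them. A background future still hands over the value only after the last byte, while a streaming read can proceed after the first chunk:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.concurrent.CompletableFuture;

final class FirstByteDemo {
    /** Hypothetical stand-in for a stored cell; counts bytes handed out so far. */
    static final class CountingSource extends InputStream {
        private final InputStream in;
        private long transferred = 0;

        CountingSource(byte[] data) { this.in = new ByteArrayInputStream(data); }

        long transferred() { return transferred; }

        @Override public int read() throws IOException {
            int b = in.read();
            if (b >= 0) transferred++;
            return b;
        }

        @Override public int read(byte[] buf, int off, int len) throws IOException {
            int n = in.read(buf, off, len);
            if (n > 0) transferred += n;
            return n;
        }
    }

    /** Whole-value style: runs in the background, but completes only after the last byte. */
    static long bytesBeforeFirstByteWholeValue(byte[] cell) {
        CountingSource src = new CountingSource(cell);
        CompletableFuture<Void> done = CompletableFuture.runAsync(() -> {
            byte[] buf = new byte[8192];
            try {
                while (src.read(buf, 0, buf.length) != -1) { /* drain everything */ }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        done.join(); // the caller can act only once the future completes
        return src.transferred();
    }

    /** Streaming style: the caller can act as soon as the first chunk arrives. */
    static long bytesBeforeFirstByteStreaming(byte[] cell, int chunkSize) {
        CountingSource src = new CountingSource(cell);
        try {
            src.read(new byte[chunkSize], 0, chunkSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return src.transferred();
    }

    public static void main(String[] args) {
        byte[] cell = new byte[1 << 20]; // 1MB cell, as in the original question
        System.out.println(bytesBeforeFirstByteWholeValue(cell));           // whole cell first
        System.out.println(bytesBeforeFirstByteStreaming(cell, 64 * 1024)); // one chunk first
    }
}
```

The 64KB chunk size is an arbitrary choice for the illustration, not something taken from HBase.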

HBase JIRA has a few apparently related issues:
https://issues.apache.org/jira/browse/HBASE-17721 is pretty fresh with no
development yet, and https://issues.apache.org/jira/browse/HBASE-13467
seems to have died already.

I captured the network traffic between the client and the region server
when I requested one cell, and writing a custom client seems easy enough.
Are there any reasons other than the API that justify the 10MB limit on

2017-03-31 0:03 GMT+02:00 Ted Yu <yuzhihong@gmail.com>:

> Have you read:
> http://hbase.apache.org/book.html#hbase_mob
> In particular:
> When using MOBs, ideally your objects will be between 100KB and 10MB
> Cheers
> On Thu, Mar 30, 2017 at 1:01 PM, Daniel Jeliński <djelinski1@gmail.com>
> wrote:
> > Hello,
> > I'm evaluating HBase as a cheaper replacement for NAS as a file storage
> > medium. To that end I have a cluster of 5 machines, 36TB HDD each; I'm
> > planning to initially store ~240 million files of size 1KB-100MB, total
> > size 30TB. Currently I'm storing each file under an individual column, and
> > I group related documents in the same row. The files from the same row will
> > be served one at a time, but updated/deleted together.
> >
> > Loading the data to the cluster went pretty well; I enabled MOB on the
> > table and have ~50 regions per machine. Writes to the table are done by an
> > automated process, and cluster's performance in that area is more than
> > sufficient. On the other hand, reads are interactive, as the files are
> > served to human users over HTTP.
> >
> > Now. HBase Get in Java API is an atomic operation in the sense that it does
> > not complete until all data is retrieved from the server. It takes 100 ms
> > to retrieve a 1MB cell (file), and only after retrieving I am able to start
> > serving it to the end user. For larger cells the wait time is even longer,
> > and response times longer than 100 ms are bad for user experience. I would
> > like to start streaming the file over HTTP as soon as possible.
> >
> > What's the recommended approach to avoid or reduce the delay between when
> > HBase starts sending the response and when the application can act on it?
> > Thanks,
> > Daniel
> >
