hbase-user mailing list archives

From Jingcheng Du <dujin...@gmail.com>
Subject Re: HBase as a file repository
Date Fri, 31 Mar 2017 06:22:57 GMT
Hi Daniel,

I think it is because of the memory burden on both clients and servers.
A large row requires a large HFile block, which puts more pressure on the
block cache if that data block is cached. During scans, both region servers
and clients also need a lot of memory to buffer the rows.
As you know, HBase uses the memstore to hold data before flushing it to
disk. Under a heavy write load, large rows lead to more flushes and
compactions than small ones.
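
To make the flush point concrete, here is a rough sketch of the arithmetic.
It assumes the memstore flush threshold (hbase.hregion.memstore.flush.size)
is at its usual default of 128 MB and ignores key and per-cell overhead, so
treat the numbers as estimates only. The class and method names are made up
for illustration.

```java
public class FlushEstimate {
    // Assumed value of hbase.hregion.memstore.flush.size (128 MB);
    // check the setting on your own cluster.
    static final long FLUSH_SIZE_BYTES = 128L * 1024 * 1024;

    /** Whole cells of the given size that fit in the memstore before it reaches the flush threshold. */
    static long cellsPerFlush(long cellSizeBytes) {
        return FLUSH_SIZE_BYTES / cellSizeBytes;
    }

    public static void main(String[] args) {
        long kb = 1024L;
        long mb = 1024L * kb;
        long[] cellSizes = {100 * kb, 1 * mb, 10 * mb, 100 * mb};
        for (long size : cellSizes) {
            System.out.printf("cell size %d KB -> about %d cells per flush%n",
                    size / kb, cellsPerFlush(size));
        }
    }
}
```

Under these assumptions a region flushes after roughly 1310 cells of 100KB,
but after only about 12 cells of 10MB, and after a single 100MB cell.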

Actually, there is no hard limit in the code on the data size, so you can
store data larger than 10MB. You can try it and see if it works for you.

Regards,
Jingcheng

2017-03-31 12:25 GMT+08:00 Daniel Jeliński <djelinski1@gmail.com>:

> Thank you Ted for your response.
>
> I have read that part of HBase book. It never explained why objects over
> 10MB are no good, and did not suggest an alternative storage medium for
> these.
>
> I have also read this:
> http://hbase.apache.org/book.html#regionserver_sizing_rules_of_thumb
> And yet I'm trying to put 36TB on a machine. I certainly hope that the
> number of region servers is the only real limiter to this.
>
> I tried putting files larger than 1MB on HDFS, which has a streaming API.
> Datanodes started complaining about too large number of blocks; they seem
> to tolerate up to 500k blocks, which means that average block size has to
> be around 72MB to fully utilize the cluster and avoid complaining
> datanodes.
>
> On the surface it seems that I should conclude that HBase/HDFS is no good
> for NAS replacement and move on. But then, the HBase API currently seems to
> be the only thing getting in my way.
>
> I checked async HBase projects, but apparently they're focused on running
> the requests in background, rather than returning results earlier. HBase
> streaming on Google returns just references to Spark.
>
> HBase JIRA has a few apparently related issues:
> https://issues.apache.org/jira/browse/HBASE-17721 is pretty fresh with no
> development yet, and https://issues.apache.org/jira/browse/HBASE-13467
> seems to have died already.
>
> I captured the network traffic between the client and the region server
> when I requested one cell, and writing a custom client seems easy enough.
> Are there any reasons other than the API that justify the 10MB limit on
> MOBs?
> Thanks,
> Daniel
>
>
>
> 2017-03-31 0:03 GMT+02:00 Ted Yu <yuzhihong@gmail.com>:
>
> > Have you read:
> > http://hbase.apache.org/book.html#hbase_mob
> >
> > In particular:
> >
> > When using MOBs, ideally your objects will be between 100KB and 10MB
> >
> > Cheers
> >
> > On Thu, Mar 30, 2017 at 1:01 PM, Daniel Jeliński <djelinski1@gmail.com>
> > wrote:
> >
> > > Hello,
> > > I'm evaluating HBase as a cheaper replacement for NAS as a file storage
> > > medium. To that end I have a cluster of 5 machines, 36TB HDD each; I'm
> > > planning to initially store ~240 million files of size 1KB-100MB, total
> > > size 30TB. Currently I'm storing each file under an individual column, and
> > > I group related documents in the same row. The files from the same row will
> > > be served one at a time, but updated/deleted together.
> > >
> > > Loading the data to the cluster went pretty well; I enabled MOB on the
> > > table and have ~50 regions per machine. Writes to the table are done by an
> > > automated process, and cluster's performance in that area is more than
> > > sufficient. On the other hand, reads are interactive, as the files are
> > > served to human users over HTTP.
> > >
> > > Now. HBase Get in Java API is an atomic operation in the sense that it does
> > > not complete until all data is retrieved from the server. It takes 100 ms
> > > to retrieve a 1MB cell (file), and only after retrieving I am able to start
> > > serving it to the end user. For larger cells the wait time is even longer,
> > > and response times longer than 100 ms are bad for user experience. I would
> > > like to start streaming the file over HTTP as soon as possible.
> > >
> > > What's the recommended approach to avoid or reduce the delay between when
> > > HBase starts sending the response and when the application can act on it?
> > > Thanks,
> > > Daniel
> > >
> >
>
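
For reference, the figures quoted in this thread can be checked with a short
calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes),
takes the 500k-block tolerance per datanode from Daniel's message as given,
and ignores replication. The class and method names are hypothetical.

```java
public class BlockSizeEstimate {
    static final long TB = 1_000_000_000_000L; // decimal terabyte, an assumption

    /** Smallest average block size that fills the given capacity without exceeding maxBlocks. */
    static long minAverageBlockBytes(long capacityBytes, long maxBlocks) {
        return capacityBytes / maxBlocks;
    }

    /** Average file size for a data set of totalBytes spread over fileCount files. */
    static long averageFileBytes(long totalBytes, long fileCount) {
        return totalBytes / fileCount;
    }

    public static void main(String[] args) {
        // 36TB per machine, about 500k blocks tolerated per datanode.
        long block = minAverageBlockBytes(36 * TB, 500_000L);
        // 30TB in total across roughly 240 million files.
        long file = averageFileBytes(30 * TB, 240_000_000L);
        System.out.println("minimum average block size: " + block / 1_000_000 + " MB");
        System.out.println("average file size: " + file / 1_000 + " KB");
    }
}
```

This reproduces the 72MB average block size from the thread and gives an
average file size of about 125KB for the planned data set, which is why
storing one file per HDFS block runs into the block-count limit.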
