hbase-user mailing list archives

From Vladimir Rodionov <vladrodio...@gmail.com>
Subject Re: HBase as a file repository
Date Fri, 31 Mar 2017 17:25:59 GMT
Use HBase as a file system metadata store (index), keep the files as large
blobs on HDFS, and run a periodic compaction/cleaning M/R job to purge
deleted files. You can even keep multiple versions of the files.
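A minimal sketch of this pattern in plain Python. Dicts stand in for the HBase metadata table and the HDFS blob store, and every name here (`index_file`, `compact`, the `meta:*` columns) is hypothetical, chosen only to illustrate the design:

```python
# Sketch: HBase row = file metadata, HDFS blob = concatenated file bytes.
# Dicts stand in for the HBase table and HDFS; all names are illustrative.

def index_file(meta_table, blob_store, blob_name, path, data, version=1):
    """Append a file's bytes to a blob and record its location in the index."""
    blob = blob_store.setdefault(blob_name, bytearray())
    offset = len(blob)
    blob.extend(data)
    # One row per file path; columns point into the blob.
    meta_table[path] = {
        "meta:blob": blob_name,
        "meta:offset": offset,
        "meta:length": len(data),
        "meta:version": version,
        "meta:deleted": False,
    }

def read_file(meta_table, blob_store, path):
    """Look up a file in the index and slice its bytes out of the blob."""
    m = meta_table[path]
    blob = blob_store[m["meta:blob"]]
    return bytes(blob[m["meta:offset"]:m["meta:offset"] + m["meta:length"]])

def compact(meta_table, blob_store):
    """Periodic cleaning job: rewrite blobs, dropping deleted files' bytes."""
    new_store = {}
    for path, m in list(meta_table.items()):
        if m["meta:deleted"]:
            del meta_table[path]
            continue
        data = read_file(meta_table, blob_store, path)  # read before re-pointing
        blob = new_store.setdefault(m["meta:blob"], bytearray())
        m["meta:offset"], m["meta:length"] = len(blob), len(data)
        blob.extend(data)
    blob_store.clear()
    blob_store.update(new_store)
```

In a real deployment the blob writes would go through the HDFS streaming API and the index puts through the HBase client; only the layout, not the storage calls, is modeled here.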

-Vlad

On Thu, Mar 30, 2017 at 11:22 PM, Jingcheng Du <dujingch@gmail.com> wrote:

> Hi Daniel,
>
> I think it is because of the memory burden on both clients and servers.
> If we have a very large row, we have to have an HFile block of a
> correspondingly large size, which puts a heavy burden on the block cache
> if that data block is cached. And in scanning, both region servers and
> clients need a lot of memory to hold such rows.
> As you know, HBase uses the memstore to buffer data before flushing it to
> disk. A heavy write load leads to more flushes and compactions with large
> rows than with small ones.
>
> Actually we don't have a hard limit in the code for the data size; you can
> store cells larger than 10MB. You can try it and see if it works for you.
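Concretely, the client-side cap on cell size is just a configuration property, `hbase.client.keyvalue.maxsize` (10485760 bytes by default; the exact property names and defaults vary by HBase version, so check the docs for your release). It could be raised in `hbase-site.xml`, e.g.:

```xml
<!-- hbase-site.xml: raise the client-side cell size cap (default 10 MB).
     Property names and defaults vary by HBase version. -->
<property>
  <name>hbase.client.keyvalue.maxsize</name>
  <value>104857600</value> <!-- 100 MB -->
</property>
```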
>
> Regards,
> Jingcheng
>
> 2017-03-31 12:25 GMT+08:00 Daniel Jeliński <djelinski1@gmail.com>:
>
> > Thank you Ted for your response.
> >
> > I have read that part of the HBase book. It never explains why objects
> > over 10MB are no good, and does not suggest an alternative storage medium
> > for them.
> >
> > I have also read this:
> > http://hbase.apache.org/book.html#regionserver_sizing_rules_of_thumb
> > And yet I'm trying to put 36TB on a machine. I certainly hope that the
> > number of region servers is the only real limiter to this.
> >
> > I tried putting files larger than 1MB on HDFS, which has a streaming API.
> > Datanodes started complaining about having too many blocks; they seem to
> > tolerate up to 500k blocks each, which means that the average block size
> > has to be around 72MB to fully utilize the cluster and keep the datanodes
> > quiet.
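The arithmetic behind that 72MB figure, as a quick sanity check (36TB per node and the ~500k-block tolerance are the numbers from above):

```python
# Sanity check of the average-block-size arithmetic above.
disk_per_node = 36 * 10**12          # 36 TB of HDD per datanode
max_blocks_per_node = 500_000        # blocks a datanode seems to tolerate

avg_block_size = disk_per_node / max_blocks_per_node
print(avg_block_size / 10**6)        # -> 72.0 (MB)
```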
> >
> > On the surface it seems that I should conclude that HBase/HDFS is no good
> > for NAS replacement and move on. But then, the HBase API currently seems
> > to be the only thing getting in my way.
> >
> > I checked the async HBase projects, but apparently they're focused on
> > running requests in the background, rather than returning results
> > earlier. Googling for HBase streaming returns only references to Spark.
> >
> > HBase JIRA has a few apparently related issues:
> > https://issues.apache.org/jira/browse/HBASE-17721 is pretty fresh with no
> > development yet, and https://issues.apache.org/jira/browse/HBASE-13467
> > seems to have died already.
> >
> > I captured the network traffic between the client and the region server
> > when I requested one cell, and writing a custom client seems easy enough.
> > Are there any reasons other than the API that justify the 10MB limit on
> > MOBs?
> > Thanks,
> > Daniel
> >
> >
> >
> > 2017-03-31 0:03 GMT+02:00 Ted Yu <yuzhihong@gmail.com>:
> >
> > > Have you read:
> > > http://hbase.apache.org/book.html#hbase_mob
> > >
> > > In particular:
> > >
> > > When using MOBs, ideally your objects will be between 100KB and 10MB
> > >
> > > Cheers
> > >
> > > On Thu, Mar 30, 2017 at 1:01 PM, Daniel Jeliński <djelinski1@gmail.com>
> > > wrote:
> > >
> > > > Hello,
> > > > I'm evaluating HBase as a cheaper replacement for NAS as a file
> > > > storage medium. To that end I have a cluster of 5 machines, 36TB HDD
> > > > each; I'm planning to initially store ~240 million files of size
> > > > 1KB-100MB, total size 30TB. Currently I'm storing each file under an
> > > > individual column, and I group related documents in the same row. The
> > > > files from the same row will be served one at a time, but
> > > > updated/deleted together.
> > > >
> > > > Loading the data to the cluster went pretty well; I enabled MOB on
> > > > the table and have ~50 regions per machine. Writes to the table are
> > > > done by an automated process, and the cluster's performance in that
> > > > area is more than sufficient. On the other hand, reads are
> > > > interactive, as the files are served to human users over HTTP.
> > > >
> > > > Now, HBase Get in the Java API is an atomic operation in the sense
> > > > that it does not complete until all data is retrieved from the
> > > > server. It takes 100 ms to retrieve a 1MB cell (file), and only after
> > > > retrieving it am I able to start serving it to the end user. For
> > > > larger cells the wait time is even longer, and response times longer
> > > > than 100 ms are bad for user experience. I would like to start
> > > > streaming the file over HTTP as soon as possible.
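One common workaround (not proposed anywhere in this thread, and every name below is hypothetical) is to store each file as a series of fixed-size chunk cells in one row, so the HTTP response can begin as soon as the first chunk's Get returns. A pure-Python sketch of the chunking scheme, with `get_cell` standing in for a per-chunk HBase Get:

```python
# Hypothetical chunked layout: file bytes split across columns
# f:000000, f:000001, ... so a reader can fetch and forward one
# 1 MB chunk at a time instead of waiting for one huge cell.
CHUNK_SIZE = 1024 * 1024  # 1 MB per cell

def to_chunks(data, chunk_size=CHUNK_SIZE):
    """Split file bytes into (qualifier, chunk) pairs for one HBase row."""
    return [
        ("f:%06d" % i, data[off:off + chunk_size])
        for i, off in enumerate(range(0, len(data), chunk_size))
    ]

def stream_chunks(row, qualifiers, get_cell):
    """Yield chunks in order; get_cell(row, qualifier) stands in for a Get."""
    for q in qualifiers:
        # The caller can flush each chunk to the HTTP socket immediately,
        # overlapping the next Get with the transfer of the previous chunk.
        yield get_cell(row, q)
```

The trade-off is more cells per row and an extra metadata lookup for the chunk count, in exchange for a first byte after roughly one chunk's latency rather than the whole file's.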
> > > >
> > > > What's the recommended approach to avoid or reduce the delay between
> > > > when HBase starts sending the response and when the application can
> > > > act on it?
> > > > Thanks,
> > > > Daniel
> > > >
> > >
> >
>
