hbase-user mailing list archives

From Jingcheng Du <dujin...@gmail.com>
Subject Re: HBase as a file repository
Date Wed, 05 Apr 2017 02:20:14 GMT
>> Do we have htrace probes on that codepath? Do we do more seeks than we
>> expect?
Thanks Mikhail. We don't have htrace probes on that code path at the moment.
On a read, HBase first reads the MOB file path from the region and then reads
the MOB data from that MOB file. By default, caching is disabled for the data
blocks in MOB files. That makes two HFile seeks in total. GC is another
contributing factor.
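
To picture the read path described above, here is a minimal toy model. It is
only a sketch and not HBase code: the class MobReadSketch, its two maps and
the seek counter are hypothetical stand-ins, and a real read goes through the
region's store files and HFile readers rather than in-memory maps.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Toy model of the two-step MOB read; all names are hypothetical, not HBase APIs. */
class MobReadSketch {
    // Stand-in for the region's store files: row key -> MOB file name (the reference).
    private final Map<String, String> regionStore = new HashMap<>();
    // Stand-in for the MOB files themselves: MOB file name -> cell value.
    private final Map<String, byte[]> mobFiles = new HashMap<>();
    // Counts simulated HFile seeks.
    private int seeks = 0;

    void put(String rowKey, String mobFileName, byte[] value) {
        regionStore.put(rowKey, mobFileName);
        mobFiles.put(mobFileName, value);
    }

    byte[] get(String rowKey) {
        seeks++; // seek 1: read the MOB file reference from the region
        String mobFileName = regionStore.get(rowKey);
        if (mobFileName == null) {
            return null; // no such row, so the MOB file is never touched
        }
        seeks++; // seek 2: read the actual value from the MOB file
        return mobFiles.get(mobFileName);
    }

    int seeks() {
        return seeks;
    }

    public static void main(String[] args) {
        MobReadSketch sketch = new MobReadSketch();
        sketch.put("row-1", "mobfile-a", "payload".getBytes(StandardCharsets.UTF_8));
        byte[] value = sketch.get("row-1");
        System.out.println(new String(value, StandardCharsets.UTF_8)
                + " seeks=" + sketch.seeks());
    }
}
```

Each successful get in this model costs two seeks, which matches the two HFile
seeks per MOB read mentioned above. The model ignores block caching and GC.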


2017-04-05 3:44 GMT+08:00 Mikhail Antonov <olorinbant@gmail.com>:

> Unfamiliar with the MOB codebase, but reading: "It takes 100 ms
> to retrieve a 1MB cell (file), and only after retrieving I am able to start
> serving it to the end user". Is that avg, p90, p99?
>
> Do we have htrace probes on that codepath? Do we do more seeks than we
> expect?
>
> -Mikhail
>
> On Tue, Apr 4, 2017 at 11:23 AM, Stack <stack@duboce.net> wrote:
>
> > On Thu, Mar 30, 2017 at 9:25 PM, Daniel Jeliński <djelinski1@gmail.com>
> > wrote:
> >
> > > Thank you Ted for your response.
> > >
> > > I have read that part of HBase book. It never explained why objects over
> > > 10MB are no good, and did not suggest an alternative storage medium for
> > > these.
> > >
> > >
> > That's a hole. I filed HBASE-17875.
> >
> > The 10MB limit is a conservative upper bound. Bigger Cells will skirt
> > buffer pools so we'll do one-off allocations per read. The GC will
> > experience a shock, and so on.
> >
> >
> > > I have also read this:
> > > http://hbase.apache.org/book.html#regionserver_sizing_rules_of_thumb
> > > And yet I'm trying to put 36TB on a machine. I certainly hope that the
> > > number of region servers is the only real limiter to this.
> > >
> > >
> > In the refguide, the guidance is intentionally conservative. It is probably
> > also stale at this point. Most users/devs do not do the degree of PoC'ing
> > that you have. The recommendations are aimed more at them than at you.
> >
> >  ...
> >
> >
> > > I checked async HBase projects, but apparently they're focused on running
> > > the requests in background, rather than returning results earlier.
> > > Searching Google for "HBase streaming" returns just references to Spark.
> > >
> > > HBase JIRA has a few apparently related issues:
> > > https://issues.apache.org/jira/browse/HBASE-17721 is pretty fresh with no
> > > development yet, and https://issues.apache.org/jira/browse/HBASE-13467
> > > seems to have died already.
> > >
> > >
> > I pinged on HBASE-13467. My understanding was that this project was
> > underway...
> >
> > St.Ack
> >
> >
> >
> >
> > > I captured the network traffic between the client and the region server
> > > when I requested one cell, and writing a custom client seems easy enough.
> > > Are there any reasons other than the API that justify the 10MB limit on
> > > MOBs?
> > > Thanks,
> > > Daniel
> > >
> > >
> > >
> > > 2017-03-31 0:03 GMT+02:00 Ted Yu <yuzhihong@gmail.com>:
> > >
> > > > Have you read:
> > > > http://hbase.apache.org/book.html#hbase_mob
> > > >
> > > > In particular:
> > > >
> > > > When using MOBs, ideally your objects will be between 100KB and 10MB
> > > >
> > > > Cheers
> > > >
> > > > On Thu, Mar 30, 2017 at 1:01 PM, Daniel Jeliński <djelinski1@gmail.com>
> > > > wrote:
> > > >
> > > > > Hello,
> > > > > I'm evaluating HBase as a cheaper replacement for NAS as a file storage
> > > > > medium. To that end I have a cluster of 5 machines, 36TB HDD each; I'm
> > > > > planning to initially store ~240 million files of size 1KB-100MB, total
> > > > > size 30TB. Currently I'm storing each file under an individual column, and
> > > > > I group related documents in the same row. The files from the same row will
> > > > > be served one at a time, but updated/deleted together.
> > > > >
> > > > > Loading the data to the cluster went pretty well; I enabled MOB on the
> > > > > table and have ~50 regions per machine. Writes to the table are done by an
> > > > > automated process, and the cluster's performance in that area is more than
> > > > > sufficient. On the other hand, reads are interactive, as the files are
> > > > > served to human users over HTTP.
> > > > >
> > > > > Now. HBase Get in Java API is an atomic operation in the sense that it does
> > > > > not complete until all data is retrieved from the server. It takes 100 ms
> > > > > to retrieve a 1MB cell (file), and only after retrieving I am able to start
> > > > > serving it to the end user. For larger cells the wait time is even longer,
> > > > > and response times longer than 100 ms are bad for user experience. I would
> > > > > like to start streaming the file over HTTP as soon as possible.
> > > > >
> > > > > What's the recommended approach to avoid or reduce the delay between when
> > > > > HBase starts sending the response and when the application can act on it?
> > > > > Thanks,
> > > > > Daniel
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Thanks,
> Michael Antonov
>
