hbase-user mailing list archives

From Jingcheng Du <dujin...@gmail.com>
Subject Re: HBase as a file repository
Date Wed, 05 Apr 2017 05:48:34 GMT
>>Assuming mob path is tiny compared to content, do we try to pin blob's metadata
in memory, so that their blocks don't get thrashed by actual MOB blob data
blocks?
Yes, we do. Block caching for the metadata (the file path reference cells) can
be set per scan, while caching for the MOB data blocks is controlled by a
separate setting.
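For reference, the cache of opened MOB file readers is tuned separately in hbase-site.xml; a sketch using the property names from the MOB section of the refguide (defaults shown; verify them against your HBase version):

```xml
<!-- Number of opened MOB file readers to cache per region server. -->
<property>
  <name>hbase.mob.file.cache.size</name>
  <value>1000</value>
</property>
<!-- How often (in seconds) idle readers are evicted from that cache. -->
<property>
  <name>hbase.mob.cache.evict.period</name>
  <value>3600</value>
</property>
```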

>>Is compaction not keeping up, increasing read amplification? A trace would
be useful to see where the time is spent.
In trunk, we don't archive the compacted files immediately after a compaction,
and we don't need to re-seek at that time, but the write lock taken after the
compaction might block new scanners for some time.
You are right, htrace is useful in such cases; I am considering adding it.
Thanks!
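The MOB read path discussed below (reference cell first, then the MOB file) can be sketched as pseudocode (a simplification, not actual code):

```
get(row, qualifier):
  ref_cell = seek region hfiles for (row, qualifier)  # seek 1: small reference cell
  mob_file = parse MOB file name from ref_cell value
  reader   = cached or newly opened reader for mob_file
  value    = seek reader to the cell in mob_file      # seek 2: block cache off by default
  return value
```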

2017-04-05 12:09 GMT+08:00 Mikhail Antonov <olorinbant@gmail.com>:

> For use cases of that kind we probably shouldn't expect a very high block cache
> hit rate for data blocks,
> but assuming that bloom and index blocks are in memory, 100ms (if that's
> not for high percentile) is still a bit high.
>
> >>"it reads the MOB file path from one region"
>
> Assuming mob path is tiny compared to content, do we try to pin blob's
> metadata in memory, so that their blocks
> don't get thrashed by actual MOB blob data blocks?
>
> Is compaction not keeping up, increasing read amplification? A trace would be
> useful to see where the time is spent.
>
>
> On Tue, Apr 4, 2017 at 7:20 PM, Jingcheng Du <dujingch@gmail.com> wrote:
>
> > >>Do we have htrace probes on that codepath? Do we do more seeks than we
> > expect?
> > Thanks Mikhail. We don't have htrace on that code path now.
> > In a read, it reads the MOB file path from one region and then reads the
> > MOB data from that MOB file, and by default the cache for the data blocks
> > in MOB files is disabled. Two hfile seeks in total. And GC is another
> > reason.
> >
> >
> > 2017-04-05 3:44 GMT+08:00 Mikhail Antonov <olorinbant@gmail.com>:
> >
> > > Unfamiliar with MOB codebase but reading.. " It takes 100 ms
> > > to retrieve a 1MB cell (file), and only after retrieving I am able to
> > start
> > > serving it to the end user".. Is that avg, p90, p99?
> > >
> > > Do we have htrace probes on that codepath? Do we do more seeks than we
> > > expect?
> > >
> > > -Mikhail
> > >
> > > On Tue, Apr 4, 2017 at 11:23 AM, Stack <stack@duboce.net> wrote:
> > >
> > > > On Thu, Mar 30, 2017 at 9:25 PM, Daniel Jeliński <djelinski1@gmail.com>
> > > > wrote:
> > > >
> > > > > Thank you Ted for your response.
> > > > >
> > > > > I have read that part of HBase book. It never explained why objects
> > > over
> > > > > 10MB are no good, and did not suggest an alternative storage medium
> > for
> > > > > these.
> > > > >
> > > > >
> > > > That's a hole. I filed HBASE-17875.
> > > >
> > > > The 10MB limit is a conservative upper bound. Bigger Cells will skirt
> > > > buffer pools so we'll do one-off allocations per read. The GC will
> > > > experience a shock, and so on.
> > > >
> > > >
> > > > > I have also read this:
> > > > > http://hbase.apache.org/book.html#regionserver_sizing_rules_of_thumb
> > > > > And yet I'm trying to put 36TB on a machine. I certainly hope that the
> > > > > number of region servers is the only real limiter to this.
> > > > >
> > > > >
> > > > In the refguide, the guidance is intentionally conservative. It is
> > > > probably also stale at this point. Most users/devs do not do the degree
> > > > of PoC'ing that you have. The recommendations are more for them than for
> > > > you.
> > > >
> > > >  ...
> > > >
> > > >
> > > > > I checked async HBase projects, but apparently they're focused on
> > > > > running the requests in background, rather than returning results
> > > > > earlier. HBase streaming on Google returns just references to Spark.
> > > > >
> > > > > HBase JIRA has a few apparently related issues:
> > > > > https://issues.apache.org/jira/browse/HBASE-17721 is pretty fresh with
> > > > > no development yet, and https://issues.apache.org/jira/browse/HBASE-13467
> > > > > seems to have died already.
> > > > >
> > > > >
> > > > I pinged on HBASE-13467. My understanding was that this project was
> > > > underway...
> > > >
> > > > St.Ack
> > > >
> > > >
> > > >
> > > >
> > > > > I captured the network traffic between the client and the region
> > > > > server when I requested one cell, and writing a custom client seems
> > > > > easy enough. Are there any reasons other than the API that justify
> > > > > the 10MB limit on MOBs?
> > > > > Thanks,
> > > > > Daniel
> > > > >
> > > > >
> > > > >
> > > > > 2017-03-31 0:03 GMT+02:00 Ted Yu <yuzhihong@gmail.com>:
> > > > >
> > > > > > Have you read:
> > > > > > http://hbase.apache.org/book.html#hbase_mob
> > > > > >
> > > > > > In particular:
> > > > > >
> > > > > > When using MOBs, ideally your objects will be between 100KB and 10MB
> > > > > >
> > > > > > Cheers
> > > > > >
> > > > > > On Thu, Mar 30, 2017 at 1:01 PM, Daniel Jeliński <djelinski1@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > > I'm evaluating HBase as a cheaper replacement for NAS as a file
> > > > > > > storage medium. To that end I have a cluster of 5 machines, 36TB
> > > > > > > HDD each; I'm planning to initially store ~240 million files of
> > > > > > > size 1KB-100MB, total size 30TB. Currently I'm storing each file
> > > > > > > under an individual column, and I group related documents in the
> > > > > > > same row. The files from the same row will be served one at a
> > > > > > > time, but updated/deleted together.
> > > > > > >
> > > > > > > Loading the data to the cluster went pretty well; I enabled MOB
> > > > > > > on the table and have ~50 regions per machine. Writes to the
> > > > > > > table are done by an automated process, and the cluster's
> > > > > > > performance in that area is more than sufficient. On the other
> > > > > > > hand, reads are interactive, as the files are served to human
> > > > > > > users over HTTP.
> > > > > > >
> > > > > > > Now. An HBase Get in the Java API is an atomic operation in the
> > > > > > > sense that it does not complete until all data is retrieved from
> > > > > > > the server. It takes 100 ms to retrieve a 1MB cell (file), and
> > > > > > > only after retrieving it am I able to start serving it to the end
> > > > > > > user. For larger cells the wait time is even longer, and response
> > > > > > > times longer than 100 ms are bad for user experience. I would
> > > > > > > like to start streaming the file over HTTP as soon as possible.
> > > > > > >
> > > > > > > What's the recommended approach to avoid or reduce the delay
> > > > > > > between when HBase starts sending the response and when the
> > > > > > > application can act on it?
> > > > > > > Thanks,
> > > > > > > Daniel
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks,
> > > Michael Antonov
> > >
> >
>
>
>
> --
> Thanks,
> Michael Antonov
>
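A common workaround for the time-to-first-byte problem Daniel describes (not something this thread settled on) is to split each large file into fixed-size chunk cells, so the HTTP response can start after the first chunk arrives instead of after the whole MOB cell. The sketch below only shows the split/reassemble arithmetic in plain Java; the per-chunk cell layout (e.g. qualifiers "chunk_0000", "chunk_0001", ...) is a hypothetical naming scheme, not an HBase API:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedFileSketch {
    // Split a file's bytes into chunks of at most chunkSize bytes; each chunk
    // would be stored as its own cell and fetched with its own Get.
    public static List<byte[]> split(byte[] data, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            byte[] chunk = new byte[len];
            System.arraycopy(data, off, chunk, 0, len);
            chunks.add(chunk);
        }
        return chunks;
    }

    // Reassemble on read; in the streaming case each chunk would be written to
    // the HTTP response as soon as it arrives rather than buffered like this.
    public static byte[] join(List<byte[]> chunks) {
        int total = 0;
        for (byte[] c : chunks) total += c.length;
        byte[] out = new byte[total];
        int off = 0;
        for (byte[] c : chunks) {
            System.arraycopy(c, 0, out, off, c.length);
            off += c.length;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] file = new byte[1_000_000];              // a 1MB "file"
        List<byte[]> chunks = split(file, 64 * 1024);   // 64KB chunks
        System.out.println(chunks.size());              // prints 16
        System.out.println(join(chunks).length);        // prints 1000000
    }
}
```

With 64KB chunks, the first byte can go out after roughly 1/16 of the transfer for a 1MB file, at the cost of extra Gets and client-side bookkeeping.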
