hbase-user mailing list archives

From Daniel Jeliński <djelins...@gmail.com>
Subject Re: HBase as a file repository
Date Wed, 05 Apr 2017 11:57:00 GMT
Thanks for all the responses!
Regarding the stats I collected, that was the average time seen from the
client side, measured by comparing results of System.nanoTime immediately
before and immediately after executing table.get(req).
However, one hard drive failed since the original test, causing very long
wait times. That drive was replaced, and I collected fresh stats.
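A minimal sketch of this kind of client-side measurement, for readers who want to reproduce it. The class and method names (CallTimer, Timed, time) are hypothetical and not part of HBase; only System.nanoTime and the blocking call being timed come from the description above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: measures wall-clock time of one blocking call, as seen by the client. */
final class CallTimer {

    /** The value returned by the timed call plus the elapsed time. */
    static final class Timed<T> {
        final T value;
        final long elapsedNanos;

        Timed(T value, long elapsedNanos) {
            this.value = value;
            this.elapsedNanos = elapsedNanos;
        }

        /** Elapsed time truncated to whole milliseconds. */
        long elapsedMillis() {
            return TimeUnit.NANOSECONDS.toMillis(elapsedNanos);
        }
    }

    /** Reads System.nanoTime immediately before and immediately after the call. */
    static <T> Timed<T> time(Callable<T> call) throws Exception {
        long start = System.nanoTime();
        T value = call.call();
        long elapsed = System.nanoTime() - start;
        return new Timed<>(value, elapsed);
    }
}
```

With an HBase client on the classpath, the call would presumably look like CallTimer.time(() -> table.get(req)), where table and req are the objects mentioned above.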
Min/max/avg represent the time in milliseconds required to retrieve a
document in the size bucket given in the first column. Count is the number
of docs sampled. Results were as follows:
Bucket  Min (ms)  Max (ms)  Avg (ms)  Count
16M+         542      1293       809     25
8M+          153       785       379    115
4M+           66       677       191    418
2M+           28       888       119   2895
1M+           13      1386        73   6864
512K+          7      2421        55  15697
256K+          4      3566        47  23261
128K+          2       853        38  13804
64K+           1       741        25   4251
32K+           0       724        13   2904
16K+           0       709        18    399
8K+            0       722        19   2885

Docs between 1MB and 2MB took 73 ms to retrieve on average. Docs between
2MB and 4MB took 119 ms.
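The per-bucket statistics above could be produced by something like the following sketch. It assumes buckets are powers of two in bytes (so "1M+" means 1,048,576 bytes up to, but not including, 2,097,152) and that averages are rounded to whole milliseconds; neither detail is stated in this message. LatencyBuckets and its methods are hypothetical names.

```java
import java.util.Comparator;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical aggregator: groups latency samples by power-of-two size bucket. */
final class LatencyBuckets {
    // Bucket lower bound in bytes -> latency statistics in ms, largest bucket first.
    private final Map<Long, LongSummaryStatistics> stats =
            new TreeMap<>(Comparator.<Long>reverseOrder());

    /** Lower bound of the bucket: the largest power of two not exceeding sizeBytes. */
    static long bucketFloor(long sizeBytes) {
        if (sizeBytes <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        return Long.highestOneBit(sizeBytes);
    }

    /** Records one retrieval of a document of the given size. */
    void record(long sizeBytes, long latencyMillis) {
        stats.computeIfAbsent(bucketFloor(sizeBytes), k -> new LongSummaryStatistics())
             .accept(latencyMillis);
    }

    /** Statistics for one bucket, or null if nothing was recorded in it. */
    LongSummaryStatistics statsFor(long bucketFloorBytes) {
        return stats.get(bucketFloorBytes);
    }

    /** Formats a bucket lower bound in the style used in the results, e.g. "512K+". */
    static String label(long floorBytes) {
        if (floorBytes >= (1L << 20)) return (floorBytes >> 20) + "M+";
        if (floorBytes >= (1L << 10)) return (floorBytes >> 10) + "K+";
        return floorBytes + "B+";
    }

    /** One line per bucket, largest bucket first. */
    String report() {
        StringBuilder sb = new StringBuilder();
        stats.forEach((floor, s) -> sb.append(String.format(
                "%s: Min: %d, max: %d, avg: %d, count: %d%n",
                label(floor), s.getMin(), s.getMax(),
                Math.round(s.getAverage()), s.getCount())));
        return sb.toString();
    }
}
```

Each sample would be fed in with record(sizeInBytes, elapsedMillis) after a timed retrieval, and report() called once at the end.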
Testing was performed by retrieving one document at a time from a cluster
with no additional load. Each doc was retrieved only once to minimize cache
usage. There were no requests for nonexistent docs. The ping time between the
client machine and the region servers was 0.2 ms (0.0002 seconds), and the
link was 1 Gbit/s. The table was not compacted prior to the test, so data
locality may play a role in the observed results.
Regards,
Daniel

2017-04-04 21:44 GMT+02:00 Mikhail Antonov <olorinbant@gmail.com>:

> Unfamiliar with the MOB codebase, but reading "It takes 100 ms
> to retrieve a 1MB cell (file), and only after retrieving I am able to start
> serving it to the end user"... Is that avg, p90, p99?
>
> Do we have htrace probes on that codepath? Do we do more seeks than we
> expect?
>
> -Mikhail
>
> On Tue, Apr 4, 2017 at 11:23 AM, Stack <stack@duboce.net> wrote:
>
> > On Thu, Mar 30, 2017 at 9:25 PM, Daniel Jeliński <djelinski1@gmail.com>
> > wrote:
> >
> > > Thank you Ted for your response.
> > >
> > > I have read that part of HBase book. It never explained why objects over
> > > 10MB are no good, and did not suggest an alternative storage medium for
> > > these.
> > >
> > >
> > That's a hole. I filed HBASE-17875.
> >
> > The 10MB limit is a conservative upper bound. Bigger Cells will skirt
> > buffer pools so we'll do one-off allocations per read. The GC will
> > experience a shock, and so on.
> >
> >
> > > I have also read this:
> > > http://hbase.apache.org/book.html#regionserver_sizing_rules_of_thumb
> > > And yet I'm trying to put 36TB on a machine. I certainly hope that the
> > > number of region servers is the only real limiter to this.
> > >
> > >
> > In the refguide, the guidance is intentionally conservative. It is probably
> > also stale at this point. Most users/devs do not do the degree of PoC'ing
> > that you have. The recommendations are more for them than for you.
> >
> >  ...
> >
> >
> > > I checked async HBase projects, but apparently they're focused on running
> > > the requests in background, rather than returning results earlier.
> > > Searching Google for HBase streaming returns just references to Spark.
> > >
> > > HBase JIRA has a few apparently related issues:
> > > https://issues.apache.org/jira/browse/HBASE-17721 is pretty fresh with no
> > > development yet, and https://issues.apache.org/jira/browse/HBASE-13467
> > > seems to have died already.
> > >
> > >
> > I pinged on HBASE-13467. My understanding was that this project was
> > underway...
> >
> > St.Ack
> >
> >
> >
> >
> > > I captured the network traffic between the client and the region server
> > > when I requested one cell, and writing a custom client seems easy enough.
> > > Are there any reasons other than the API that justify the 10MB limit on
> > > MOBs?
> > > Thanks,
> > > Daniel
> > >
> > >
> > >
> > > 2017-03-31 0:03 GMT+02:00 Ted Yu <yuzhihong@gmail.com>:
> > >
> > > > Have you read:
> > > > http://hbase.apache.org/book.html#hbase_mob
> > > >
> > > > In particular:
> > > >
> > > > When using MOBs, ideally your objects will be between 100KB and 10MB
> > > >
> > > > Cheers
> > > >
> > > > On Thu, Mar 30, 2017 at 1:01 PM, Daniel Jeliński <djelinski1@gmail.com>
> > > > wrote:
> > > >
> > > > > Hello,
> > > > > I'm evaluating HBase as a cheaper replacement for NAS as a file storage
> > > > > medium. To that end I have a cluster of 5 machines, 36TB HDD each; I'm
> > > > > planning to initially store ~240 million files of size 1KB-100MB, total
> > > > > size 30TB. Currently I'm storing each file under an individual column, and
> > > > > I group related documents in the same row. The files from the same row will
> > > > > be served one at a time, but updated/deleted together.
> > > > >
> > > > > Loading the data to the cluster went pretty well; I enabled MOB on the
> > > > > table and have ~50 regions per machine. Writes to the table are done by an
> > > > > automated process, and cluster's performance in that area is more than
> > > > > sufficient. On the other hand, reads are interactive, as the files are
> > > > > served to human users over HTTP.
> > > > >
> > > > > Now. HBase Get in Java API is an atomic operation in the sense that it does
> > > > > not complete until all data is retrieved from the server. It takes 100 ms
> > > > > to retrieve a 1MB cell (file), and only after retrieving I am able to start
> > > > > serving it to the end user. For larger cells the wait time is even longer,
> > > > > and response times longer than 100 ms are bad for user experience. I would
> > > > > like to start streaming the file over HTTP as soon as possible.
> > > > >
> > > > > What's the recommended approach to avoid or reduce the delay between when
> > > > > HBase starts sending the response and when the application can act on it?
> > > > > Thanks,
> > > > > Daniel
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Thanks,
> Michael Antonov
>
