hbase-issues mailing list archives

From "Jonathan Hsieh (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-11339) HBase MOB
Date Tue, 02 Sep 2014 20:10:23 GMT

    [ https://issues.apache.org/jira/browse/HBASE-11339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118659#comment-14118659 ]

Jonathan Hsieh commented on HBASE-11339:

Lars and I chatted back on Thursday, and agreed that the solution for truly large objects
(larger than the default HDFS block size) would require a new streaming API.  We also talked
about the importance of keeping configuration and operations simple.

[~lhofhansl], can you describe the scale and the load for the hybrid MOB storage system you
have or are working on?  It is new to me, and I'd be very curious about how things like backups
and bulk loads are handled in that system.

Details below:


The fundamental problem this MOB solution addresses is the balance between the hdfs small
file problem on one side and the write amplification (and the performance variability it causes)
on the other.  Objects with values larger than 64MB are out of scope; we are in agreement
that those need a streaming api and that an hdfs+hbase solution seems more reasonable for
them.  The goal with the MOB mechanism is to show demonstrable improvements in predictability
and scalability for objects that are too small for where hdfs makes sense, but where hbase
is non-optimal due to splits and compactions.
In some workloads I've been seeing (timeseries large sensor dumps, mini indexes, or binary
documents or images stored as blob cells), this feature would potentially be very helpful.

bq. We still cannot stream the mobs. They have to be materialized at both the server and the
client (going by the documentation here)

This is true; however, this is *not* the design point we are trying to address.

bq. As I state above this can be achieved with a HBase/HDFS client alone and better: Store
mobs up to a certain size by value in HBase (say 5 or 10mb or so), everything larger goes
straight into HDFS with a reference only in HBase. This addresses both the many small files
issue in HDFS (only files larger than 5-10mb would end up in HDFS) and the streaming problem
for large files in HBase. Also as outlined by me in June we can still make this "transactional"
in the HBase sense with a three step protocol: (1) write reference row, (2) stream blob to
HDFS, (3) record location in HDFS (that's the commit). This solution is also missing from
the initial PDF in the "Existing Solutions" section.
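For concreteness, the quoted three-step protocol could be sketched roughly as follows. This is a hypothetical illustration using plain dicts as in-memory stand-ins for the HBase table and the HDFS namespace; the function names and path scheme are made up, not real client APIs:

```python
# Sketch of the three-step "transactional" blob write quoted above.
# hbase/hdfs are in-memory stand-ins, not real clients (illustrative only).

hbase = {}   # row key -> {"state": ..., "hdfs_path": ...}
hdfs = {}    # path -> blob bytes

def write_blob(row_key, blob):
    # (1) Write the reference row first; until step (3) completes,
    # readers see an uncommitted row and must ignore it.
    hbase[row_key] = {"state": "pending", "hdfs_path": None}
    # (2) Stream the blob to HDFS under a path derived from the row key.
    path = "/blobs/" + row_key
    hdfs[path] = blob
    # (3) Record the HDFS location back in HBase -- this is the commit point.
    hbase[row_key] = {"state": "committed", "hdfs_path": path}

def read_blob(row_key):
    ref = hbase.get(row_key)
    if ref is None or ref["state"] != "committed":
        return None  # absent, or a failed write that never committed
    return hdfs[ref["hdfs_path"]]

write_blob("img-001", b"...jpeg bytes...")
assert read_blob("img-001") == b"...jpeg bytes..."
assert read_blob("img-999") is None  # never written, so not visible
```

A crash between steps (2) and (3) leaves an orphaned HDFS file and a pending row, so a real implementation would also need a cleanup pass, which is part of the operational complexity discussed below.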

Back in June, JingCheng's response to your comments never got feedback on how you'd manage
the small files problem.

Also, two HDFS blob + HBase metadata solutions are explicitly mentioned in section 4.1.2
(v4 design doc) with pros and cons.  The solution you propose is actually the first described
hdfs+hbase approach -- though its pros and cons don't go into the particulars of the commit
protocol (the two-phase prepare-then-commit does make sense).  The largest concern was in
the doc as well -- the HDFS small files problem.

Having a separate hdfs file per 100KB-10MB value is not a scalable or long-term solution.
 Let's do an example -- let's say we wrote 200M 500KB blobs.  This ends up being 100TB of data.

* Using hbase as is, we end up with the objects stored in roughly 10,000 10GB files.
** Along the way, we'd end up splitting and compacting every 20,000 objects, rewriting large
chunks of the 100TB over and over.
* Using hdfs+hbase, we'd end up with 200M files -- far more than the roughly 10,000 files
that the vanilla hbase approach could eventually compact down to.
** 200M files would consume ~200GB of RAM for block records in the NN (200M files * 3 block
replicas per file * ~300 bytes per hdfs inode + blockinfo [1][2]  ->  ~200GB), which is definitely
uncharted territory for NNs -- a place where there would likely be GC problems and other
negative effects.
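A quick back-of-the-envelope check of the numbers above (all inputs are the rough estimates from the text -- 500KB blobs, 10GB hfiles, ~300 bytes per NN record -- not measurements):

```python
# Back-of-the-envelope check of the 200M x 500KB example.

blob_size = 500 * 1024            # 500KB per blob
num_blobs = 200_000_000           # 200M blobs

total_bytes = blob_size * num_blobs
print(total_bytes / 1024**4)      # ~93 TiB, i.e. roughly 100TB

# Vanilla HBase: pack everything into ~10GB store files.
hfile_size = 10 * 1024**3
print(total_bytes // hfile_size)  # ~9,500 files, i.e. on the order of 10,000
print(hfile_size // blob_size)    # ~21,000 blobs per file, i.e. ~20,000

# hdfs+hbase: one HDFS file per blob -> 200M NameNode entries.
replicas = 3
bytes_per_record = 300            # ~300 bytes per inode + block record
nn_ram = num_blobs * replicas * bytes_per_record
print(nn_ram / 1024**3)           # ~168 GiB -> on the order of 200GB
```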

bq. "Replication" here can still happen by the client, after all, each file successfully stored
in HDFS has a reference in HBase.

The design doc approach actually minimizes the operational changes required to store MOBs.
 From an operational point of view, users could just enable the optional feature and take
advantage of its benefits.

The proposed hdfs+hbase approach actually pushes more complexity into the replication mechanism.
 For replication to work, the source cluster would now need new mechanisms to open the mobs
on hdfs and ship them off to the other cluster.  The MOB approach is simpler operationally
and in code because it can use the normal replication mechanism.

The proposed hdfs+hbase approach would need updates and a new bulk load mechanism.  The MOB
approach is simpler operationally and in code because it would use normal bulk loads, and
compactions would eventually push the mobs out.  (same IO cost)

The proposed hdfs+hbase approach would need updates to properly handle table snapshots, restoring
table snapshots, and backups.  Naively it seems like we'd have to do a lot of NN operations
(one per mob) to back up the mobs.  Also, we'd need to build new tools to manage export and
copy table operations as well.    The MOB approach is simpler because the copy/export table
mechanisms remain the same, and we can use the same archiver mechanism to manage mob file
snapshots (mobs are essentially stored in a special region).

bq. We should use the tools what they were intended for. HBase for key value storage, HDFS
for streaming large blobs.

We agree here.

The case for HDFS is weak: 1MB-5MB blobs are not large enough for HDFS -- in fact for HDFS
this is nearly pathological.

The case for HBase is ok: We are writing key-values that tend to be larger than normal.  However,
constant continuous ingest of 1MB-5MB MOBs will likely trigger splits more frequently, which
will in turn trigger unavoidable major compactions.  This would occur even if bulk load mechanisms
were used.

The case for HBase + MOB is stronger:  We are writing key-values that tend to be large.  The
bulk of the 1MB-5MB MOB data is written off to the MOB path.  The metadata for a mob (let's
say 100 bytes per cell) is relatively small, and thus compactions and splits will be much
rarer (~10,000x less frequent) than with the hbase-only approach.  If bulk loads are done,
an initial compaction would separate out the mob data and keep the region relatively small.
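The ~10,000x figure falls out of simple arithmetic: a region filled with ~100-byte reference cells takes roughly four orders of magnitude longer to reach its split threshold than one filled with full 1MB values. A rough check (the 10GB split threshold is an assumed example, not a quoted setting):

```python
# Why MOB metadata splits/compacts so much less often: compare how many
# rows fit under a split threshold for inline 1MB values vs ~100-byte refs.

value_size = 1 * 1024 * 1024      # 1MB mob value stored inline
ref_size = 100                    # ~100 bytes of reference metadata per mob
region_split_size = 10 * 1024**3  # assumed 10GB region split threshold

inline_rows_per_split = region_split_size // value_size   # ~10,000 rows
mob_rows_per_split = region_split_size // ref_size        # ~107M rows

# Splits become roughly value_size/ref_size times rarer: ~10,000x.
print(mob_rows_per_split // inline_rows_per_split)
```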

bq. Just saying using one client API for client convenience is not a reason to put all of
this into HBase. A client can easily speak both HBase and HDFS protocols.

The tradeoff being made by the hdfs+hbase approach is opting for more operational complexity
in exchange for implementation simplicity.  With the hdfs+hbase approach, we'd also introduce
new security issues -- users in hbase would now have the ability to modify the file system
directly, and we'd have to keep users and credentials in sync on both hdfs and hbase.  With
the MOB approach we just rely on HBase's security mechanism, and should be able to have per-cell
ACLs, visibility tags, and all the rest.

Given the choice of 1) making hbase simpler to use by adding some internal complexity or 2)
making hbase more operationally difficult by adding new external processes and requiring external
integrations to manage parts of its data, we should opt for 1.  Making HBase easier to use
by removing knobs or making knobs as simple as possible should be the priority.

bq. (Subjectively) I do not like the complexity of this as seen by the various discussions
here. That part is just my $0.02 of course.

We agree about not liking complexity.  However, the discussion process was public and we described
and knocked down several strawmen. I actually initially took the side against adding this
feature, but have been convinced that when complete, this would have light operator impact
and actually be less complex than a full solution that uses the hybrid hdfs+hbase approach.

[1] https://issues.apache.org/jira/browse/HDFS-6658
[2] https://issues.apache.org/jira/secure/attachment/12651408/Block-Manager-as-a-Service.pdf
(see RAM Explosion section)

> HBase MOB
> ---------
>                 Key: HBASE-11339
>                 URL: https://issues.apache.org/jira/browse/HBASE-11339
>             Project: HBase
>          Issue Type: Umbrella
>          Components: regionserver, Scanners
>            Reporter: Jingcheng Du
>            Assignee: Jingcheng Du
>         Attachments: HBase MOB Design-v2.pdf, HBase MOB Design-v3.pdf, HBase MOB Design-v4.pdf,
HBase MOB Design.pdf, MOB user guide.docx, MOB user guide_v2.docx, hbase-11339-in-dev.patch
>   It's quite useful to save medium binary data like images and documents into Apache
HBase. Unfortunately, directly saving binary MOBs (medium objects) to HBase leads to worse
performance because of frequent splits and compactions.
>   In this design, the MOB data is stored in a more efficient way, which keeps high
write/read performance and guarantees data consistency in Apache HBase.

This message was sent by Atlassian JIRA
