hbase-issues mailing list archives

From "Lars Hofhansl (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-11339) HBase MOB
Date Mon, 01 Sep 2014 23:10:22 GMT

    [ https://issues.apache.org/jira/browse/HBASE-11339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117774#comment-14117774

Lars Hofhansl commented on HBASE-11339:

[~jon@cloudera.com] and I talked about this at the HBase meetup...

I'm sorry to be the party pooper here, but this complexity and functionality really does not
belong in HBase IMHO.
I still do not get the motivation for this... Here's why:
# We still cannot stream the mobs. They have to be materialized at both the server and the
client (going by the documentation here)
# As I stated above, this can be achieved with an HBase/HDFS client alone, and better: store mobs
up to a certain size by value in HBase (say 5 or 10 MB or so); everything larger goes straight
into HDFS with only a reference in HBase. This addresses both the many-small-files issue in
HDFS (only files larger than 5-10 MB would end up in HDFS) and the streaming problem for large
files in HBase. Also, as I outlined in June, we can still make this "transactional" in the
HBase sense with a three-step protocol: (1) write reference row, (2) stream blob to HDFS,
(3) record location in HDFS (that's the commit). This solution is also missing from the
"Existing Solutions" section of the initial PDF.
# "Replication" here can still happen by the client, after all, each file successfully stored
in HDFS has a reference in HBase.
# We should use the tools for what they were intended: HBase for key-value storage, HDFS for
streaming large blobs.
# Client convenience, i.e. having a single client API, is *not* a reason to put all of
this into HBase. A client can easily speak both the HBase and HDFS protocols.
# (Subjectively) I do not like the complexity of this, as seen in the various discussions here.
That part is just my $0.02 of course.
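
A minimal sketch of the size threshold and three-step protocol described in the second point above, using in-memory dictionaries as stand-ins for the HBase table and for HDFS. The `BlobStore` class, its method names, the `/blobs/` path layout, and the 10 MB default are assumptions made for illustration; this is not a real HBase or HDFS client API, and not the Salesforce implementation mentioned below.

```python
import io

# Assumed threshold; the comment only suggests "5 or 10 MB or so".
INLINE_LIMIT = 10 * 1024 * 1024


class BlobStore:
    """Hypothetical client-side store; dicts stand in for HBase and HDFS."""

    def __init__(self, inline_limit=INLINE_LIMIT, chunk_size=64 * 1024):
        self.rows = {}    # stand-in for the HBase table: row key -> cell dict
        self.files = {}   # stand-in for HDFS: path -> bytes
        self.inline_limit = inline_limit
        self.chunk_size = chunk_size

    def put(self, key, stream, size):
        if size <= self.inline_limit:
            # Small object: stored by value in the row, never touches the file store.
            self.rows[key] = {"value": stream.read()}
            return
        # (1) Write the reference row with no location yet: uncommitted.
        self.rows[key] = {"location": None}
        # (2) Stream the blob to the file store chunk by chunk.
        path = "/blobs/" + key
        buf = bytearray()
        while True:
            chunk = stream.read(self.chunk_size)
            if not chunk:
                break
            buf.extend(chunk)
        self.files[path] = bytes(buf)
        # (3) Record the location in the row; this is the commit.
        self.rows[key] = {"location": path}

    def get(self, key):
        """Return a file-like object, or None if absent or uncommitted."""
        row = self.rows.get(key)
        if row is None:
            return None
        if "value" in row:
            return io.BytesIO(row["value"])
        if row["location"] is None:
            return None  # step (3) never happened, so readers see nothing
        return io.BytesIO(self.files[row["location"]])
```

In this sketch a reader only ever sees a large blob after step (3), so a client that dies during step (2) leaves behind an uncommitted reference row that a cleanup job could find and remove. A real client would hand back the HDFS input stream directly rather than buffering the bytes in memory as the stand-in does.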

This looks to me like a solution to a problem that we do not have.

Again I am sorry about being negative here, but we have to be careful what we put into HBase
and for what reasons.

Especially when there seems to be a *better* client-only solution (in the sense that it can
deal with larger files and allows for streaming them).

If we need a solution for this, let's build one on top of HBase/HDFS. We (Salesforce) are
actually building a client-only solution for this; it's not that difficult (I will see whether
we can open source this - it might be too entangled with our internals). With an easy protocol
we can still allow data locality for all blob reads (at least as much as the block distribution
allows), etc.
[~jesse_yates], maybe you want to add here?

If we cannot store 10 MB Cells in HBase, then that's something to address. The fact that we
cannot stream into and out of HBase needs to be addressed; that is the real problem anyway.

> HBase MOB
> ---------
>                 Key: HBASE-11339
>                 URL: https://issues.apache.org/jira/browse/HBASE-11339
>             Project: HBase
>          Issue Type: Umbrella
>          Components: regionserver, Scanners
>            Reporter: Jingcheng Du
>            Assignee: Jingcheng Du
>         Attachments: HBase MOB Design-v2.pdf, HBase MOB Design-v3.pdf, HBase MOB Design-v4.pdf,
HBase MOB Design.pdf, MOB user guide.docx, MOB user guide_v2.docx, hbase-11339-in-dev.patch
>   It's quite useful to save medium-sized binary data such as images and documents in Apache
HBase. Unfortunately, saving a binary MOB (medium object) directly to HBase degrades
performance because of frequent splits and compactions.
>   In this design, the MOB data are stored in a more efficient way, which keeps
write/read performance high and guarantees data consistency in Apache HBase.

This message was sent by Atlassian JIRA
