From: Michael Segel <michael_segel@hotmail.com>
Subject: Re: Standalone == Dev Only?
Date: Mon, 16 Mar 2015 09:10:29 -0500
To: user@hbase.apache.org

I guess the old adage is true.
If you only have a hammer, then every problem looks like a nail.

As an architect, it's your role to find the right tools to solve the problem in the most efficient and effective manner. So the first question you need to ask is whether HBase is the right tool. The OP's project isn't one that should be put into HBase.

Velocity? Volume? Variety?

These are the three aspects of Big Data, and they can also be used to test whether a problem should be solved using HBase. You don't need all three, but you should have at least two of the three if you have a good candidate.

The other thing to consider is how you plan on using the data. If you're not using M/R or HDFS, then you don't want to use HBase in production.

And as a good architect, you want to take the inverse of the problem and ask why not a relational database, or an existing hierarchical database. (Both technologies have been around 30+ years.) And it turns out that you can.

So the OP's problem lacks the volume.
It also lacks the variety.

So if we ask a simple question of how to use an RDBMS to handle this… it's pretty straightforward.

Store the medical record(s) in either XML or JSON format. On ingestion, copy out only the fields required to identify a unique record.
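As a concrete illustration of that ingest step, here is a minimal sketch in Java against an assumed JDBC/PostgreSQL setup. The table name (medical_record), its columns, and the JSON field names (patientId, recordDate) are hypothetical, chosen purely for illustration and not taken from the OP's project:

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.Date;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class RecordIngest {

        // Assumed table, for illustration only:
        //   CREATE TABLE medical_record (
        //       patient_id  VARCHAR(64) NOT NULL,
        //       record_date DATE        NOT NULL,
        //       doc         TEXT        NOT NULL,   -- the full JSON record, stored whole
        //       PRIMARY KEY (patient_id, record_date)
        //   );
        static final String INSERT =
            "INSERT INTO medical_record (patient_id, record_date, doc) VALUES (?, ?, ?)";

        // Store the whole record as a text LOB and copy out only the fields
        // needed to identify it into their own (indexable) columns.
        static void ingest(Connection conn, String recordJson) throws IOException, SQLException {
            JsonNode doc = new ObjectMapper().readTree(recordJson);
            try (PreparedStatement ps = conn.prepareStatement(INSERT)) {
                ps.setString(1, doc.get("patientId").asText());               // identifying field
                ps.setDate(2, Date.valueOf(doc.get("recordDate").asText()));  // identifying field
                ps.setString(3, recordJson);                                  // the record itself
                ps.executeUpdate();
            }
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical local database; any JDBC-capable RDBMS would do.
            try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/ods")) {
                ingest(conn, "{\"patientId\":\"p-001\",\"recordDate\":\"2015-03-16\",\"notes\":\"example\"}");
            }
        }
    }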
That's your base record storage.

Indexing could be done one of two ways.
1) You could use an inverted table.
2) You could copy out the field to be used in the index as a column and then index that column.

If you use an inverted table, your schema design would translate into HBase. Then when you access the data, you use the index to find the result set, and for each record you have the JSON object that you can use as a whole or just in components.

The pattern of storing the record in a single column as a text LOB and then creating indexes to identify and locate the records isn't new. I used it at a client over 15 years ago for an ODS implementation.
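For illustration only, here is a minimal sketch of how that inverted-table pattern could be expressed against the HBase 1.0 client API. The table names (record, record_idx), the column family, and the row-key layout are assumptions made for this sketch, not a design proposed anywhere in this thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class InvertedIndexSketch {

        static final byte[] CF = Bytes.toBytes("d");   // single column family, assumed

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table record = conn.getTable(TableName.valueOf("record"));
                 Table index  = conn.getTable(TableName.valueOf("record_idx"))) {

                String recordId = "p-001_2015-03-16";
                String json = "{\"patientId\":\"p-001\",\"recordDate\":\"2015-03-16\",\"notes\":\"example\"}";

                // Base table: one row per record, the whole JSON document in a single cell.
                Put r = new Put(Bytes.toBytes(recordId));
                r.addColumn(CF, Bytes.toBytes("doc"), Bytes.toBytes(json));
                record.put(r);

                // Inverted index table: row key = indexed value + record id,
                // and the cell just points back at the base row.
                Put ix = new Put(Bytes.toBytes("patientId=p-001|" + recordId));
                ix.addColumn(CF, Bytes.toBytes("ref"), Bytes.toBytes(recordId));
                index.put(ix);
            }
        }
    }

Reading then follows the access path described above: a prefix scan over record_idx for the indexed value yields the record ids, and point gets against record return the JSON objects.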
In terms of HBase…

Stability depends on the hardware, the admin, and the use cases. It's still relatively unstable; in most cases nowhere near four 9's.

Considering that there are also regulatory compliance issues… e.g. security… this alone will rule HBase out in a standalone situation, and again, even with Kerberos implemented, you may not meet your security requirements.

Bottom line, the OP is going to do what he's going to do. All I can do is tell him it's not a good idea, and why.

This email thread is great column fodder for a blog as well as for a presentation on why/why not HBase and Hadoop. It's something that should be included in a design lecture or lectures, but unfortunately most of the larger conferences are driven by the vendors, who have their own agendas and slots that they want to fill with marketing talks.

BTW, I am really curious: if the OP is using a standalone instance of HBase, how does the immature HDFS encryption help secure his data? ;-)

HTH

-Mike

> On Mar 13, 2015, at 3:44 PM, Sean Busbey wrote:
>
> On Fri, Mar 13, 2015 at 2:41 PM, Michael Segel wrote:
>
>>
>> In stand alone, you're writing to local disk. You lose the disk, you lose
>> the data, unless of course you've RAIDed your drives.
>> Then when you lose the node, you lose the data because it's not being
>> replicated. While this may not be a major issue or concern… you have to be
>> aware of its potential.
>>
>>
> It sounds like he has this issue covered via VM imaging.
>
>
>> The other issue when it comes to security: HBase relies on the cluster's
>> security.
>> To be clear, HBase relies on the cluster and the use of Kerberos to help
>> with authentication, so that only those who have the rights to see the
>> data can actually have access to it.
>>
>>
> He can get around this by relying on the Thrift or REST services to act as
> an arbitrator, or he could make his own. So long as he separates access to
> the underlying cluster / HBase APIs from whatever exposes the data,
> this shouldn't be a problem.
>
>
>> Then you have to worry about auditing. With respect to HBase, out of the
>> box, you don't have any auditing.
>>
>>
> HBase has auditing. By default it is disabled, and it certainly could use
> some improvement. Documentation would be a good start. I'm sure the
> community would be happy to work with Joseph to close whatever gap he needs.
>
>
>> You also don't have built-in encryption.
>> You can do it, but then you have a bit of work ahead of you.
>> Cell level encryption? Accumulo?
>>
>>
> HBase has had encryption since the 0.98 line. It is stable now in the
> 1.0 release line. HDFS also supports encryption, though I'm sure using it
> with the LocalFileSystem would benefit from testing. There are vendors that
> can help with integration with proper key servers, if that is something
> Joseph needs and doesn't want to do on his own.
>
> Accumulo does not do cell level encryption.
>
>
>> There's definitely more to it.
>>
>> But the one killer thing… you need to be HIPAA compliant, and the simplest
>> way to do this is to use a real RDBMS. If you need extensibility, look at
>> IDS from IBM (IBM bought Informix ages ago.)
>>
>> I think based on the size of your data… you can get away with the free
>> version, and even if not, IBM does do discounts with universities and could
>> even sponsor research projects.
>>
>> I don't know your data, but 10^6 rows is still small.
>>
>> The point I'm trying to make is that based on what you've said, HBase is
>> definitely not the right database for you.
>>
>>
> We haven't heard what the target data set size is. If Joseph has reason to
> believe that it will be big enough to warrant something like HBase (e.g.
> 10s of billions of rows), I think there's merit to his argument for
> starting with HBase. Single node use cases are definitely not something
> we've covered well to date, but it would probably help our overall
> usability story to do so.
>
>
> --
> Sean

The opinions expressed here are mine; while they may reflect a cognitive thought, that is purely accidental.
Use at your own risk.

Michael Segel
michael_segel (AT) hotmail.com