From: Michael Segel <michael_segel@hotmail.com>
Subject: Re: Standalone == Dev Only?
Date: Mon, 16 Mar 2015 09:10:29 -0500
To: user@hbase.apache.org

I guess the old adage is true.
If you only have a hammer, then every problem looks like a nail.

As an architect, it's your role to find the right tools to solve the problem in the most efficient and effective manner. So the first question you need to ask is whether HBase is the right tool. The OP's project isn't one that should be put into HBase.

Velocity? Volume? Variety?

These are the three aspects of Big Data, and they can also be used to test whether a problem should be solved using HBase. You don't need all three, but you should have at least two of the three if you have a good candidate.

The other thing to consider is how you plan on using the data. If you're not using M/R or HDFS, then you don't want to use HBase in production.

And as a good architect, you want to take the inverse of the problem and ask why not a relational database, or an existing hierarchical database. (Both technologies have been around 30+ years.) And it turns out that you can.

So the OP's problem lacks the volume.
It also lacks the variety.

So if we ask a simple question of how to use an RDBMS to handle this… it's pretty straightforward.

Store the medical record(s) in either XML or JSON format. On ingestion, copy out only the fields required to identify a unique record.
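As a concrete illustration of that ingest step, here is a minimal sketch in Java against an assumed JDBC/PostgreSQL setup. The table name (medical_record), its columns, and the JSON field names (patientId, recordDate) are hypothetical, chosen purely for illustration and not taken from the OP's project:

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.Date;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class RecordIngest {

        // Assumed table, for illustration only:
        //   CREATE TABLE medical_record (
        //       patient_id  VARCHAR(64) NOT NULL,
        //       record_date DATE        NOT NULL,
        //       doc         TEXT        NOT NULL,   -- the full JSON record, stored whole
        //       PRIMARY KEY (patient_id, record_date)
        //   );
        static final String INSERT =
            "INSERT INTO medical_record (patient_id, record_date, doc) VALUES (?, ?, ?)";

        // Store the whole record as a text LOB and copy out only the fields
        // needed to identify it into their own (indexable) columns.
        static void ingest(Connection conn, String recordJson) throws IOException, SQLException {
            JsonNode doc = new ObjectMapper().readTree(recordJson);
            try (PreparedStatement ps = conn.prepareStatement(INSERT)) {
                ps.setString(1, doc.get("patientId").asText());               // identifying field
                ps.setDate(2, Date.valueOf(doc.get("recordDate").asText()));  // identifying field
                ps.setString(3, recordJson);                                  // the record itself
                ps.executeUpdate();
            }
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical local database; any JDBC-capable RDBMS would do.
            try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/ods")) {
                ingest(conn, "{\"patientId\":\"p-001\",\"recordDate\":\"2015-03-16\",\"notes\":\"example\"}");
            }
        }
    }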
That's your base record storage.

Indexing could be done one of two ways.
1) You could use an inverted table.
2) You could copy out the field to be used in the index as a column and then index that column.

If you use an inverted table, your schema design would translate into HBase. Then when you access the data, you use the index to find the result set, and for each record you have the JSON object that you can use as a whole or just in components.

The pattern of storing the record in a single column as a text LOB and then creating indexes to identify and locate the records isn't new. I used it at a client over 15 years ago for an ODS implementation.
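For illustration only, here is a minimal sketch of how that inverted-table pattern could be expressed against the HBase 1.0 client API. The table names (record, record_idx), the column family, and the row-key layout are assumptions made for this sketch, not a design proposed anywhere in this thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class InvertedIndexSketch {

        static final byte[] CF = Bytes.toBytes("d");   // single column family, assumed

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table record = conn.getTable(TableName.valueOf("record"));
                 Table index  = conn.getTable(TableName.valueOf("record_idx"))) {

                String recordId = "p-001_2015-03-16";
                String json = "{\"patientId\":\"p-001\",\"recordDate\":\"2015-03-16\",\"notes\":\"example\"}";

                // Base table: one row per record, the whole JSON document in a single cell.
                Put r = new Put(Bytes.toBytes(recordId));
                r.addColumn(CF, Bytes.toBytes("doc"), Bytes.toBytes(json));
                record.put(r);

                // Inverted index table: row key = indexed value + record id,
                // and the cell just points back at the base row.
                Put ix = new Put(Bytes.toBytes("patientId=p-001|" + recordId));
                ix.addColumn(CF, Bytes.toBytes("ref"), Bytes.toBytes(recordId));
                index.put(ix);
            }
        }
    }

Reading then follows the access path described above: a prefix scan over record_idx for the indexed value yields the record ids, and point gets against record return the JSON objects.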
In terms of HBase…

Stability depends on the hardware, the admin, and the use cases. It's still relatively unstable; in most cases nowhere near four 9's.

Considering that there are also regulatory compliance issues… e.g. security… this alone will rule HBase out in a standalone situation, and again, even with Kerberos implemented, you may not meet your security requirements.

Bottom line, the OP is going to do what he's going to do. All I can do is tell him it's not a good idea, and why.

This email thread is great column fodder for a blog as well as for a presentation on why/why not HBase and Hadoop. It's something that should be included in a design lecture or lectures, but unfortunately most of the larger conferences are driven by the vendors, who have their own agendas and slots that they want to fill with marketing talks.

BTW, I am really curious: if the OP is using a standalone instance of HBase, how does the immature HDFS encryption help secure his data? ;-)

HTH

-Mike

> On Mar 13, 2015, at 3:44 PM, Sean Busbey wrote:
>
> On Fri, Mar 13, 2015 at 2:41 PM, Michael Segel wrote:
>
>>
>> In stand alone, you're writing to local disk. You lose the disk, you lose
>> the data, unless of course you've RAIDed your drives.
>> Then when you lose the node, you lose the data because it's not being
>> replicated. While this may not be a major issue or concern… you have to be
>> aware of its potential.
>>
>>
> It sounds like he has this issue covered via VM imaging.
>
>
>> The other issue when it comes to security: HBase relies on the cluster's
>> security.
>> To be clear, HBase relies on the cluster and the use of Kerberos to help
>> with authentication, so that only those who have the rights to see the
>> data can actually have access to it.
>>
>>
> He can get around this by relying on the Thrift or REST services to act as
> an arbitrator, or he could make his own. So long as he separates access to
> the underlying cluster / HBase APIs from whatever exposes the data,
> this shouldn't be a problem.
>
>
>> Then you have to worry about auditing. With respect to HBase, out of the
>> box, you don't have any auditing.
>>
>>
> HBase has auditing. By default it is disabled, and it certainly could use
> some improvement. Documentation would be a good start. I'm sure the
> community would be happy to work with Joseph to close whatever gap he needs.
>
>
>> You also don't have built-in encryption.
>> You can do it, but then you have a bit of work ahead of you.
>> Cell level encryption? Accumulo?
>>
>>
> HBase has had encryption since the 0.98 line. It is stable now in the
> 1.0 release line. HDFS also supports encryption, though I'm sure using it
> with the LocalFileSystem would benefit from testing. There are vendors that
> can help with integration with proper key servers, if that is something
> Joseph needs and doesn't want to do on his own.
>
> Accumulo does not do cell level encryption.
>
>
>> There's definitely more to it.
>>
>> But the one killer thing… you need to be HIPAA compliant, and the simplest
>> way to do this is to use a real RDBMS. If you need extensibility, look at
>> IDS from IBM (IBM bought Informix ages ago.)
>>
>> I think based on the size of your data… you can get away with the free
>> version, and even if not, IBM does do discounts with universities and could
>> even sponsor research projects.
>>
>> I don't know your data, but 10^6 rows is still small.
>>
>> The point I'm trying to make is that based on what you've said, HBase is
>> definitely not the right database for you.
>>
>>
> We haven't heard what the target data set size is. If Joseph has reason to
> believe that it will be big enough to warrant something like HBase (e.g.
> 10s of billions of rows), I think there's merit to his argument for
> starting with HBase. Single node use cases are definitely not something
> we've covered well to date, but it would probably help our overall
> usability story to do so.
>
>
> --
> Sean

The opinions expressed here are mine; while they may reflect a cognitive thought, that is purely accidental.
Use at your own risk.

Michael Segel
michael_segel (AT) hotmail.com