hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Standalone == Dev Only?
Date Fri, 13 Mar 2015 19:41:39 GMT
Joseph, 

In standalone mode, you’re writing to local disk. If you lose the disk, you lose the data, unless
of course you’ve RAIDed your drives. 
And if you lose the node, you lose the data because it’s not being replicated. While this
may not be a major issue or concern… you have to be aware of its potential. 

The other issue is security: HBase relies on the cluster’s security. 
To be clear, HBase relies on the cluster and the use of Kerberos to help with authentication,
so that only those who have the rights to see the data can actually have access to it. 

Then you have to worry about auditing. With respect to HBase, out of the box, you don’t
have any auditing. 

With respect to stability,  YMMV.  HBase is only as stable as the admin. 

You also don’t have built in encryption.  
You can do it, but then you have a bit of work ahead of you. 
Cell level encryption? Accumulo?

There’s definitely more to it. 

But the one killer thing… you need to be HIPAA compliant and the simplest way to do this
is to use a real RDBMS. If you need extensibility, look at IDS from IBM (IBM bought Informix
ages ago). 

I think based on the size of your data… you can get away with the free version, and even
if not, IBM does do discounts with Universities and could even sponsor research projects.


I don’t know your data, but 10^6 rows is still small.  

The point I’m trying to make is that based on what you’ve said, HBase is definitely not
the right database for you. 


> On Mar 13, 2015, at 1:56 PM, Rose, Joseph <Joseph.Rose@childrens.harvard.edu> wrote:
> 
> Michael,
> 
> Thanks for your concern. Let me ask a few questions, since you’re implying
> that HDFS is the only way to reduce risk and ensure security, which is not
> the assumption under which I’ve been working.
> 
> A brief rundown of our problem’s characteristics, since I haven’t really
> described what we’re doing:
> * We’re read heavy, write light. It’s likely we’ll do a large import of
> the data and update less than 0.1% per day.
> * The dataset isn’t huge, at the moment (it will likely become huge in the
> future.) If I were to go the RDBMS route I’d guess it could all fit on a
> dual core i5 machine with 2G memory and a quarter terabyte disk — and that
> might be over spec’d. What we’re doing is functional and solves a certain
> problem but is also a prototype for a much larger dataset.
> * We do need security, you’re absolutely right, and the data is subject to
> HIPAA.
> * Availability should be good but we don’t have to go overboard. A couple
> of nines would be just fine.
> * We plan on running this on a fairly small VM. The VM will be backed up
> nightly.
> 
> So, with that in mind, let me make sure I’ve got this right.
> 
> Your main points were data loss and security. As I understand it, HDFS
> might be the right choice for dozens of terabytes to petabyte scale (where
> it effectively becomes impossible to do a clean backup, since the odds of
> an undetected, hardware-level error during replication are not
> insignificant, even if you can find enough space). But we’re talking gigs,
> easily & reliably replicated (I do it on my home machine all the time).
> And since it looks like HBase has a stable file system after committing
> mutations, shutting down changes, doing a backup & re-enabling mutations
> seems like a fine choice. Do you see a hole in this approach?
> 
> As for security, and as I understand it, HBase’s security model (both for
> tagging and encryption) is built into the database layer, not HDFS. We
> very much want cell-level security with roles (because HIPAA) and
> encryption (also because HIPAA) but I don’t think that has anything to do
> with the underlying filesystem. Again, is there something here I’ve missed?
> 
> When we get to 10^6+ rows we will probably build out a small cluster.
> We’re well below that threshold at the moment but will get there soon
> enough.
> 
> 
> -j
> 
> 
> On 3/13/15, 1:46 PM, "Michael Segel" <michael_segel@hotmail.com> wrote:
> 
>> Guys, 
>> 
>> More than just needing some love.
>> No HDFS… means data at risk.
>> No HDFS… means that stand alone will have security issues.
>> 
>> Patient Data? HINT: HIPAA.
>> 
>> Please think your design through and if you go w HBase… you will want to
>> build out a small cluster.
>> 
>>> On Mar 10, 2015, at 6:16 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
>>> 
>>> As Stack and Andrew said, just wanted to give you fair warning that this
>>> mode may need some love. Likewise, there are probably alternatives that
>>> run a bit lighter weight, though you flatter us with the reminder of the
>>> long feature list.
>>> 
>>> I have no problem with helping to fix and committing fixes to bugs that
>>> crop up in local mode operations. Bring 'em on!
>>> 
>>> -n
>>> 
>>> On Tue, Mar 10, 2015 at 3:56 PM, Alex Baranau <alex.baranov.v@gmail.com>
>>> wrote:
>>> 
>>>> On:
>>>> 
>>>> - Future investment in a design that scales better
>>>> 
>>>> Indeed, designing against a key-value store is different from designing
>>>> against an RDBMS.
>>>> 
>>>> I wonder if you explored the option of abstracting the storage layer and
>>>> using a "single node purposed" store until you grow enough to switch to
>>>> another one?
>>>> 
>>>> E.g. you could use LevelDB [1], which is pretty fast (and there's a Java
>>>> rewrite of it, if you need Java APIs [2]). We use it in CDAP [3] in a
>>>> standalone version to make the development environment (SDK) lighter. We
>>>> swap it with HBase in distributed mode without changing the application
>>>> code. It doesn't have coprocessors and the other HBase-specific features
>>>> you are talking about, though. But you can figure out how to bridge client
>>>> APIs with an abstraction layer (e.g. we have a common Table interface [4]).
>>>> You can even add versions on cells (see [5] for an example of how we do it).
>>>> 
>>>> Also, you could use an RDBMS behind a key-value abstraction, to start with,
>>>> while keeping your app design clean of RDBMS specifics.
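
A minimal sketch of the storage-abstraction idea in the message above, assuming a single-node backend that is swapped for an HBase-backed one later. Every name here (KeyValueTable, InMemoryTable, NoteStore, StorageSwapDemo) is hypothetical and made up for illustration; none of it is taken from CDAP's Table interface or from the HBase client API, and an in-memory sorted map stands in for LevelDB or an RDBMS.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Demo entry point: the application is wired to a backend in exactly one place.
final class StorageSwapDemo {
    public static void main(String[] args) {
        KeyValueTable table = new InMemoryTable(); // a distributed backend would be swapped in here
        NoteStore notes = new NoteStore(table);
        notes.save("row-1", "hello");
        System.out.println(notes.load("row-1"));
    }
}

// Hypothetical storage abstraction: rows hold named columns of raw bytes.
interface KeyValueTable {
    void put(String row, String column, byte[] value);
    Optional<byte[]> get(String row, String column);
}

// Single-node backend: a sorted in-memory map, standing in for a local store.
final class InMemoryTable implements KeyValueTable {
    private final Map<String, Map<String, byte[]>> rows = new TreeMap<>();

    @Override
    public void put(String row, String column, byte[] value) {
        // Copy the array so later changes by the caller do not alter stored data.
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(column, value.clone());
    }

    @Override
    public Optional<byte[]> get(String row, String column) {
        Map<String, byte[]> columns = rows.get(row);
        if (columns == null || !columns.containsKey(column)) {
            return Optional.empty();
        }
        return Optional.of(columns.get(column).clone());
    }
}

// Application code depends only on the interface, never on a concrete backend.
final class NoteStore {
    private final KeyValueTable table;

    NoteStore(KeyValueTable table) {
        this.table = table;
    }

    void save(String id, String note) {
        table.put(id, "note", note.getBytes(StandardCharsets.UTF_8));
    }

    String load(String id) {
        return table.get(id, "note")
                .map(bytes -> new String(bytes, StandardCharsets.UTF_8))
                .orElse("");
    }
}
```

Because NoteStore only sees KeyValueTable, moving to a cluster would mean writing one new implementation of that interface rather than touching application code. Features with no counterpart in the simple backend (coprocessors, for instance) would still need separate handling.
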
>>>> 
>>>> Alex Baranau
>>>> 
>>>> [1] https://github.com/google/leveldb
>>>> [2] https://github.com/dain/leveldb
>>>> [3] http://cdap.io
>>>> [4] https://github.com/caskdata/cdap/blob/develop/cdap-api/src/main/java/co/cask/cdap/api/dataset/table/Table.java
>>>> [5] https://github.com/caskdata/cdap/blob/develop/cdap-data-fabric/src/main/java/co/cask/cdap/data2/dataset2/lib/table/leveldb/LevelDBTableCore.java
>>>> 
>>>> --
>>>> 
>>>> http://cdap.io - open source framework to build and run data applications
>>>> on Hadoop & HBase
>>>> 
>>>> On Tue, Mar 10, 2015 at 8:42 AM, Rose, Joseph <
>>>> Joseph.Rose@childrens.harvard.edu> wrote:
>>>> 
>>>>> Sorry, never answered your question about versions. I have version 1.0.0
>>>>> of hbase, which has hadoop-common 2.5.1 in its lib folder.
>>>>> 
>>>>> 
>>>>> -j
>>>>> 
>>>>> 
>>>>> On 3/10/15, 11:36 AM, "Rose, Joseph" <Joseph.Rose@childrens.harvard.edu> wrote:
>>>>> 
>>>>>> I tried it and it does work now. It looks like the interface for
>>>>>> hadoop.fs.Syncable changed in March, 2012 to remove the deprecated sync()
>>>>>> method and define only hsync() instead. The same committer did the right
>>>>>> thing and removed sync() from FSDataOutputStream at the same time. The
>>>>>> remaining hsync() method calls flush() if the underlying stream doesn't
>>>>>> implement Syncable.
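
A rough sketch of the fallback behaviour described in the message above: hsync() is forwarded when the wrapped stream supports it, and flush() is called otherwise. This is not Hadoop source. SyncCapable is a hypothetical stand-in for hadoop.fs.Syncable, FallbackSyncStream is a hypothetical stand-in for the wrapper, and the two sinks are test doubles that only count calls.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Demo entry point with two test doubles that record which path was taken.
final class HsyncFallbackDemo {
    // Does NOT implement SyncCapable; counts flush() calls.
    static final class FlushOnlySink extends ByteArrayOutputStream {
        int flushes;
        @Override public void flush() { flushes++; }
    }

    // Implements SyncCapable; counts hsync() calls.
    static final class SyncingSink extends ByteArrayOutputStream implements SyncCapable {
        int syncs;
        @Override public void hsync() { syncs++; }
    }

    public static void main(String[] args) throws IOException {
        FlushOnlySink plain = new FlushOnlySink();
        new FallbackSyncStream(plain).hsync();
        SyncingSink syncing = new SyncingSink();
        new FallbackSyncStream(syncing).hsync();
        System.out.println("flush fallbacks: " + plain.flushes + ", hsync calls: " + syncing.syncs);
    }
}

// Hypothetical stand-in for a "can sync" marker interface; only hsync() is modeled.
interface SyncCapable {
    void hsync() throws IOException;
}

// Hypothetical wrapper: forwards hsync() when possible, otherwise falls back to flush().
final class FallbackSyncStream extends OutputStream {
    private final OutputStream wrapped;

    FallbackSyncStream(OutputStream wrapped) {
        this.wrapped = wrapped;
    }

    @Override public void write(int b) throws IOException { wrapped.write(b); }

    @Override public void flush() throws IOException { wrapped.flush(); }

    void hsync() throws IOException {
        if (wrapped instanceof SyncCapable) {
            ((SyncCapable) wrapped).hsync();
        } else {
            // Fallback: hands buffered bytes to the wrapped stream. Whether they
            // reach the disk depends on what that stream does in flush().
            wrapped.flush();
        }
    }
}
```

The fallback path is the one that matters for the durability question in this thread. If the wrapped stream's flush() does not force data to the device, hsync() gives no stronger guarantee than a plain flush().
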
>>>>>> 
>>>>>> 
>>>>>> -j
>>>>>> 
>>>>>> 
>>>>>> On 3/6/15, 5:24 PM, "Stack" <stack@duboce.net> wrote:
>>>>>> 
>>>>>>> On Fri, Mar 6, 2015 at 1:50 PM, Rose, Joseph <
>>>>>>> Joseph.Rose@childrens.harvard.edu> wrote:
>>>>>>> 
>>>>>>>> I think the final issue with hadoop-common (re: unimplemented sync for
>>>>>>>> local filesystems) is the one showstopper for us. We have to have
>>>>>>>> assured durability. I'm willing to devote some cycles to get it done, so
>>>>>>>> maybe I'm the one that says this problem is worthwhile.
>>>>>>>> 
>>>>>>>> 
>>>>>>> I remember that was once the case but looking in codebase now, sync calls
>>>>>>> through to ProtobufLogWriter which does a 'flush' on output (though
>>>>>>> comment says this is a noop). The output stream is an instance of
>>>>>>> FSDataOutputStream made with a RawLOS. The flush should come out here:
>>>>>>> 
>>>>>>>     public void flush() throws IOException { fos.flush(); }
>>>>>>> 
>>>>>>> ... where fos is an instance of FileOutputStream.
>>>>>>> 
>>>>>>> In sync we go on to call hflush which looks like it calls flush
>>>>>>> again.
>>>>>>> 
>>>>>>> What hadoop/hbase versions we talking about? HADOOP-8861 added the above
>>>>>>> behavior for hadoop 1.2.
>>>>>>> 
>>>>>>> Try it I'd say.
>>>>>>> 
>>>>>>> St.Ack
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>> 
>> The opinions expressed here are mine, while they may reflect a cognitive
>> thought, that is purely accidental.
>> Use at your own risk.
>> Michael Segel
>> michael_segel (AT) hotmail.com

The opinions expressed here are mine, while they may reflect a cognitive thought, that is
purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com





