Mailing-List: contact user-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@accumulo.apache.org
Received-SPF: pass (nike.apache.org: domain of yuzhihong@gmail.com designates
 209.85.160.180 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CACA1tW+9A_+P58ypYvojFbkvH39JjvgWH8O_Ctbig0_LwEfnJg@mail.gmail.com>
References: 
 <CACA1tWKLoi+Wc67FH4nFU_0aYBWrdVa+u-gPGR=rCf8pQ0zWNw@mail.gmail.com>
	<20140623233721.GA6465@ll.mit.edu>
	<OF9DB4A058.4625A1DE-ON85257D01.00015FCD-85257D01.000186F9@us.ibm.com>
	<CAMz+DuvtnxzOw38jZpuFiHKFX+enM31LjunE6oaiHTm1C-8Uyw@mail.gmail.com>
	<CACA1tW+VGsndmzQDNfimGNppxPuvyW-z5u8G3wak4YhwN2Su6A@mail.gmail.com>
	<CAOiJXP4eaYzQNkn1CAZ807dSYdwGTUx2JBz7bXzP-_KdoRK_Rw@mail.gmail.com>
	<CACA1tW+9A_+P58ypYvojFbkvH39JjvgWH8O_Ctbig0_LwEfnJg@mail.gmail.com>
Date: Tue, 24 Jun 2014 07:33:59 -0700
Message-ID: 
 <CALte62wdp=pBidvVs5ZJoU8-Dw=q=ftLq3nXCFett6ZohHDSoA@mail.gmail.com>
Subject: Re: How does Accumulo compare to HBase
From: Ted Yu <yuzhihong@gmail.com>
To: "user@accumulo.apache.org" <user@accumulo.apache.org>
Content-Type: multipart/related; boundary=089e01635714bc8bc604fc95d97b

--089e01635714bc8bc604fc95d97b
Content-Type: multipart/alternative; boundary=089e01635714bc8bc304fc95d97a

--089e01635714bc8bc304fc95d97a
Content-Type: text/plain; charset=UTF-8

Jianshi:
How many column families and columns do you expect (at most) in your
largest table?

Cheers


On Tue, Jun 24, 2014 at 7:29 AM, Jianshi Huang <jianshi.huang@gmail.com>
wrote:

> Hi David,
>
> I did. It's a wonderful piece of work, and for reviewing facts in a
> network it's a great tool. (And Lumify looks really nice.)
>
> However, my queries are mostly time-bound (from time A to time B), and to
> make some queries real-time (< 50ms), I have to roll out my own schema and
> index, denormalize properties, and do aggregations incrementally. I don't
> think any existing graph database solution can do these.
>
> And it's really fun to implement it myself. :)
>
> Please correct me if I'm wrong
>
> Jianshi
>
>
>
> On Tue, Jun 24, 2014 at 10:10 PM, David Medinets
> <david.medinets@gmail.com> wrote:
>
>> Did you get a chance to review http://securegraph.org/? SecureGraph is
>> an API to manipulate graphs, similar to Blueprints. Unlike Blueprints,
>> every SecureGraph method requires authorizations and visibilities.
>> SecureGraph also supports multivalued properties as well as property
>> metadata.
>>
>>
>> On Tue, Jun 24, 2014 at 9:51 AM, Jianshi Huang <jianshi.huang@gmail.com>
>> wrote:
>>
>>> Wow, so many replies and very educational. Thank you all!
>>>
>>> I'm working on a graph backend, and I hope the same infrastructure can
>>> support:
>>>
>>> 1) interactive graph exploration and queries
>>>
>>> Answering what are the interactions among N users from time A to time B,
>>> and how are users connected (now and before).
>>>
>>> 2) real-time (<100ms) feature calculation (aggregation, matching) in a
>>> network of accounts
>>>
>>> Answering questions like: what's the ratio of newly registered accounts
>>> in my 'connected' (need flexible definition) network, and how fast does it
>>> change; does the network have a path satisfying A(CN) -> B(IT) -> C(US)
>>> where the age of the path is less than 3 days; etc.
>>>
>>> 3) offline simulation of events or offline calculation of new features
>>> (used for building models), so I need to take snapshots and also save
>>> point-in-time data
>>>
>>> Having them all-in-one in the same infrastructure will greatly simplify
>>> the implementation.
>>>
>>> BTW, I'm working for PayPal, Risk Data Science. (All questions above are
>>> fake and are not related to PayPal :)
>>>
>>> I made a prototype in the last two weeks for purpose 1), and my feeling
>>> about Accumulo is exactly what many of you have said: it just works! Very
>>> little admin work, clean and clear documentation and APIs. One thing I
>>> haven't got right is high-speed ingestion: I only got 100K rows/sec/node,
>>> but it's already very satisfying. :)
>>>
>>> BTW, from Mike's slides it seems HBase is much faster in read throughput
>>> if the number of columns is small. Any comments? What about latency? Can I
>>> cache all data in memory in Accumulo to reduce latency for cold data (say I
>>> just restarted my cluster)?
>>>
>>>
>>> Jianshi
>>>
>>>
>>>
>>>
>>> On Tue, Jun 24, 2014 at 10:41 AM, William Slacum
>>> <wilhelm.von.cloud@accumulo.net> wrote:
>>>
>>>> I think first and foremost, how has writing your application been? Is
>>>> it something you can easily onboard other people for? Does it seem stable
>>>> enough? If you can answer those questions positively, I think you have a
>>>> winning situation.
>>>>
>>>> The big three Hadoop vendors (Cloudera, Hortonworks and MapR) all
>>>> provide some level of support for Accumulo, so it has the pedigree of other
>>>> members of the Hadoop ecosystem.
>>>>
>>>> Regarding the performance, I think Mike's presentation needs some
>>>> context. He can definitely provide more context than the rest of us (and
>>>> possibly Sean or Bill |-|), but I think one thing he was driving home is
>>>> that out of the box, Accumulo is configured to run on someone's laptop.
>>>> There are adjustments to be made when running at any scale greater than a
>>>> dev machine and they may not be documented clearly.
>>>>
>>>>
>>>> On Mon, Jun 23, 2014 at 8:16 PM, Tejinder S Luthra
>>>> <tsluthra@us.ibm.com> wrote:
>>>>
>>>>> Mike did a pretty good presentation comparing the performance of
>>>>> Accumulo and HBase. Again, not official IMO, but it is pretty detailed
>>>>> in the approach taken and is an apples-to-apples comparison:
>>>>> http://www.slideshare.net/AccumuloSummit/10-30-drob
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> From: Jeremy Kepner <kepner@ll.mit.edu>
>>>>> To: <user@accumulo.apache.org>
>>>>> Date: 06/23/2014 07:42 PM
>>>>> Subject: Re: How does Accumulo compare to HBase
>>>>> ------------------------------
>>>>>
>>>>>
>>>>>
>>>>> Performance is probably the largest difference between Accumulo and
>>>>> HBase.
>>>>>
>>>>> Accumulo can ingest/scan at a rate of 800K entries/sec/node.
>>>>> This performance scales well into the hundreds of nodes to deliver
>>>>> 100M+ entries/sec.
>>>>>
>>>>> There are no recent HBase benchmarks and none in the peer-reviewed
>>>>> literature.
>>>>> Old data suggests that HBase performance is ~1% of Accumulo
>>>>> performance.
>>>>>
>>>>> In short, one can often replace a 20+ node database with
>>>>> a single node Accumulo database.
>>>>>
>>>>> On Tue, Jun 24, 2014 at 01:55:54AM +0800, Jianshi Huang wrote:
>>>>> > Er... basically I need to explain to my manager why to choose
>>>>> > Accumulo instead of HBase.
>>>>> >
>>>>> > So what are the pros and cons of Accumulo vs. HBase? (btw HBase 0.98
>>>>> > also got cell-level security, modeled after Accumulo)
>>>>> >
>>>>> > --
>>>>> > Jianshi Huang
>>>>> >
>>>>> > LinkedIn: jianshi
>>>>> > Twitter: @jshuang
>>>>> > Github & Blog: http://huangjs.github.com/
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>
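
On the time-bound queries quoted above (from time A to time B): Accumulo
keeps keys in sorted order, so one common way to serve such a query is to
put a fixed-width timestamp in the row key and scan a contiguous key range.
The sketch below only simulates that idea in plain Python with a sorted
list; it does not use the Accumulo API. The key layout, the '|' separator,
the function names (row_key, scan_range) and the sample accounts are all
assumptions made for illustration, not anything taken from this thread.

```python
from bisect import bisect_left, bisect_right

def row_key(account: str, epoch_ms: int) -> str:
    """Hypothetical layout: '<account>|<13-digit zero-padded epoch millis>'.

    Zero-padding makes lexicographic order agree with numeric time order.
    """
    return f"{account}|{epoch_ms:013d}"

def scan_range(sorted_keys, account, start_ms, end_ms):
    """Return keys for one account with start_ms <= t <= end_ms.

    Stands in for a range scan over a sorted key-value store.
    """
    lo = bisect_left(sorted_keys, row_key(account, start_ms))
    hi = bisect_right(sorted_keys, row_key(account, end_ms))
    return sorted_keys[lo:hi]

# Made-up sample events: (account, epoch milliseconds).
events = [
    ("alice", 1403600000000),
    ("alice", 1403603600000),
    ("alice", 1403690000000),
    ("bob", 1403601000000),
]
table = sorted(row_key(a, t) for a, t in events)

# All of alice's events between time A and time B (inclusive).
hits = scan_range(table, "alice", 1403600000000, 1403610000000)
print(hits)
```

Because every key in the requested window is adjacent in sort order, the
lookup touches only that slice. Whether the account or the time should come
first in a real key depends on the dominant query, so treat this layout as
one option rather than a recommendation.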

--089e01635714bc8bc304fc95d97a--

--089e01635714bc8bc604fc95d97b--