From: Ted Yu <yuzhihong@gmail.com>
To: "user@accumulo.apache.org"
Reply-To: user@accumulo.apache.org
Date: Tue, 24 Jun 2014 10:50:40 -0700
Subject: Re: How does Accumulo compare to HBase
References: <20140623233721.GA6465@ll.mit.edu>

Thanks for the update.

In your experiment so far, how many columns were involved?

Cheers


On Tue, Jun 24, 2014 at 10:44 AM, Jianshi Huang wrote:

> +Update:
>
> Possibly hundreds of billions of columns.
>
>
> On Wed, Jun 25, 2014 at 12:03 AM, Jianshi Huang wrote:
>
>> Hi Ted,
>>
>> CF: maybe dozens
>> Columns: billions (rowkey = nodeId, CF = event type, CQ = Index+eventId)
>>
>> Make sense?
>>
>> Jianshi
>>
>>
>> On Tue, Jun 24, 2014 at 10:33 PM, Ted Yu wrote:
>>
>>> Jianshi:
>>> How many column families and columns are you expecting (maximum) in
>>> your largest table?
>>>
>>> Cheers
>>>
>>>
>>> On Tue, Jun 24, 2014 at 7:29 AM, Jianshi Huang wrote:
>>>
>>>> Hi David,
>>>>
>>>> I did, it's a wonderful piece of work, and for reviewing facts in a
>>>> network it's a great tool. (And Lumify looks really nice.)
>>>>
>>>> However, my queries are mostly time-bound (from time A to time B),
>>>> and to make some queries real-time (< 50ms), I have to roll out my
>>>> own schema and index, to denormalize properties and to do
>>>> aggregations incrementally. I don't think there are existing graph
>>>> database solutions that can do these.
>>>>
>>>> And it's really fun to implement it myself. :)
>>>>
>>>> Please correct me if I'm wrong.
>>>>
>>>> Jianshi
>>>>
>>>>
>>>> On Tue, Jun 24, 2014 at 10:10 PM, David Medinets
>>>> <david.medinets@gmail.com> wrote:
>>>>
>>>>> Did you get a chance to review http://securegraph.org/? SecureGraph
>>>>> is an API to manipulate graphs, similar to Blueprints. Unlike
>>>>> Blueprints, every SecureGraph method requires authorizations and
>>>>> visibilities. SecureGraph also supports multivalued properties as
>>>>> well as property metadata.
>>>>>
>>>>>
>>>>> On Tue, Jun 24, 2014 at 9:51 AM, Jianshi Huang
>>>>> <jianshi.huang@gmail.com> wrote:
>>>>>
>>>>>> Wow, so many replies and very educational. Thank you all!
>>>>>>
>>>>>> I'm working on a graph backend, and I hope the same infrastructure
>>>>>> can support:
>>>>>>
>>>>>> 1) interactive graph exploration and queries
>>>>>>
>>>>>> Answering what the interactions are among N users from time A to
>>>>>> time B, and how users are connected (now and before).
>>>>>>
>>>>>> 2) real-time (<100ms) feature calculation (aggregation, matching)
>>>>>> in a network of accounts
>>>>>>
>>>>>> Answering questions like: what's the ratio of newly registered
>>>>>> accounts in my 'connected' (need flexible definition) network, and
>>>>>> how fast does it change; does the network have a path satisfying
>>>>>> A(CN) -> B(IT) -> C(US) where the age of the path is less than 3
>>>>>> days; etc.
>>>>>>
>>>>>> 3) offline simulation of events or offline calculation of new
>>>>>> features (used for building models), so I need to take snapshots
>>>>>> and also save point-in-time data
>>>>>>
>>>>>> Having them all-in-one in the same infrastructure will greatly
>>>>>> simplify the implementation.
>>>>>>
>>>>>> BTW, I'm working for PayPal, Risk Data Science. (All questions
>>>>>> above are fake and are not related to PayPal :)
>>>>>>
>>>>>> I made a prototype in the last two weeks for purpose 1) and my
>>>>>> feeling about Accumulo is exactly what many of you have said: it
>>>>>> just works! Very little admin work, clean and clear documentation
>>>>>> and APIs. One thing I haven't got right is high-speed ingestion; I
>>>>>> only got 100K rows/sec/node, but it's already very satisfying. :)
>>>>>>
>>>>>> BTW, from Mike's slides it seems HBase is much faster in read
>>>>>> throughput if the number of columns is small. Any comments? What
>>>>>> about latency? Can I cache all data in memory in Accumulo to
>>>>>> reduce latency for cold data (say I just restarted my cluster)?
>>>>>>
>>>>>>
>>>>>> Jianshi
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 24, 2014 at 10:41 AM, William Slacum
>>>>>> <wilhelm.von.cloud@accumulo.net> wrote:
>>>>>>
>>>>>>> I think first and foremost, how has writing your application
>>>>>>> been? Is it something you can easily onboard other people for?
>>>>>>> Does it seem stable enough? If you can answer those questions
>>>>>>> positively, I think you have a winning situation.
>>>>>>>
>>>>>>> The big three Hadoop vendors (Cloudera, Hortonworks and MapR) all
>>>>>>> provide some level of support for Accumulo, so it has the
>>>>>>> pedigree of other members of the Hadoop ecosystem.
>>>>>>>
>>>>>>> Regarding the performance, I think Mike's presentation needs some
>>>>>>> context. He can definitely provide more context than the rest of
>>>>>>> us (and possibly Sean or Bill |-|), but I think one thing he was
>>>>>>> driving home is that out of the box, Accumulo is configured to
>>>>>>> run on someone's laptop. There are adjustments to be made when
>>>>>>> running at any scale greater than a dev machine, and they may not
>>>>>>> be documented clearly.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 23, 2014 at 8:16 PM, Tejinder S Luthra
>>>>>>> <tsluthra@us.ibm.com> wrote:
>>>>>>>
>>>>>>>> Mike did a pretty good presentation on performance comparison
>>>>>>>> between Accumulo and HBase. Again, not official IMO, but it is
>>>>>>>> pretty detailed in the approach taken and is an apples-to-apples
>>>>>>>> comparison:
>>>>>>>> http://www.slideshare.net/AccumuloSummit/10-30-drob
>>>>>>>>
>>>>>>>>
>>>>>>>> From: Jeremy Kepner
>>>>>>>> To:
>>>>>>>> Date: 06/23/2014 07:42 PM
>>>>>>>> Subject: Re: How does Accumulo compare to HBase
>>>>>>>> ------------------------------
>>>>>>>>
>>>>>>>> Performance is probably the largest difference between Accumulo
>>>>>>>> and HBase.
>>>>>>>>
>>>>>>>> Accumulo can ingest/scan at a rate of 800K entries/sec/node.
>>>>>>>> This performance scales well into the hundreds of nodes to
>>>>>>>> deliver 100M+ entries/sec.
>>>>>>>>
>>>>>>>> There are no recent HBase benchmarks and none in the
>>>>>>>> peer-reviewed literature.
>>>>>>>> Old data suggests that HBase performance is ~1% of Accumulo
>>>>>>>> performance.
>>>>>>>>
>>>>>>>> In short, one can often replace a 20+ node database with
>>>>>>>> a single node Accumulo database.
>>>>>>>>
>>>>>>>> On Tue, Jun 24, 2014 at 01:55:54AM +0800, Jianshi Huang wrote:
>>>>>>>> > Er... basically I need to explain to my manager why I'm
>>>>>>>> > choosing Accumulo instead of HBase.
>>>>>>>> >
>>>>>>>> > So what are the pros and cons of Accumulo vs. HBase? (btw
>>>>>>>> > HBase 0.98 also got cell-level security, modeled after
>>>>>>>> > Accumulo)
>>>>>>>> >
>>>>>>>> > --
>>>>>>>> > Jianshi Huang
>>>>>>>> >
>>>>>>>> > LinkedIn: jianshi
>>>>>>>> > Twitter: @jshuang
>>>>>>>> > Github & Blog: http://huangjs.github.com/