Subject: Re: How does Accumulo compare to HBase
From: Ted Yu
To: "user@accumulo.apache.org"
Date: Tue, 24 Jun 2014 07:33:59 -0700

Jianshi:
How many column families and columns are you expecting (maximum) in your largest table?

Cheers

On Tue, Jun 24, 2014 at 7:29 AM, Jianshi Huang wrote:

> Hi David,
>
> I did, it's a wonderful piece of work, and for reviewing facts in a
> network it's a great tool. (And Lumify looks really nice.)
>
> However, my queries are mostly time-bound (from time A to time B), and to
> make some queries real-time (< 50ms), I have to roll out my own schema and
> index, to denormalize properties and to do aggregations incrementally. I
> don't think there are existing solutions in graph databases that can do these.
>
> And it's really fun to implement it myself. :)
>
> Please correct me if I'm wrong.
>
> Jianshi
>
> On Tue, Jun 24, 2014 at 10:10 PM, David Medinets wrote:
>
>> Did you get a chance to review http://securegraph.org/? SecureGraph is
>> an API to manipulate graphs, similar to Blueprints. Unlike Blueprints,
>> every SecureGraph method requires authorizations and visibilities.
>> SecureGraph also supports multivalued properties as well as property
>> metadata.
>>
>> On Tue, Jun 24, 2014 at 9:51 AM, Jianshi Huang wrote:
>>
>>> Wow, so many replies and very educational. Thank you all!
>>>
>>> I'm working on a graph backend, and I hope the same infrastructure can
>>> support:
>>>
>>> 1) interactive graph exploration and queries
>>>
>>> Answering what the interactions are among N users from time A to time B,
>>> and how users are connected (now and before).
>>>
>>> 2) real-time (<100ms) feature calculation (aggregation, matching) in a
>>> network of accounts
>>>
>>> Answering questions like: what's the ratio of newly registered accounts
>>> in my 'connected' (need flexible definition) network, and how fast does it
>>> change; does the network have a path satisfying A(CN) -> B(IT) -> C(US)
>>> where the age of the path is less than 3 days; etc.
>>>
>>> 3) offline simulation of events or offline calculation of new features
>>> (used for building models), so I need to take snapshots and also save
>>> point-in-time data
>>>
>>> Having them all in one infrastructure will greatly simplify
>>> the implementation.
>>>
>>> BTW, I'm working for PayPal, Risk Data Science. (All questions above are
>>> fake and are not related to PayPal :)
>>>
>>> I made a prototype in the last two weeks for purpose 1), and my feeling
>>> about Accumulo is exactly what many of you have said: it just works! Very
>>> little admin work, clean and clear documentation and APIs. One thing I
>>> haven't got right is high-speed ingestion; I only got 100K rows/sec/node,
>>> but it's already very satisfying. :)
>>>
>>> BTW, from Mike's slides it seems HBase is much faster in read throughput
>>> if the number of columns is small. Any comments? What about latency? Can I
>>> cache all data in memory in Accumulo to reduce latency for cold data (say I
>>> just restarted my cluster)?
>>>
>>> Jianshi
>>>
>>> On Tue, Jun 24, 2014 at 10:41 AM, William Slacum <
>>> wilhelm.von.cloud@accumulo.net> wrote:
>>>
>>>> I think first and foremost, how has writing your application been? Is
>>>> it something you can easily onboard other people for? Does it seem stable
>>>> enough?
>>>> If you can answer those questions positively, I think you have a
>>>> winning situation.
>>>>
>>>> The big three Hadoop vendors (Cloudera, Hortonworks and MapR) all
>>>> provide some level of support for Accumulo, so it has the pedigree of other
>>>> members of the Hadoop ecosystem.
>>>>
>>>> Regarding the performance, I think Mike's presentation needs some
>>>> context. He can definitely provide more context than the rest of us (and
>>>> possibly Sean or Bill |-|), but I think one thing he was driving home is
>>>> that out of the box, Accumulo is configured to run on someone's laptop.
>>>> There are adjustments to be made when running at any scale greater than a
>>>> dev machine, and they may not be documented clearly.
>>>>
>>>> On Mon, Jun 23, 2014 at 8:16 PM, Tejinder S Luthra wrote:
>>>>
>>>>> Mike did a pretty good presentation on the performance comparison between
>>>>> Accumulo and HBase. Again, not official IMO, but it is pretty detailed in
>>>>> the approach taken and is an apples-to-apples comparison:
>>>>> http://www.slideshare.net/AccumuloSummit/10-30-drob
>>>>>
>>>>> From: Jeremy Kepner
>>>>> To: user@accumulo.apache.org
>>>>> Date: 06/23/2014 07:42 PM
>>>>> Subject: Re: How does Accumulo compare to HBase
>>>>> ------------------------------
>>>>>
>>>>> Performance is probably the largest difference between Accumulo and
>>>>> HBase.
>>>>>
>>>>> Accumulo can ingest/scan at a rate of 800K entries/sec/node.
>>>>> This performance scales well into the hundreds of nodes to deliver
>>>>> 100M+ entries/sec.
>>>>>
>>>>> There are no recent HBase benchmarks and none in the peer-reviewed
>>>>> literature.
>>>>> Old data suggests that HBase performance is ~1% of Accumulo
>>>>> performance.
>>>>>
>>>>> In short, one can often replace a 20+ node database with
>>>>> a single node Accumulo database.
>>>>>
>>>>> On Tue, Jun 24, 2014 at 01:55:54AM +0800, Jianshi Huang wrote:
>>>>> > Er... basically I need to explain to my manager why I'm choosing
>>>>> > Accumulo instead of HBase.
>>>>> >
>>>>> > So what are the pros and cons of Accumulo vs. HBase? (btw HBase 0.98
>>>>> > also got cell-level security, modeled after Accumulo)
>>>>> >
>>>>> > --
>>>>> > Jianshi Huang
>>>>> >
>>>>> > LinkedIn: jianshi
>>>>> > Twitter: @jshuang
>>>>> > Github & Blog: http://huangjs.github.com/
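
As an illustration of the time-bound queries Jianshi describes in the thread (interactions "from time A to time B"), here is a minimal sketch of one common approach in sorted key-value stores: put a zero-padded timestamp in the row key so that a time window becomes one contiguous key range. It uses a plain in-memory sorted list, not any Accumulo or HBase API. The names `row_key`, `SortedTable`, and `interactions_between`, and the `account:timestamp` key layout, are assumptions made for this example; they are not the schema used by anyone in the thread.

```python
import bisect


def row_key(account_id: str, epoch_millis: int) -> str:
    """Build a row key whose lexicographic order matches time order.

    Hypothetical layout: "<account_id>:<13-digit zero-padded millis>".
    Assumes account_id itself contains no ':' character.
    """
    return f"{account_id}:{epoch_millis:013d}"


class SortedTable:
    """Toy stand-in for a sorted key-value store (keys kept in sorted order)."""

    def __init__(self):
        self._keys = []     # sorted list of row keys
        self._values = {}   # row key -> value

    def put(self, key: str, value: str) -> None:
        # Insert the key in sorted position only the first time it is seen.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start_key: str, end_key: str):
        """Return (key, value) pairs with start_key <= key < end_key."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, end_key)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


def interactions_between(table: SortedTable, account_id: str,
                         t_start: int, t_end: int):
    """Fetch one account's entries with t_start <= timestamp < t_end."""
    return table.scan(row_key(account_id, t_start), row_key(account_id, t_end))
```

Because the timestamp is zero-padded to a fixed width, string comparison and numeric comparison agree, so the window lookup is a single range scan rather than a filter over all of an account's rows. Whether this layout fits a given workload depends on the other access patterns the table has to serve.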