accumulo-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: How does Accumulo compare to HBase
Date Tue, 24 Jun 2014 17:50:40 GMT
Thanks for the update.

In your experiment so far, how many columns were involved?

Cheers


On Tue, Jun 24, 2014 at 10:44 AM, Jianshi Huang <jianshi.huang@gmail.com>
wrote:

> +Update:
>
> Possibly hundreds of billions of columns.
>
>
> On Wed, Jun 25, 2014 at 12:03 AM, Jianshi Huang <jianshi.huang@gmail.com>
> wrote:
>
>> Hi Ted,
>>
>> CF: maybe dozens
>> Columns: billions (rowkey = nodeId, CF = event type, CQ = Index+eventId)
>>
>> Make sense?
>>
>> Jianshi
>>
>>
>> On Tue, Jun 24, 2014 at 10:33 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>
>>> Jianshi:
>>> How many column families and columns are you expecting (maximum) in your
>>> largest table?
>>>
>>> Cheers
>>>
>>>
>>> On Tue, Jun 24, 2014 at 7:29 AM, Jianshi Huang <jianshi.huang@gmail.com>
>>> wrote:
>>>
>>>> Hi David,
>>>>
>>>> I did. It's a wonderful piece of work, and for reviewing facts in a
>>>> network it's a great tool. (And Lumify looks really nice.)
>>>>
>>>> However, my queries are mostly time-bound (from time A to time B), and
>>>> to make some queries real-time (< 50ms), I have to roll out my own schema
>>>> and index, to denormalize properties, and to do aggregations incrementally.
>>>> I don't think any existing graph database solution can do all of this.
>>>>
>>>> And it's really fun to implement it myself. :)
>>>>
>>>> Please correct me if I'm wrong
>>>>
>>>> Jianshi
>>>>
>>>>
>>>>
>>>> On Tue, Jun 24, 2014 at 10:10 PM, David Medinets <
>>>> david.medinets@gmail.com> wrote:
>>>>
>>>>> Did you get a chance to review http://securegraph.org/? SecureGraph
>>>>> is an API to manipulate graphs, similar to Blueprints. Unlike Blueprints,
>>>>> every SecureGraph method requires authorizations and visibilities.
>>>>> SecureGraph also supports multivalued properties as well as property
>>>>> metadata.
>>>>>
>>>>>
>>>>> On Tue, Jun 24, 2014 at 9:51 AM, Jianshi Huang <
>>>>> jianshi.huang@gmail.com> wrote:
>>>>>
>>>>>> Wow, so many replies and very educational. Thank you all!
>>>>>>
>>>>>> I'm working on a graph backend, and I hope the same infrastructure
>>>>>> can support:
>>>>>>
>>>>>> 1) interactive graph exploration and queries
>>>>>>
>>>>>> Answering what the interactions are among N users from time A to
>>>>>> time B, and how users are connected (now and before).
>>>>>>
>>>>>> 2) real-time (<100ms) feature calculation (aggregation, matching)
>>>>>> in a network of accounts
>>>>>>
>>>>>> Answering questions like: what's the ratio of newly registered
>>>>>> accounts in my 'connected' (needs a flexible definition) network, and
>>>>>> how fast does it change; does the network have a path satisfying
>>>>>> A(CN) -> B(IT) -> C(US) where the age of the path is less than 3 days; etc.
>>>>>>
>>>>>> 3) offline simulation of events or offline calculation of new
>>>>>> features (used for building models), so I need to take snapshots
>>>>>> and also save point-in-time data
>>>>>>
>>>>>> Having them all-in-one in the same infrastructure will greatly
>>>>>> simplify the implementation.
>>>>>>
>>>>>> BTW, I'm working for PayPal, Risk Data Science. (All questions above
>>>>>> are fake and are not related to PayPal :)
>>>>>>
>>>>>> I made a prototype in the last two weeks for purpose 1), and my
>>>>>> feeling about Accumulo is exactly what many of you have said: it just
>>>>>> works! Very little admin work, clean and clear documentation and APIs.
>>>>>> One thing I haven't got right is high-speed ingestion: I only got
>>>>>> 100K rows/sec/node, but that's already very satisfying. :)
>>>>>>
>>>>>> BTW, from Mike's slides it seems HBase is much faster in read
>>>>>> throughput if the number of columns is small. Any comments? What
>>>>>> about latency? Can I cache all data in memory in Accumulo to reduce
>>>>>> latency for cold data (say I just restarted my cluster)?
>>>>>>
>>>>>>
>>>>>> Jianshi
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 24, 2014 at 10:41 AM, William Slacum <
>>>>>> wilhelm.von.cloud@accumulo.net> wrote:
>>>>>>
>>>>>>> I think first and foremost, how has writing your application
>>>>>>> been? Is it something you can easily onboard other people for?
>>>>>>> Does it seem stable enough? If you can answer those questions
>>>>>>> positively, I think you have a winning situation.
>>>>>>>
>>>>>>> The big three Hadoop vendors (Cloudera, Hortonworks and MapR)
>>>>>>> all provide some level of support for Accumulo, so it has the
>>>>>>> pedigree of other members of the Hadoop ecosystem.
>>>>>>>
>>>>>>> Regarding the performance, I think Mike's presentation needs
>>>>>>> some context. He can definitely provide more context than the rest
>>>>>>> of us (and possibly Sean or Bill |-|), but I think one thing he was
>>>>>>> driving home is that out of the box, Accumulo is configured to run
>>>>>>> on someone's laptop. There are adjustments to be made when running
>>>>>>> at any scale greater than a dev machine and they may not be
>>>>>>> documented clearly.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 23, 2014 at 8:16 PM, Tejinder S Luthra <
>>>>>>> tsluthra@us.ibm.com> wrote:
>>>>>>>
>>>>>>>> Mike did a pretty good presentation on performance comparison
>>>>>>>> between Accumulo / HBase. Again, not official IMO, but it is pretty
>>>>>>>> detailed in the approach taken and is an apples-to-apples comparison:
>>>>>>>> http://www.slideshare.net/AccumuloSummit/10-30-drob
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> From: Jeremy Kepner <kepner@ll.mit.edu>
>>>>>>>> To: <user@accumulo.apache.org>
>>>>>>>> Date: 06/23/2014 07:42 PM
>>>>>>>> Subject: Re: How does Accumulo compare to HBase
>>>>>>>> ------------------------------
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Performance is probably the largest difference between Accumulo
>>>>>>>> and HBase.
>>>>>>>>
>>>>>>>> Accumulo can ingest/scan at a rate of 800K entries/sec/node.
>>>>>>>> This performance scales well into the hundreds of nodes to
>>>>>>>> deliver 100M+ entries/sec.
>>>>>>>>
>>>>>>>> There are no recent HBase benchmarks and none in the peer-reviewed
>>>>>>>> literature.
>>>>>>>> Old data suggests that HBase performance is ~1% of Accumulo
>>>>>>>> performance.
>>>>>>>>
>>>>>>>> In short, one can often replace a 20+ node database with
>>>>>>>> a single node Accumulo database.
>>>>>>>>
>>>>>>>> On Tue, Jun 24, 2014 at 01:55:54AM +0800, Jianshi Huang wrote:
>>>>>>>> > Er... basically I need to explain to my manager why we'd choose
>>>>>>>> > Accumulo instead of HBase.
>>>>>>>> >
>>>>>>>> > So what are the pros and cons of Accumulo vs. HBase? (btw HBase
>>>>>>>> > 0.98 also got cell-level security, modeled after Accumulo)
>>>>>>>> >
>>>>>>>> > --
>>>>>>>> > Jianshi Huang
>>>>>>>> >
>>>>>>>> > LinkedIn: jianshi
>>>>>>>> > Twitter: @jshuang
>>>>>>>> > Github & Blog: http://huangjs.github.com/
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jianshi Huang
>>>>>>
>>>>>> LinkedIn: jianshi
>>>>>> Twitter: @jshuang
>>>>>> Github & Blog: http://huangjs.github.com/
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Jianshi Huang
>>>>
>>>> LinkedIn: jianshi
>>>> Twitter: @jshuang
>>>> Github & Blog: http://huangjs.github.com/
>>>>
>>>
>>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>
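
The key layout described earlier in the thread (rowkey = nodeId, CF = event type, CQ = Index+eventId) and the time-bound queries it is meant to serve can be illustrated with a small sketch. This is only an illustration under stated assumptions: the thread does not say what "Index" is, so the sketch assumes it is a non-negative millisecond timestamp, zero-padded so that lexicographic order matches time order. The helper names (make_cq, make_key, scan_time_range) are hypothetical, and a sorted Python list stands in for a sorted key-value table; no Accumulo or HBase API is used.

```python
from bisect import bisect_left

# Assumption: "Index" is a non-negative millisecond timestamp. Zero-padding it
# to a fixed width makes string order agree with numeric (time) order.
TS_WIDTH = 13


def make_cq(ts_millis: int, event_id: str) -> str:
    """Hypothetical column qualifier: zero-padded timestamp, then the event id."""
    return f"{ts_millis:0{TS_WIDTH}d}:{event_id}"


def make_key(node_id: str, event_type: str, ts_millis: int, event_id: str):
    """Hypothetical (row, column family, column qualifier) key tuple."""
    return (node_id, event_type, make_cq(ts_millis, event_id))


def scan_time_range(sorted_keys, node_id, event_type, start_ms, end_ms):
    """Return keys for one node and event type with start_ms <= ts < end_ms.

    sorted_keys must be sorted; a sorted list stands in for the table here.
    """
    lo = (node_id, event_type, f"{start_ms:0{TS_WIDTH}d}")
    hi = (node_id, event_type, f"{end_ms:0{TS_WIDTH}d}")
    # Both bounds are bare timestamps, which sort before any qualifier
    # that starts with the same timestamp, so start is inclusive and
    # end is exclusive.
    return sorted_keys[bisect_left(sorted_keys, lo):bisect_left(sorted_keys, hi)]


keys = sorted([
    make_key("n1", "login", 1000, "e1"),
    make_key("n1", "login", 1500, "e2"),
    make_key("n1", "login", 2000, "e3"),
    make_key("n1", "pay", 1200, "e4"),
    make_key("n2", "login", 1100, "e5"),
])
in_range = scan_time_range(keys, "n1", "login", 1000, 2000)
```

Because the timestamp leads the qualifier, a query "from time A to time B" for one node and event type becomes a single contiguous range over the sorted keys rather than a filter over every column in the row.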
