accumulo-user mailing list archives

From Jianshi Huang <jianshi.hu...@gmail.com>
Subject Re: How does Accumulo compare to HBase
Date Tue, 24 Jun 2014 17:44:58 GMT
+Update:

Possibly hundreds of billions of columns.
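
A minimal sketch of the key layout from the quoted message below (rowkey = nodeId, CF = event type, CQ = Index+eventId), assuming the index part of the CQ is an event timestamp in milliseconds. That assumption, the `column_qualifier` helper, and the `SortedTable` class are all hypothetical: the class is a toy in-memory stand-in for a sorted key-value table, not the Accumulo API. It only shows why a fixed-width big-endian timestamp prefix turns a "from time A to time B" query into one contiguous range over sorted keys.

```python
import bisect
import struct


def column_qualifier(timestamp_ms: int, event_id: str) -> bytes:
    """Encode CQ = index + eventId (index assumed to be a timestamp).

    The big-endian unsigned 64-bit prefix makes byte order match time order.
    """
    return struct.pack(">Q", timestamp_ms) + event_id.encode("utf-8")


class SortedTable:
    """Toy sorted key-value table keyed by (row, cf, cq). Hypothetical."""

    def __init__(self):
        self._keys = []    # sorted list of (row, cf, cq) tuples
        self._values = {}  # key -> value

    def put(self, row: bytes, cf: bytes, cq: bytes, value: bytes) -> None:
        key = (row, cf, cq)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan_time_range(self, row: bytes, cf: bytes, start_ms: int, end_ms: int):
        """Return (key, value) pairs with start_ms <= timestamp < end_ms."""
        lo = bisect.bisect_left(self._keys, (row, cf, struct.pack(">Q", start_ms)))
        hi = bisect.bisect_left(self._keys, (row, cf, struct.pack(">Q", end_ms)))
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


table = SortedTable()
table.put(b"node42", b"login", column_qualifier(1000, "e1"), b"a")
table.put(b"node42", b"login", column_qualifier(2000, "e2"), b"b")
table.put(b"node42", b"login", column_qualifier(3000, "e3"), b"c")
table.put(b"node42", b"payment", column_qualifier(1500, "e4"), b"d")

# Events of type "login" on node42 with 1000 <= timestamp < 3000.
hits = table.scan_time_range(b"node42", b"login", 1000, 3000)
print([value for _, value in hits])
```

With billions of columns per row, the point of such a layout would be that a time-bound query touches only the slice of the row between the two timestamps rather than the whole row.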


On Wed, Jun 25, 2014 at 12:03 AM, Jianshi Huang <jianshi.huang@gmail.com>
wrote:

> Hi Ted,
>
> CF: maybe dozens
> Columns: billions (rowkey = nodeId, CF = event type, CQ = Index+eventId)
>
> Make sense?
>
> Jianshi
>
>
> On Tue, Jun 24, 2014 at 10:33 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>
>> Jianshi:
>> How many column families and columns are you expecting (maximum) in your
>> largest table ?
>>
>> Cheers
>>
>>
>> On Tue, Jun 24, 2014 at 7:29 AM, Jianshi Huang <jianshi.huang@gmail.com>
>> wrote:
>>
>>> Hi David,
>>>
>>> I did, it's a wonderful piece of work, and for reviewing facts in a
>>> network it's a great tool. (And Lumify looks really nice.)
>>>
>>> However, my queries are mostly time-bound (from time A to time B), and
>>> to make some queries real-time (< 50ms), I have to roll out my own schema and
>>> index, denormalize properties, and do aggregations incrementally. I
>>> don't think any existing graph database solution can do all of these.
>>>
>>> And it's really fun to implement it myself. :)
>>>
>>> Please correct me if I'm wrong
>>>
>>> Jianshi
>>>
>>>
>>>
>>> On Tue, Jun 24, 2014 at 10:10 PM, David Medinets <
>>> david.medinets@gmail.com> wrote:
>>>
>>>> Did you get a chance to review http://securegraph.org/? SecureGraph is
>>>> an API to manipulate graphs, similar to Blueprints. Unlike Blueprints,
>>>> every SecureGraph method requires authorizations and visibilities.
>>>> SecureGraph also supports multivalued properties as well as property
>>>> metadata.
>>>>
>>>>
>>>> On Tue, Jun 24, 2014 at 9:51 AM, Jianshi Huang <jianshi.huang@gmail.com> wrote:
>>>>
>>>>> Wow, so many replies and very educational. Thank you all!
>>>>>
>>>>> I'm working on a graph backend, and I hope the same infrastructure can
>>>>> support:
>>>>>
>>>>> 1) interactive graph exploration and queries
>>>>>
>>>>> Answering what the interactions among N users are from time A to time
>>>>> B, and how users are connected (now and before).
>>>>>
>>>>> 2) real-time (<100ms) feature calculation (aggregation, matching) in a
>>>>> network of accounts
>>>>>
>>>>> Answering questions like: what's the ratio of newly registered
>>>>> accounts in my 'connected' (need flexible definition) network, and how fast
>>>>> does it change; does the network have a path satisfying A(CN) -> B(IT) ->
>>>>> C(US) where the age of the path is less than 3 days; etc.
>>>>>
>>>>> 3) offline simulation of events or offline calculation of new features
>>>>> (used for building models), so I need to take snapshots and also save
>>>>> point-in-time data
>>>>>
>>>>> Having them all-in-one in the same infrastructure will greatly
>>>>> simplify the implementation.
>>>>>
>>>>> BTW, I'm working for PayPal, Risk Data Science. (All questions above
>>>>> are fake and are not related to PayPal :)
>>>>>
>>>>> I made a prototype in the last two weeks for purpose 1), and my feeling
>>>>> about Accumulo is exactly what many of you have said: it just works! Very
>>>>> little admin work, and clean and clear documentation and APIs. One thing I
>>>>> haven't got right is high-speed ingestion; I only got 100K rows/sec/node,
>>>>> but it's already very satisfying. :)
>>>>>
>>>>> BTW, from Mike's slides it seems HBase is much faster in read
>>>>> throughput if the number of columns is small. Any comments? What about
>>>>> latency? Can I cache all data in memory in Accumulo to reduce latency for
>>>>> cold data (say I just restarted my cluster)?
>>>>>
>>>>>
>>>>> Jianshi
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jun 24, 2014 at 10:41 AM, William Slacum <
>>>>> wilhelm.von.cloud@accumulo.net> wrote:
>>>>>
>>>>>> I think first and foremost, how has writing your application been? Is
>>>>>> it something you can easily onboard other people for? Does it seem stable
>>>>>> enough? If you can answer those questions positively, I think you have a
>>>>>> winning situation.
>>>>>>
>>>>>> The big three Hadoop vendors (Cloudera, Hortonworks and MapR) all
>>>>>> provide some level of support for Accumulo, so it has the pedigree of other
>>>>>> members of the Hadoop ecosystem.
>>>>>>
>>>>>> Regarding the performance, I think Mike's presentation needs some
>>>>>> context. He can definitely provide more context than the rest of us (and
>>>>>> possibly Sean or Bill |-|), but I think one thing he was driving home is
>>>>>> that out of the box, Accumulo is configured to run on someone's laptop.
>>>>>> There are adjustments to be made when running at any scale greater than a
>>>>>> dev machine and they may not be documented clearly.
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 23, 2014 at 8:16 PM, Tejinder S Luthra <
>>>>>> tsluthra@us.ibm.com> wrote:
>>>>>>
>>>>>>> Mike did a pretty good presentation on performance comparison
>>>>>>> between Accumulo / HBase. Again, not official IMO, but it is pretty detailed in
>>>>>>> the approach taken and is an apples-to-apples comparison:
>>>>>>> http://www.slideshare.net/AccumuloSummit/10-30-drob
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> From: Jeremy Kepner <kepner@ll.mit.edu>
>>>>>>> To: <user@accumulo.apache.org>
>>>>>>> Date: 06/23/2014 07:42 PM
>>>>>>> Subject: Re: How does Accumulo compare to HBase
>>>>>>> ------------------------------
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Performance is probably the largest difference between Accumulo and
>>>>>>> HBase.
>>>>>>>
>>>>>>> Accumulo can ingest/scan at a rate of 800K entries/sec/node.
>>>>>>> This performance scales well into the hundreds of nodes to deliver
>>>>>>> 100M+ entries/sec.
>>>>>>>
>>>>>>> There are no recent HBase benchmarks and none in the peer-reviewed
>>>>>>> literature.
>>>>>>> Old data suggests that HBase performance is ~1% of Accumulo
>>>>>>> performance.
>>>>>>>
>>>>>>> In short, one can often replace a 20+ node database with
>>>>>>> a single node Accumulo database.
>>>>>>>
>>>>>>> On Tue, Jun 24, 2014 at 01:55:54AM +0800, Jianshi Huang wrote:
>>>>>>> > Er... basically I need to explain to my manager why I'm choosing Accumulo
>>>>>>> > instead of HBase.
>>>>>>> >
>>>>>>> > So what are the pros and cons of Accumulo vs. HBase? (btw HBase 0.98 also
>>>>>>> > got cell-level security, modeled after Accumulo)
>>>>>>> >
>>>>>>> > --
>>>>>>> > Jianshi Huang
>>>>>>> >
>>>>>>> > LinkedIn: jianshi
>>>>>>> > Twitter: @jshuang
>>>>>>> > Github & Blog: http://huangjs.github.com/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jianshi Huang
>>>>>
>>>>> LinkedIn: jianshi
>>>>> Twitter: @jshuang
>>>>> Github & Blog: http://huangjs.github.com/
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
