accumulo-user mailing list archives

From Jianshi Huang <jianshi.hu...@gmail.com>
Subject Re: How does Accumulo compare to HBase
Date Tue, 24 Jun 2014 18:01:28 GMT
Ted:

+1.5B columns
  - 5 CF
  - 300M CQ

Jianshi
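
The key layout mentioned further down this thread (rowkey = nodeId, CF = event type, CQ = Index+eventId) can be sketched as follows. This is only a minimal illustration, not the actual schema: it assumes that "Index" is a zero-padded event timestamp, and it uses a sorted Python list as a stand-in for Accumulo's key space, which is sorted lexicographically by row, column family and column qualifier. The names make_key and scan_time_range are hypothetical helpers and are not part of any Accumulo API.

```python
from bisect import bisect_left

# Hypothetical key layout, following the thread's description:
#   row = nodeId, column family = event type,
#   column qualifier = zero-padded timestamp + ":" + eventId
# Treating "Index" as a timestamp prefix is an assumption made for this sketch.

def make_key(node_id, event_type, ts_millis, event_id):
    """Build a (row, cf, cq) tuple whose sort order follows event time."""
    cq = f"{ts_millis:013d}:{event_id}"
    return (node_id, event_type, cq)

def scan_time_range(sorted_keys, node_id, event_type, start_ms, end_ms):
    """Return keys for one node and event type with start_ms <= ts < end_ms."""
    lo = bisect_left(sorted_keys, (node_id, event_type, f"{start_ms:013d}"))
    hi = bisect_left(sorted_keys, (node_id, event_type, f"{end_ms:013d}"))
    return sorted_keys[lo:hi]

# A sorted list stands in for the table's sorted key space.
keys = sorted([
    make_key("n1", "login", 1000, "e1"),
    make_key("n1", "login", 2000, "e2"),
    make_key("n1", "login", 3000, "e3"),
    make_key("n1", "payment", 1500, "e4"),
    make_key("n2", "login", 2500, "e5"),
])

# Time-bound query: login events on node n1 from time 1500 up to (not including) 3000.
hits = scan_time_range(keys, "n1", "login", 1500, 3000)
```

Because the timestamp is zero-padded to a fixed width, string order matches numeric order, so a "from time A to time B" query becomes one contiguous slice of the sorted keys instead of a filter over the whole row.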


On Wed, Jun 25, 2014 at 1:50 AM, Ted Yu <yuzhihong@gmail.com> wrote:

> Thanks for the update.
>
> In your experiment so far, how many columns were involved ?
>
> Cheers
>
>
> On Tue, Jun 24, 2014 at 10:44 AM, Jianshi Huang <jianshi.huang@gmail.com>
> wrote:
>
>> +Update:
>>
>> Possibly hundreds of billions of columns.
>>
>>
>> On Wed, Jun 25, 2014 at 12:03 AM, Jianshi Huang <jianshi.huang@gmail.com>
>> wrote:
>>
>>> Hi Ted,
>>>
>>> CF: maybe dozens
>>> Columns: billions (rowkey = nodeId, CF = event type, CQ = Index+eventId)
>>>
>>> Make sense?
>>>
>>> Jianshi
>>>
>>>
>>> On Tue, Jun 24, 2014 at 10:33 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>
>>>> Jianshi:
>>>> How many column families and columns are you expecting (maximum) in
>>>> your largest table ?
>>>>
>>>> Cheers
>>>>
>>>>
>>>> On Tue, Jun 24, 2014 at 7:29 AM, Jianshi Huang <jianshi.huang@gmail.com
>>>> > wrote:
>>>>
>>>>> Hi David,
>>>>>
>>>>> I did, it's a wonderful piece of work and for reviewing facts in a
>>>>> network it's a great tool. (And Lumify looks really nice)
>>>>>
>>>>> However, my queries are mostly time-bound (from time A to time B), and
>>>>> to make some queries real-time (< 50ms), I have to roll my own schema and
>>>>> index, to denormalize properties and to do aggregations incrementally. I
>>>>> don't think any existing graph database solution can do these.
>>>>>
>>>>> And it's really fun to implement it myself. :)
>>>>>
>>>>> Please correct me if I'm wrong
>>>>>
>>>>> Jianshi
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jun 24, 2014 at 10:10 PM, David Medinets <
>>>>> david.medinets@gmail.com> wrote:
>>>>>
>>>>>> Did you get a chance to review http://securegraph.org/? SecureGraph
>>>>>> is an API to manipulate graphs, similar to Blueprints. Unlike Blueprints,
>>>>>> every SecureGraph method requires authorizations and visibilities.
>>>>>> SecureGraph also supports multivalued properties as well as property
>>>>>> metadata.
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 24, 2014 at 9:51 AM, Jianshi Huang <
>>>>>> jianshi.huang@gmail.com> wrote:
>>>>>>
>>>>>>> Wow, so many replies and very educational. Thank you all!
>>>>>>>
>>>>>>> I'm working on a graph backend, and I hope the same infrastructure
>>>>>>> can support:
>>>>>>>
>>>>>>> 1) interactive graph exploration and queries
>>>>>>>
>>>>>>> Answering what are the interactions among N users from time A to
>>>>>>> time B, and how are users connected (now and before).
>>>>>>>
>>>>>>> 2) real-time (<100ms) feature calculation (aggregation, matching) in
>>>>>>> a network of accounts
>>>>>>>
>>>>>>> Answering questions like: what's the ratio of newly registered
>>>>>>> accounts in my 'connected' (need flexible definition) network, and how
>>>>>>> fast does it change; does the network have a path satisfying A(CN) ->
>>>>>>> B(IT) -> C(US) where the age of the path is less than 3 days; etc.
>>>>>>>
>>>>>>> 3) offline simulation of events or offline calculation of new
>>>>>>> features (used for building models), so I need to take snapshots and
>>>>>>> also save point-in-time data
>>>>>>>
>>>>>>> Having them all-in-one in the same infrastructure will greatly
>>>>>>> simplify the implementation.
>>>>>>>
>>>>>>> BTW, I'm working for PayPal, Risk Data Science. (All questions above
>>>>>>> are fake and are not related to PayPal :)
>>>>>>>
>>>>>>> I made a prototype in the last two weeks for purpose 1) and my
>>>>>>> feeling about Accumulo is exactly what many of you have said: it just
>>>>>>> works! Very little admin work, clean and clear documentation and APIs.
>>>>>>> One thing I haven't got right is high-speed ingestion: I only got 100K
>>>>>>> rows/sec/node, but it's already very satisfying. :)
>>>>>>>
>>>>>>> BTW, from Mike's slides it seems HBase is much faster in read
>>>>>>> throughput if the number of columns is small. Any comments? What about
>>>>>>> latency? Can I cache all data in memory in Accumulo to reduce latency
>>>>>>> for cold data (say I just restarted my cluster)?
>>>>>>>
>>>>>>>
>>>>>>> Jianshi
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 24, 2014 at 10:41 AM, William Slacum <
>>>>>>> wilhelm.von.cloud@accumulo.net> wrote:
>>>>>>>
>>>>>>>> I think first and foremost, how has writing your application been?
>>>>>>>> Is it something you can easily onboard other people for? Does it seem
>>>>>>>> stable enough? If you can answer those questions positively, I think
>>>>>>>> you have a winning situation.
>>>>>>>>
>>>>>>>> The big three Hadoop vendors (Cloudera, Hortonworks and MapR) all
>>>>>>>> provide some level of support for Accumulo, so it has the pedigree of
>>>>>>>> other members of the Hadoop ecosystem.
>>>>>>>>
>>>>>>>> Regarding the performance, I think Mike's presentation needs some
>>>>>>>> context. He can definitely provide more context than the rest of us
>>>>>>>> (and possibly Sean or Bill |-|), but I think one thing he was driving
>>>>>>>> home is that out of the box, Accumulo is configured to run on
>>>>>>>> someone's laptop. There are adjustments to be made when running at any
>>>>>>>> scale greater than a dev machine and they may not be documented
>>>>>>>> clearly.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jun 23, 2014 at 8:16 PM, Tejinder S Luthra <
>>>>>>>> tsluthra@us.ibm.com> wrote:
>>>>>>>>
>>>>>>>>> Mike did a pretty good presentation on performance comparison
>>>>>>>>> between Accumulo / HBase. Again not official IMO but it is pretty
>>>>>>>>> detailed in the approach taken and is an apples-to-apples comparison:
>>>>>>>>> http://www.slideshare.net/AccumuloSummit/10-30-drob
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> From: Jeremy Kepner <kepner@ll.mit.edu>
>>>>>>>>> To: <user@accumulo.apache.org>
>>>>>>>>> Date: 06/23/2014 07:42 PM
>>>>>>>>> Subject: Re: How does Accumulo compare to HBase
>>>>>>>>> ------------------------------
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Performance is probably the largest difference between Accumulo
>>>>>>>>> and HBase.
>>>>>>>>>
>>>>>>>>> Accumulo can ingest/scan at a rate of 800K entries/sec/node.
>>>>>>>>> This performance scales well into the hundreds of nodes to deliver
>>>>>>>>> 100M+ entries/sec.
>>>>>>>>>
>>>>>>>>> There are no recent HBase benchmarks and none in the peer-reviewed
>>>>>>>>> literature.
>>>>>>>>> Old data suggests that HBase performance is ~1% of Accumulo
>>>>>>>>> performance.
>>>>>>>>>
>>>>>>>>> In short, one can often replace a 20+ node database with
>>>>>>>>> a single node Accumulo database.
>>>>>>>>>
>>>>>>>>> On Tue, Jun 24, 2014 at 01:55:54AM +0800, Jianshi Huang wrote:
>>>>>>>>> > Er... basically I need to explain to my manager why I'm choosing
>>>>>>>>> > Accumulo instead of HBase.
>>>>>>>>> >
>>>>>>>>> > So what are the pros and cons of Accumulo vs. HBase? (btw HBase
>>>>>>>>> > 0.98 also got cell-level security, modeled after Accumulo)
>>>>>>>>> >
>>>>>>>>> > --
>>>>>>>>> > Jianshi Huang
>>>>>>>>> >
>>>>>>>>> > LinkedIn: jianshi
>>>>>>>>> > Twitter: @jshuang
>>>>>>>>> > Github & Blog: http://huangjs.github.com/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jianshi Huang
>>>>>>>
>>>>>>> LinkedIn: jianshi
>>>>>>> Twitter: @jshuang
>>>>>>> Github & Blog: http://huangjs.github.com/
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jianshi Huang
>>>>>
>>>>> LinkedIn: jianshi
>>>>> Twitter: @jshuang
>>>>> Github & Blog: http://huangjs.github.com/
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
