accumulo-user mailing list archives

From Jianshi Huang <jianshi.hu...@gmail.com>
Subject Re: How does Accumulo compare to HBase
Date Tue, 24 Jun 2014 18:02:43 GMT
Ted:

Sorry, wrong number, this one is correct:

+ 10.5B columns
  - 5 CF
  - ~2B CQ

Jianshi


On Wed, Jun 25, 2014 at 2:01 AM, Jianshi Huang <jianshi.huang@gmail.com>
wrote:

> Ted:
>
> +1.5B columns
>   - 5 CF
>   - 300M CQ
>
> Jianshi
>
>
> On Wed, Jun 25, 2014 at 1:50 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>
>> Thanks for the update.
>>
>> In your experiment so far, how many columns were involved ?
>>
>> Cheers
>>
>>
>> On Tue, Jun 24, 2014 at 10:44 AM, Jianshi Huang <jianshi.huang@gmail.com>
>> wrote:
>>
>>> +Update:
>>>
>>> Possibly hundreds of billions of columns.
>>>
>>>
>>> On Wed, Jun 25, 2014 at 12:03 AM, Jianshi Huang <jianshi.huang@gmail.com>
>>> wrote:
>>>
>>>> Hi Ted,
>>>>
>>>> CF: maybe dozens
>>>> Columns: billions (rowkey = nodeId, CF = event type, CQ = Index+eventId)
>>>>
>>>> Make sense?
>>>>
>>>> Jianshi
>>>>
>>>>
>>>> On Tue, Jun 24, 2014 at 10:33 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>
>>>>> Jianshi:
>>>>> How many column families and columns are you expecting (maximum) in
>>>>> your largest table ?
>>>>>
>>>>> Cheers
>>>>>
>>>>>
>>>>> On Tue, Jun 24, 2014 at 7:29 AM, Jianshi Huang <
>>>>> jianshi.huang@gmail.com> wrote:
>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>> I did, it's a wonderful piece of work, and for reviewing facts in a
>>>>>> network it's a great tool. (And Lumify looks really nice.)
>>>>>>
>>>>>> However, my queries are mostly time-bound (from time A to time B),
>>>>>> and to make some queries real-time (< 50ms), I have to roll out my own
>>>>>> schema and index, to denormalize properties and to do aggregations
>>>>>> incrementally. I don't think there's an existing solution among graph
>>>>>> databases that can do these.
>>>>>>
>>>>>> And it's really fun to implement it myself. :)
>>>>>>
>>>>>> Please correct me if I'm wrong
>>>>>>
>>>>>> Jianshi
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 24, 2014 at 10:10 PM, David Medinets <
>>>>>> david.medinets@gmail.com> wrote:
>>>>>>
>>>>>>> Did you get a chance to review http://securegraph.org/? SecureGraph
>>>>>>> is an API to manipulate graphs, similar to Blueprints. Unlike Blueprints,
>>>>>>> every SecureGraph method requires authorizations and visibilities.
>>>>>>> SecureGraph also supports multivalued properties as well as property
>>>>>>> metadata.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 24, 2014 at 9:51 AM, Jianshi Huang <
>>>>>>> jianshi.huang@gmail.com> wrote:
>>>>>>>
>>>>>>>> Wow, so many replies and very educational. Thank you all!
>>>>>>>>
>>>>>>>> I'm working on a graph backend, and I hope the same infrastructure
>>>>>>>> can support:
>>>>>>>>
>>>>>>>> 1) interactive graph exploration and queries
>>>>>>>>
>>>>>>>> Answering what the interactions are among N users from time A to
>>>>>>>> time B, and how users are connected (now and before).
>>>>>>>>
>>>>>>>> 2) real-time (<100ms) feature calculation (aggregation, matching)
>>>>>>>> in a network of accounts
>>>>>>>>
>>>>>>>> Answering questions like: what's the ratio of newly registered
>>>>>>>> accounts in my 'connected' (need flexible definition) network, and how
>>>>>>>> fast does it change; does the network have a path satisfying
>>>>>>>> A(CN) -> B(IT) -> C(US) where the age of the path is less than 3 days; etc.
>>>>>>>>
>>>>>>>> 3) offline simulation of events or offline calculation of new
>>>>>>>> features (used for building models), so I need to take snapshots and
>>>>>>>> also save point-in-time data
>>>>>>>>
>>>>>>>> Having them all-in-one in the same infrastructure will greatly
>>>>>>>> simplify the implementation.
>>>>>>>>
>>>>>>>> BTW, I'm working for PayPal, Risk Data Science. (All questions
>>>>>>>> above are fake and are not related to PayPal :)
>>>>>>>>
>>>>>>>> I made a prototype in the last two weeks for purpose 1) and my
>>>>>>>> feeling about Accumulo is exactly what many of you have said: it just
>>>>>>>> works! Very little admin work, clean and clear documentation and APIs.
>>>>>>>> One thing I haven't got right is high-speed ingestion: I only got 100K
>>>>>>>> rows/sec/node, but it's already very satisfying. :)
>>>>>>>>
>>>>>>>> BTW, from Mike's slides it seems HBase is much faster in read
>>>>>>>> throughput if the number of columns is small. Any comments? What about
>>>>>>>> latency? Can I cache all data in memory in Accumulo to reduce latency
>>>>>>>> for cold data (say I just restarted my cluster)?
>>>>>>>>
>>>>>>>>
>>>>>>>> Jianshi
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jun 24, 2014 at 10:41 AM, William Slacum <
>>>>>>>> wilhelm.von.cloud@accumulo.net> wrote:
>>>>>>>>
>>>>>>>>> I think first and foremost, how has writing your application been?
>>>>>>>>> Is it something you can easily onboard other people for? Does it seem
>>>>>>>>> stable enough? If you can answer those questions positively, I think
>>>>>>>>> you have a winning situation.
>>>>>>>>>
>>>>>>>>> The big three Hadoop vendors (Cloudera, Hortonworks and MapR) all
>>>>>>>>> provide some level of support for Accumulo, so it has the pedigree of
>>>>>>>>> other members of the Hadoop ecosystem.
>>>>>>>>>
>>>>>>>>> Regarding the performance, I think Mike's presentation needs some
>>>>>>>>> context. He can definitely provide more context than the rest of us
>>>>>>>>> (and possibly Sean or Bill |-|), but I think one thing he was driving
>>>>>>>>> home is that out of the box, Accumulo is configured to run on someone's
>>>>>>>>> laptop. There are adjustments to be made when running at any scale
>>>>>>>>> greater than a dev machine and they may not be documented clearly.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jun 23, 2014 at 8:16 PM, Tejinder S Luthra <
>>>>>>>>> tsluthra@us.ibm.com> wrote:
>>>>>>>>>
>>>>>>>>>> Mike did a pretty good presentation on performance comparison
>>>>>>>>>> between Accumulo / HBase. Again, not official IMO, but it is pretty
>>>>>>>>>> detailed in the approach taken and is an apples-to-apples comparison:
>>>>>>>>>> http://www.slideshare.net/AccumuloSummit/10-30-drob
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> From: Jeremy Kepner <kepner@ll.mit.edu>
>>>>>>>>>> To: <user@accumulo.apache.org>
>>>>>>>>>> Date: 06/23/2014 07:42 PM
>>>>>>>>>> Subject: Re: How does Accumulo compare to HBase
>>>>>>>>>> ------------------------------
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Performance is probably the largest difference between Accumulo
>>>>>>>>>> and HBase.
>>>>>>>>>>
>>>>>>>>>> Accumulo can ingest/scan at a rate of 800K entries/sec/node.
>>>>>>>>>> This performance scales well into the hundreds of nodes to deliver
>>>>>>>>>> 100M+ entries/sec.
>>>>>>>>>>
>>>>>>>>>> There are no recent HBase benchmarks and none in the
>>>>>>>>>> peer-reviewed literature.
>>>>>>>>>> Old data suggests that HBase performance is ~1% of Accumulo
>>>>>>>>>> performance.
>>>>>>>>>>
>>>>>>>>>> In short, one can often replace a 20+ node database with
>>>>>>>>>> a single node Accumulo database.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 24, 2014 at 01:55:54AM +0800, Jianshi Huang wrote:
>>>>>>>>>> > Er... basically I need to explain to my manager why I'm choosing
>>>>>>>>>> > Accumulo instead of HBase.
>>>>>>>>>> >
>>>>>>>>>> > So what are the pros and cons of Accumulo vs. HBase? (btw HBase 0.98
>>>>>>>>>> > also got cell-level security, modeled after Accumulo)
>>>>>>>>>> >
>>>>>>>>>> > --
>>>>>>>>>> > Jianshi Huang
>>>>>>>>>> >
>>>>>>>>>> > LinkedIn: jianshi
>>>>>>>>>> > Twitter: @jshuang
>>>>>>>>>> > Github & Blog: http://huangjs.github.com/



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
