accumulo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Christopher <ctubb...@apache.org>
Subject Re: Apache Accumulo integrated with Presto
Date Mon, 13 Jun 2016 11:20:50 GMT
Thanks for that summary, Dylan! Very helpful.

On Mon, Jun 13, 2016, 01:36 Dylan Hutchison <dhutchis@cs.washington.edu>
wrote:

> Thanks for sharing Sean.  Here are some notes I wrote after reading the
> article on Presto-Accumulo design.  I have a research interest in the
> relationship between relational (SQL) and non-relational (Accumulo)
> systems, so I couldn't resist reading the post in detail.
>
>    - Places the primary key in the Accumulo row.
>    - Performs row-at-a-time processing (each tuple is one row in
>    Accumulo) using WholeRowIterator behavior.
>    - Relational table metadata is stored in the Presto infrastructure (as
>    opposed to an Accumulo table).
>    - Supports the creation of index tables for any attributes. These
>    index tables speed up queries that filter on indexed attributes.  It is
>    standard secondary indexing, which provides speedups when the selectivity
>    of the query is roughly <10% of the original table.
>    - Only database->client querying is supported.  You cannot run "select
>    ... into result_table".
>    - As far as I can see, Presto only has one join strategy: *broadcast
>    join*.  The right table of every join is scanned into one of the
>    Presto worker's memory.  Subsequently the size of the right table is
>    limited by worker memory.
>    - There is one Presto worker for each Accumulo tablet, which enables
>    good scaling.
>    - The Presto bridge classes track internal Accumulo information such
>    as the assignment of tablets to tablet servers by reading Accumulo's
>    Metadata table. Presto uses tablet locations to provide better locality.
>    - The Presto bridge comes with several Accumulo server-side iterators
>    for filtering and aggregating.
>    - The code is quite nice and clean.
>
> This image below gives Presto's architecture.  Accumulo takes the role of
> the DB icon in the bottom-right corner.
>
> [image: Inline image 2]
>
> Bloomberg ran 13 out of the 22 TPC-H queries.  There is no fundamental
> reason why they cannot run all the queries; they just have not implemented
> everything required ('exists' clauses, non-equi join, etc.).
>
> The interface looks like this, though they use a compiled java jar to
> insert entries from a csv file (it wraps around a BatchWriter).
>
> [image: Inline image 3]
>
> Here are performance results.  They don't say what hardware or data sizes
> they use.  Whatever it is, they must have the ability to fit the smaller
> table of any join into memory as a result of Presto's broadcast join
> strategy.  The strong scaling looks very nice.
>
> [image: Inline image 4]
>
> They have one other plot that shows how secondary indexing speeds up some
> queries with low selectivity.
>
> Cheers, Dylan
>
>
>
> On Sun, Jun 12, 2016 at 7:06 PM, Sean Busbey <busbey@cloudera.com> wrote:
>
>> Bloomberg have a post about a connector they made to query Accumulo from
>> Presto:
>>
>>
>> http://www.bloomberg.com/company/announcements/open-source-at-bloomberg-reducing-application-development-time-via-presto-accumulo/
>>
>> --
>> Sean Busbey
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message