accumulo-user mailing list archives

From Peter Coetzee <pe...@coetzee.org>
Subject Re: New research using Accumulo: Unified Secure On-/Off-line Analytics
Date Tue, 21 Oct 2014 08:23:12 GMT
Hi Jeremy,

If you're viewing the PDF form of the paper (Elsevier's HTML rendering has
some odd artefacts), there's a short explanation of the figure appearing
just after it:

> At higher scales, CRUCIBLE’s Spark-HDFS environment can even be seen to
> outperform a native implementation making use of the more expressive Spark
> builtins. Performing bulk analysis through the use of Accumulo Iterators
> with CRUCIBLE was approximately 10x slower than the equivalent native
> implementation; with Spark on HDFS files, this is now almost 1.2x faster
> than the native implementation used.


The "native" implementations (i.e. hand-written by an engineer using the
tools offered by the standard platform) are shown as dashed series on the
chart, while the other series represent a single CRUCIBLE topology,
compiled once and executed on a collection of runtimes (each of which is
discussed in more detail earlier in the paper).

By way of clarification: are you curious as to what the figure shows, or
why those results are demonstrated?

Hope this helps somewhat.

Best regards,
Peter



On 21 October 2014 00:19, Jeremy Kepner <kepner@ll.mit.edu> wrote:

> Hi Peter,
>   Thanks.  Can you clarify Figure 12 in the paper?  I think I understand
> what it is saying, but I am not 100% sure.
>
> Regards.  -Jeremy
>
> On Mon, Oct 20, 2014 at 09:00:51AM +0100, Peter Coetzee wrote:
> > New open-access research published in the journal of Parallel Computing
> > demonstrates a novel approach to engineering analytics for deployment in
> > streaming and batch contexts.
> >
> > Increasing numbers of users are extracting real value from their data using
> > tools like IBM InfoSphere Streams for near-real-time analysis and Apache
> > Spark across their historical data in Accumulo.
> >
> > Until now, there hasn't been an approach which permits the use of these
> > tools from a single shared codebase, with deployment considerations
> > reserved until deployment time. Furthermore, it has been even harder to
> > permit this unified analysis while maintaining cell-level traces of the
> > security heritage for each datum an analytic produces.
> >
> > Some highlights of the paper include:
> >   - A domain specific language (CRUCIBLE) and runtime models for on- and
> > off-line data analytics.
> >   - Detailed analysis of CRUCIBLE’s runtime performance in state-of-the-art
> > environments.
> >   - Development and detailed analysis of a set of runtime models for new
> > environments.
> >   - Performance comparison with native implementations and discussion of
> > optimisation steps.
> >   - Formulation of a primitive in the DSL that permits an analytic to be
> > run over multiple data sources.
> >
> > The paper, Towards Unified Secure On- and Off-line Analytics at Scale, is
> > available free of charge from Elsevier:
> >
> > http://www.sciencedirect.com/science/article/pii/S0167819114000842
> >
> >
> > I am one of the lead authors of the work, and would be more than happy to
> > discuss any aspects which catch your attention!
> >
> > Peter
> >
> > --
> > Peter Coetzee
> > Performance Computing and Visualisation PhD Candidate
> > Department of Computer Science
> > University of Warwick
>
