accumulo-user mailing list archives

From Peter Coetzee <pe...@coetzee.org>
Subject Re: New research using Accumulo: Unified Secure On-/Off-line Analytics
Date Tue, 21 Oct 2014 14:28:54 GMT
Accumulo v1, Accumulo v2, Spark-Accumulo, and Spark-HDFS were implemented
with CRUCIBLE; each is the same CRUCIBLE code, executed against a
different runtime configuration.

- Accumulo v1 represents the pre-optimisation Accumulo Iterator-based runtime.
- Accumulo v2 represents the post-optimisation Accumulo Iterator-based runtime.
- Spark-Accumulo makes use of a Standalone Spark cluster, backed by Accumulo
  on HDFS (uses Spark's hadoopRDD with AccumuloInputFormat).
- Spark-HDFS uses the same Standalone Spark cluster, but operates over
  files in HDFS directly.



On 21 October 2014 15:07, Jeremy Kepner <kepner@ll.mit.edu> wrote:

> So of the six lines on the graph (Accumulo v1, Accumulo v2,
> Spark-Accumulo, Spark-HDFS, Native Accumulo, Native Spark),
> which were implemented with CRUCIBLE?
>
> On Tue, Oct 21, 2014 at 09:23:12AM +0100, Peter Coetzee wrote:
> > Hi Jeremy,
> >
> > If you're viewing the PDF form of the paper (Elsevier's HTML rendering has
> > some odd artefacts), there's a short explanation of the figure appearing
> > just after it:
> >
> > > At higher scales, CRUCIBLE’s Spark-HDFS environment can even be seen to
> > > outperform a native implementation making use of the more expressive
> > > Spark builtins. Performing bulk analysis through the use of Accumulo
> > > Iterators with CRUCIBLE was approximately 10x slower than the equivalent
> > > native implementation; with Spark on HDFS files, this is now almost 1.2x
> > > faster than the native implementation used.
> >
> >
> > The "native" implementations (i.e. hand-written by an engineer using the
> > tools offered by the standard platform) are shown as dashed series on the
> > chart, while the other series represent a single CRUCIBLE topology,
> > compiled once and executed on a collection of runtimes (each of which is
> > discussed in more detail earlier in the paper).
> >
> > By way of clarification: are you curious as to what the figure shows, or
> > why those results are demonstrated?
> >
> > Hope this helps somewhat.
> >
> > Best regards,
> > Peter
> >
> >
> >
> > On 21 October 2014 00:19, Jeremy Kepner <kepner@ll.mit.edu> wrote:
> >
> > > Hi Peter,
> > >   Thanks.  Can you clarify Figure 12 in the paper?  I think I understand
> > > what it is saying, but I am not 100% sure.
> > >
> > > Regards.  -Jeremy
> > >
> > > On Mon, Oct 20, 2014 at 09:00:51AM +0100, Peter Coetzee wrote:
> > > > New open-access research published in the journal of Parallel Computing
> > > > demonstrates a novel approach to engineering analytics for deployment in
> > > > streaming and batch contexts.
> > > >
> > > > Increasing numbers of users are extracting real value from their data
> > > > using tools like IBM InfoSphere Streams for near-real-time analysis and
> > > > Apache Spark across their historical data in Accumulo.
> > > >
> > > > Until now, there hasn't been an approach which permits the use of these
> > > > tools from a single shared codebase, with deployment considerations
> > > > reserved until deployment time. Furthermore, it has been even harder to
> > > > permit this unified analysis while maintaining cell-level traces of the
> > > > security heritage for each datum an analytic produces.
> > > >
> > > > Some highlights of the paper include:
> > > >   - A domain specific language (CRUCIBLE) and runtime models for on- and
> > > >     off-line data analytics.
> > > >   - Detailed analysis of CRUCIBLE’s runtime performance in
> > > >     state-of-the-art environments.
> > > >   - Development and detailed analysis of a set of runtime models for new
> > > >     environments.
> > > >   - Performance comparison with native implementations and discussion of
> > > >     optimisation steps.
> > > >   - Formulation of a primitive in the DSL that permits an analytic to be
> > > >     run over multiple data sources.
> > > >
> > > > The paper, Towards Unified Secure On- and Off-line Analytics at Scale, is
> > > > available free of charge from Elsevier:
> > > >
> > > > http://www.sciencedirect.com/science/article/pii/S0167819114000842
> > > >
> > > >
> > > > I am one of the lead authors of the work, and would be more than happy to
> > > > discuss any aspects which catch your attention!
> > > >
> > > > Peter
> > > >
> > > > --
> > > > Peter Coetzee
> > > > Performance Computing and Visualisation PhD Candidate
> > > > Department of Computer Science
> > > > University of Warwick
> > >
>
