accumulo-user mailing list archives

From Peter Coetzee <pe...@coetzee.org>
Subject Re: New research using Accumulo: Unified Secure On-/Off-line Analytics
Date Tue, 21 Oct 2014 14:50:12 GMT
That's correct, yes (and hopefully the body text agrees with your reading
of the data?). The 10x slowdown is one of the reasons I suggest that
complex networks of Iterators are probably not a sound approach to
implementing CRUCIBLE-style analytics, although it can be made to work.
This motivated the implementation of the Spark runtime for CRUCIBLE, which
gives a couple of orders of magnitude better performance (the gap between
Accumulo v1 and Spark-Accumulo is around 480x, I believe).
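
As a rough sanity check on those figures, the ratios quoted in this thread can be turned into orders of magnitude with a few lines of Python. This is only an illustrative sketch: the function names are made up here, and the only inputs are the 10x and 480x factors mentioned above, not measurements taken from the paper.

```python
import math


def slowdown(time_a_s: float, time_b_s: float) -> float:
    """Return how many times slower run A is than run B.

    Both arguments are execution times in seconds, so a larger
    value on an "Execution Time (s)" axis means a slower run.
    """
    if time_a_s <= 0 or time_b_s <= 0:
        raise ValueError("execution times must be positive")
    return time_a_s / time_b_s


def orders_of_magnitude(factor: float) -> float:
    """Return log10 of a speed ratio, i.e. how many powers of ten it spans."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return math.log10(factor)


# Ratios as quoted in the thread (assumed approximate).
ACCUMULO_V2_VS_NATIVE_ACCUMULO = 10.0   # "10x slowdown"
ACCUMULO_V1_VS_SPARK_ACCUMULO = 480.0   # "around 480x"

print(round(orders_of_magnitude(ACCUMULO_V2_VS_NATIVE_ACCUMULO), 2))  # 1.0
print(round(orders_of_magnitude(ACCUMULO_V1_VS_SPARK_ACCUMULO), 2))   # 2.68
```

A 480x gap therefore sits between two and three orders of magnitude. The two ratios are measured against different baselines, so they should not be multiplied together.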

There's something to be said for using the right tool for the job :)

All the best,
Peter


On 21 October 2014 15:28, Jeremy Kepner <kepner@ll.mit.edu> wrote:

> Hi Peter,
>   So the Y axis is labeled "Execution Time (s)" which would imply
> "Accumulo v2" using CRUCIBLE is 10 times slower than the "Native Accumulo"
> which doesn't use CRUCIBLE.  Is this correct?
>
> Regards.  -Jeremy
>
> On Tue, Oct 21, 2014 at 03:28:54PM +0100, Peter Coetzee wrote:
> > Accumulo v1, Accumulo v2, Spark-Accumulo, and Spark-HDFS were implemented
> > with CRUCIBLE, each being the same CRUCIBLE code, but executed against a
> > different runtime configuration.
> >
> >   - Accumulo v1 represents the pre-optimisation Accumulo Iterator based
> >     runtime.
> >   - Accumulo v2 represents the post-optimisation Accumulo Iterator based
> >     runtime.
> >   - Spark-Accumulo makes use of a Standalone Spark cluster, backed by
> >     Accumulo on HDFS (uses Spark's hadoopRDD with AccumuloInputFormat).
> >   - Spark-HDFS uses the same Standalone Spark cluster, but is operating
> >     over files in HDFS directly.
> >
> >
> >
> > On 21 October 2014 15:07, Jeremy Kepner <kepner@ll.mit.edu> wrote:
> >
> > > So, of the six lines on the graph (Accumulo v1, Accumulo v2,
> > > Spark-Accumulo, Spark-HDFS, Native Accumulo, Native Spark),
> > > which were implemented with CRUCIBLE?
> > >
> > > On Tue, Oct 21, 2014 at 09:23:12AM +0100, Peter Coetzee wrote:
> > > > Hi Jeremy,
> > > >
> > > > If you're viewing the PDF form of the paper (Elsevier's HTML rendering
> > > > has some odd artefacts), there's a short explanation of the figure
> > > > appearing just after it:
> > > >
> > > > > At higher scales, CRUCIBLE’s Spark-HDFS environment can even be seen
> > > > > to outperform a native implementation making use of the more
> > > > > expressive Spark builtins. Performing bulk analysis through the use
> > > > > of Accumulo Iterators with CRUCIBLE was approximately 10x slower than
> > > > > the equivalent native implementation; with Spark on HDFS files, this
> > > > > is now almost 1.2x faster than the native implementation used.
> > > >
> > > >
> > > > The "native" implementations (i.e. hand-written by an engineer using
> > > > the tools offered by the standard platform) are shown as dashed series
> > > > on the chart, while the other series represent a single CRUCIBLE
> > > > topology, compiled once and executed on a collection of runtimes (each
> > > > of which is discussed in more detail earlier in the paper).
> > > >
> > > > By way of clarification: are you curious as to what the figure shows,
> > > > or why those results are demonstrated?
> > > >
> > > > Hope this helps somewhat.
> > > >
> > > > Best regards,
> > > > Peter
> > > >
> > > >
> > > >
> > > > On 21 October 2014 00:19, Jeremy Kepner <kepner@ll.mit.edu> wrote:
> > > >
> > > > > Hi Peter,
> > > > > >   Thanks.  Can you clarify Figure 12 in the paper?  I think I
> > > > > > understand what it is saying, but I am not 100% sure.
> > > > >
> > > > > Regards.  -Jeremy
> > > > >
> > > > > On Mon, Oct 20, 2014 at 09:00:51AM +0100, Peter Coetzee wrote:
> > > > > > > New open-access research published in the journal of Parallel
> > > > > > > Computing demonstrates a novel approach to engineering analytics
> > > > > > > for deployment in streaming and batch contexts.
> > > > > >
> > > > > > > Increasing numbers of users are extracting real value from their
> > > > > > > data using tools like IBM InfoSphere Streams for near-real-time
> > > > > > > analysis and Apache Spark across their historical data in
> > > > > > > Accumulo.
> > > > > >
> > > > > > > Until now, there hasn't been an approach which permits the use of
> > > > > > > these tools from a single shared codebase, with deployment
> > > > > > > considerations reserved until deployment time. Furthermore, it
> > > > > > > has been even harder to permit this unified analysis while
> > > > > > > maintaining cell-level traces of the security heritage for each
> > > > > > > datum an analytic produces.
> > > > > >
> > > > > > Some highlights of the paper include:
> > > > > > >   - A domain specific language (CRUCIBLE) and runtime models for
> > > > > > >     on- and off-line data analytics.
> > > > > > >   - Detailed analysis of CRUCIBLE’s runtime performance in
> > > > > > >     state-of-the-art environments.
> > > > > > >   - Development and detailed analysis of a set of runtime models
> > > > > > >     for new environments.
> > > > > > >   - Performance comparison with native implementations and
> > > > > > >     discussion of optimisation steps.
> > > > > > >   - Formulation of a primitive in the DSL that permits an analytic
> > > > > > >     to be run over multiple data sources.
> > > > > >
> > > > > > > The paper, Towards Unified Secure On- and Off-line Analytics at
> > > > > > > Scale, is available free of charge from Elsevier:
> > > > > > >
> > > > > > > http://www.sciencedirect.com/science/article/pii/S0167819114000842
> > > > > >
> > > > > >
> > > > > > > I am one of the lead authors of the work, and would be more than
> > > > > > > happy to discuss any aspects which catch your attention!
> > > > > >
> > > > > > Peter
> > > > > >
> > > > > > --
> > > > > > Peter Coetzee
> > > > > > Performance Computing and Visualisation PhD Candidate
> > > > > > Department of Computer Science
> > > > > > University of Warwick
> > > > >
> > >
>
