accumulo-user mailing list archives

From Jeremy Kepner <kep...@ll.mit.edu>
Subject Re: New research using Accumulo: Unified Secure On-/Off-line Analytics
Date Tue, 21 Oct 2014 14:28:06 GMT
Hi Peter,
  So the Y axis is labeled "Execution Time (s)" which would imply
"Accumulo v2" using CRUCIBLE is 10 times slower than the "Native Accumulo"
which doesn't use CRUCIBLE.  Is this correct?

Regards.  -Jeremy

On Tue, Oct 21, 2014 at 03:28:54PM +0100, Peter Coetzee wrote:
> Accumulo v1, Accumulo v2, Spark-Accumulo, Spark-HDFS were implemented with
> CRUCIBLE, each being the same CRUCIBLE code, but executed against a
> different runtime configuration.
> 
> Accumulo v1 represents the pre-optimisation Accumulo Iterator based runtime
> Accumulo v2 represents the post-optimisation Accumulo Iterator based runtime
> Spark-Accumulo makes use of a Standalone Spark cluster, backed by Accumulo
> on HDFS (uses Spark's hadoopRDD with AccumuloInputFormat)
> Spark-HDFS uses the same Standalone Spark cluster, but is operating over
> files in HDFS directly
> 
> 
> 
> On 21 October 2014 15:07, Jeremy Kepner <kepner@ll.mit.edu> wrote:
> 
> > So of the six lines on the graph:  Accumulo v1, Accumulo v2,
> > Spark-Accumulo, Spark-HDFS, Native Accumulo, Native Spark
> > which were implemented with CRUCIBLE?
> >
> > On Tue, Oct 21, 2014 at 09:23:12AM +0100, Peter Coetzee wrote:
> > > Hi Jeremy,
> > >
> > > > If you're viewing the PDF form of the paper (Elsevier's HTML rendering has
> > > > some odd artefacts), there's a short explanation of the figure appearing
> > > > just after it:
> > >
> > > > > At higher scales, CRUCIBLE’s Spark-HDFS environment can even be seen to
> > > > > outperform a native implementation making use of the more expressive Spark
> > > > > builtins. Performing bulk analysis through the use of Accumulo Iterators
> > > > > with CRUCIBLE was approximately 10x slower than the equivalent native
> > > > > implementation; with Spark on HDFS files, this is now almost 1.2x faster
> > > > > than the native implementation used.
> > >
> > >
> > > The "native" implementations (i.e. hand-written by an engineer using the
> > > tools offered by the standard platform) are shown as dashed series on the
> > > chart, while the other series represent a single CRUCIBLE topology,
> > > > compiled once and executed on a collection of runtimes (each of which is
> > > discussed in more detail earlier in the paper).
> > >
> > > > By way of clarification, are you curious as to what the figure shows, or
> > > why those results are demonstrated?
> > >
> > > Hope this helps somewhat.
> > >
> > > Best regards,
> > > Peter
> > >
> > >
> > >
> > > On 21 October 2014 00:19, Jeremy Kepner <kepner@ll.mit.edu> wrote:
> > >
> > > > Hi Peter,
> > > > >   Thanks.  Can you clarify Figure 12 in the paper?  I think I understand
> > > > > what it is saying, but I am not 100% sure.
> > > >
> > > > Regards.  -Jeremy
> > > >
> > > > On Mon, Oct 20, 2014 at 09:00:51AM +0100, Peter Coetzee wrote:
> > > > > > New open-access research published in the journal of Parallel Computing
> > > > > > demonstrates a novel approach to engineering analytics for deployment in
> > > > > > streaming and batch contexts.
> > > > >
> > > > > > Increasing numbers of users are extracting real value from their data
> > > > > > using tools like IBM InfoSphere Streams for near-real-time analysis and
> > > > > > Apache Spark across their historical data in Accumulo.
> > > > >
> > > > > > Until now, there hasn't been an approach which permits the use of these
> > > > > > tools from a single shared codebase, with deployment considerations
> > > > > > reserved until deployment time. Furthermore, it has been even harder to
> > > > > > permit this unified analysis while maintaining cell-level traces of the
> > > > > > security heritage for each datum an analytic produces.
> > > > >
> > > > > Some highlights of the paper include:
> > > > > >   - A domain specific language (CRUCIBLE) and runtime models for on- and
> > > > > > off-line data analytics.
> > > > > >   - Detailed analysis of CRUCIBLE’s runtime performance in state-of-the-art
> > > > > > environments.
> > > > > >   - Development and detailed analysis of a set of runtime models for new
> > > > > > environments.
> > > > > >   - Performance comparison with native implementations and discussion of
> > > > > > optimisation steps.
> > > > > >   - Formulation of a primitive in the DSL that permits an analytic to be
> > > > > > run over multiple data sources.
> > > > >
> > > > > > The paper, Towards Unified Secure On- and Off-line Analytics at Scale, is
> > > > > > available free of charge from Elsevier:
> > > > >
> > > > > http://www.sciencedirect.com/science/article/pii/S0167819114000842
> > > > >
> > > > >
> > > > > > I am one of the lead authors of the work, and would be more than happy to
> > > > > > discuss any aspects which catch your attention!
> > > > >
> > > > > Peter
> > > > >
> > > > > --
> > > > > Peter Coetzee
> > > > > Performance Computing and Visualisation PhD Candidate
> > > > > Department of Computer Science
> > > > > University of Warwick
> > > >
> >
