accumulo-user mailing list archives

From Jeremy Kepner <kep...@ll.mit.edu>
Subject Re: New research using Accumulo: Unified Secure On-/Off-line Analytics
Date Tue, 21 Oct 2014 14:07:12 GMT
So of the six lines on the graph (Accumulo v1, Accumulo v2, Spark-Accumulo, Spark-HDFS,
Native Accumulo, Native Spark), which were implemented with CRUCIBLE?

On Tue, Oct 21, 2014 at 09:23:12AM +0100, Peter Coetzee wrote:
> Hi Jeremy,
> 
> If you're viewing the PDF form of the paper (Elsevier's HTML rendering has
> some odd artefacts), there's a short explanation of the figure appearing
> just after it:
> 
> > At higher scales, CRUCIBLE’s Spark-HDFS environment can even be seen to
> > outperform a native implementation making use of the more expressive Spark
> > builtins. Performing bulk analysis through the use of Accumulo Iterators
> > with CRUCIBLE was approximately 10x slower than the equivalent native
> > implementation; with Spark on HDFS files, this is now almost 1.2x faster
> > than the native implementation used.
> 
> 
> The "native" implementations (i.e. hand-written by an engineer using the
> tools offered by the standard platform) are shown as dashed series on the
> chart, while the other series represent a single CRUCIBLE topology,
> compiled once and executed on a collection of runtimes (each of which is
> discussed in more detail earlier in the paper).
> 
> By way of clarification: are you curious as to what the figure shows, or
> why those results are demonstrated?
> 
> Hope this helps somewhat.
> 
> Best regards,
> Peter
> 
> 
> 
> On 21 October 2014 00:19, Jeremy Kepner <kepner@ll.mit.edu> wrote:
> 
> > Hi Peter,
> >   Thanks.  Can you clarify Figure 12 in the paper?  I think I understand
> > what it is saying, but I am not 100% sure.
> >
> > Regards.  -Jeremy
> >
> > On Mon, Oct 20, 2014 at 09:00:51AM +0100, Peter Coetzee wrote:
> > > New open-access research published in the journal Parallel Computing
> > > demonstrates a novel approach to engineering analytics for deployment in
> > > streaming and batch contexts.
> > >
> > > Increasing numbers of users are extracting real value from their data using
> > > tools like IBM InfoSphere Streams for near-real-time analysis and Apache
> > > Spark across their historical data in Accumulo.
> > >
> > > Until now, there hasn't been an approach which permits the use of these
> > > tools from a single shared codebase, with deployment considerations
> > > reserved until deployment time. Furthermore, it has been even harder to
> > > permit this unified analysis while maintaining cell-level traces of the
> > > security heritage for each datum an analytic produces.
> > >
> > > Some highlights of the paper include:
> > >   - A domain specific language (CRUCIBLE) and runtime models for on- and
> > > off-line data analytics.
> > >   - Detailed analysis of CRUCIBLE’s runtime performance in state-of-the-art
> > > environments.
> > >   - Development and detailed analysis of a set of runtime models for new
> > > environments.
> > >   - Performance comparison with native implementations and discussion of
> > > optimisation steps.
> > >   - Formulation of a primitive in the DSL that permits an analytic to be
> > > run over multiple data sources.
> > >
> > > The paper, Towards Unified Secure On- and Off-line Analytics at Scale, is
> > > available free of charge from Elsevier:
> > >
> > > http://www.sciencedirect.com/science/article/pii/S0167819114000842
> > >
> > >
> > > I am one of the lead authors of the work, and would be more than happy to
> > > discuss any aspects which catch your attention!
> > >
> > > Peter
> > >
> > > --
> > > Peter Coetzee
> > > Performance Computing and Visualisation PhD Candidate
> > > Department of Computer Science
> > > University of Warwick
> >
