From: Peter Coetzee
Date: Tue, 21 Oct 2014 09:23:12 +0100
Subject: Re: New research using Accumulo: Unified Secure On-/Off-line Analytics
To: user@accumulo.apache.org, kepner@ll.mit.edu

Hi Jeremy,

If you're viewing the PDF form of the paper (Elsevier's HTML rendering has some odd artefacts), there's a short explanation of the figure appearing just after it:

> At higher scales, CRUCIBLE’s Spark-HDFS environment can even be seen to
> outperform a native implementation making use of the more expressive Spark
> builtins. Performing bulk analysis through the use of Accumulo Iterators
> with CRUCIBLE was approximately 10x slower than the equivalent native
> implementation; with Spark on HDFS files, this is now almost 1.2x faster
> than the native implementation used.

The "native" implementations (i.e. hand-written by an engineer using the tools offered by the standard platform) are shown as dashed series on the chart, while the other series represent a single CRUCIBLE topology, compiled once and executed on a collection of runtimes (each of which is discussed in more detail earlier in the paper).
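
To make the two ratios in the excerpt concrete, here is a minimal sketch of how "10x slower" and "1.2x faster" can be read as runtime ratios against a native baseline. The function name and all runtimes are hypothetical, chosen only to reproduce the quoted ratios; they are not measurements from the paper, and the sketch assumes each CRUCIBLE run is compared against its own native counterpart.

```python
def speed_ratio(native_seconds, crucible_seconds):
    """Return native runtime divided by CRUCIBLE runtime.

    A ratio above 1 means the CRUCIBLE run is faster than its native
    counterpart; a ratio below 1 means it is slower.
    """
    if native_seconds <= 0 or crucible_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return native_seconds / crucible_seconds


# Hypothetical runtimes in seconds (illustrative only, NOT from the paper).
native_accumulo, crucible_iterators = 100.0, 1000.0  # CRUCIBLE 10x slower
native_spark, crucible_spark_hdfs = 120.0, 100.0     # CRUCIBLE 1.2x faster

print(speed_ratio(native_accumulo, crucible_iterators))  # 0.1
print(speed_ratio(native_spark, crucible_spark_hdfs))    # 1.2
```

On a chart of runtime against scale, a ratio above 1 corresponds to the CRUCIBLE series sitting below the dashed native series, and a ratio below 1 to it sitting above.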

By way of clarification: are you curious as to what the figure shows, or why those results are demonstrated?

Hope this helps somewhat.

Best regards,
Peter



On 21 October 2014 00:19, Jeremy Kepner <kepner@ll.mit.edu> wrote:
Hi Peter,
  Thanks. Can you clarify Figure 12 in the paper? I think I understand
what it is saying, but I am not 100% sure.

Regards.  -Jeremy

On Mon, Oct 20, 2014 at 09:00:51AM +0100, Peter Coetzee wrote:
> New open-access research published in the journal of Parallel Computing
> demonstrates a novel approach to engineering analytics for deployment in
> streaming and batch contexts.
>
> Increasing numbers of users are extracting real value from their data using
> tools like IBM InfoSphere Streams for near-real-time analysis and Apache
> Spark across their historical data in Accumulo.
>
> Until now, there hasn't been an approach which permits the use of these
> tools from a single shared codebase, with deployment considerations
> reserved until deployment time. Furthermore, it has been even harder to
> permit this unified analysis while maintaining cell-level traces of the
> security heritage for each datum an analytic produces.
>
> Some highlights of the paper include:
>   - A domain-specific language (CRUCIBLE) and runtime models for on- and
>     off-line data analytics.
>   - Detailed analysis of CRUCIBLE’s runtime performance in state-of-the-art
>     environments.
>   - Development and detailed analysis of a set of runtime models for new
>     environments.
>   - Performance comparison with native implementations and discussion of
>     optimisation steps.
>   - Formulation of a primitive in the DSL that permits an analytic to be
>     run over multiple data sources.
>
> The paper, Towards Unified Secure On- and Off-line Analytics at Scale, is
> available free of charge from Elsevier:
>
> http://www.sciencedirect.com/science/article/pii/S0167819114000842
>
>
> I am one of the lead authors of the work, and would be more than happy to
> discuss any aspects which catch your attention!
>
> Peter
>
> --
> Peter Coetzee
> Performance Computing and Visualisation PhD Candidate
> Department of Computer Science
> University of Warwick
