From: Marc Reichman <mreichman@pixelforensics.com>
To: user@accumulo.apache.org
Date: Mon, 4 May 2015 11:51:37 -0500
Subject: Re: spark with AccumuloRowInputFormat?

Hi Russ,

How exactly would this work regarding column qualifiers, etc., as those are part of the key? I apologize, but I'm not as familiar with the WholeRowIterator use model; does it consolidate based on the row key and then return some Key+Value "value" which has all the original information serialized?

My rows aren't gigantic, but they can occasionally get into the tens of MB.
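
Something like this is what I'm imagining for the consuming side (an untested sketch on my part; decodeRow is the static helper on WholeRowIterator):

import java.io.IOException;
import java.util.Map;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.user.WholeRowIterator;

// One scan entry = one encoded row; decodeRow() unpacks it back into the
// original cells, with family, qualifier, visibility and timestamp intact.
static void dumpRow(Key rowKey, Value rowValue) throws IOException {
  for (Map.Entry<Key, Value> cell : WholeRowIterator.decodeRow(rowKey, rowValue).entrySet()) {
    System.out.println(cell.getKey() + " -> " + cell.getValue());
  }
}

If that's right, it would answer my qualifier question too, since they come back inside the decoded Keys.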

On Mon, May 4, 2015 at 11:22 AM, Russ Weeks <rweeks@newbrightidea.com> wrote:
Hi, Marc,

If your rows are small, you can use the WholeRowIterator to get all the values with the key in one consuming function. If your rows are big but you know up front that you'll only need a small part of each row, you could put a filter in front of the WholeRowIterator.
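
The job setup might look something like this (a rough sketch; the priority and name are arbitrary, and MyColumnFilter is just a stand-in for whatever filter class you'd use):

import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.iterators.user.WholeRowIterator;

// Optional: a filter at a lower priority runs before the row gets bundled up.
// AccumuloInputFormat.addIterator(job, new IteratorSetting(20, "myFilter", MyColumnFilter.class)); // hypothetical filter class

// Bundle each row into a single Key/Value pair for the scan.
AccumuloInputFormat.addIterator(job, new IteratorSetting(30, "wholeRow", WholeRowIterator.class));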

I expect there's a performance hit (I haven't done any benchmarks myself) because of the extra serialization/deserialization, but it's a very convenient way of working with rows in Spark.

Regards,
-Russ

On Mon, May 4, 2015 at 8:46 AM, Marc Reichman <mreichman@pixelforensics.com> wrote:

Has anyone done any testing with Spark and AccumuloRowInputFormat? I have no problem doing this for AccumuloInputFormat:

JavaPairRDD<Key, Value> pairRDD =3D sparkContext.newAPIHadoopRDD(job= .getConfiguration(),
AccumuloInputFormat.class,
Key.class, Value.class);But I run into a snag t= rying to do a similar thing:
JavaPairRDD<Text, PeekingIterator&=
lt;Map.Entry<Key, Value>>> pairRDD =3D sparkContext.newAPIHadoo=
pRDD(job.getConfiguration(),
AccumuloRowInputFormat.class,
Text.class, PeekingIterat= or.class);

The compilation error is (big, sorry):

Error:(141, 97) java: method newAPIHadoopRDD in class org.apache.spark.api.java.JavaSparkContext cannot be applied to given types;
  required: org.apache.hadoop.conf.Configuration,java.lang.Class<F>,java.lang.Class<K>,java.lang.Class<V>
  found: org.apache.hadoop.conf.Configuration,java.lang.Class<org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat>,java.lang.Class<org.apache.hadoop.io.Text>,java.lang.Class<org.apache.accumulo.core.util.PeekingIterator>
  reason: inferred type does not conform to declared bound(s)
    inferred: org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat
    bound(s): org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.Text,org.apache.accumulo.core.util.PeekingIterator>

I've tried a few things; the signature of the function is:

public <K, V, F extends org.apache.hadoop.mapreduce.InputFormat<K, V>> JavaPairRDD<K, V> newAPIHadoopRDD(Configuration conf, Class<F> fClass, Class<K> kClass, Class<V> vClass)

I guess it's having trouble with the format extending InputFormatBase with its own additional generic parameters (the Map.Entry inside PeekingIterator).
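
If I'm reading the 1.6 source right, the declaration is roughly:

public class AccumuloRowInputFormat
    extends InputFormatBase<Text, PeekingIterator<Map.Entry<Key, Value>>> { /* ... */ }

so the raw PeekingIterator.class literal can't tell the compiler about the Map.Entry type argument, and the inferred V never conforms to the InputFormat<K, V> bound.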

This may be an issue to chase with Spark rather than Accumulo, unless something can be tweaked on the Accumulo side or I could wrap the InputFormat with my own somehow.
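
One thing I may try in the meantime is forcing it through with raw types and unchecked casts, along these lines (untested, and admittedly ugly):

// Raw Class literals defeat the bound check; the call then compiles with
// unchecked warnings, and we assert the real types on the result.
@SuppressWarnings({"unchecked", "rawtypes"})
JavaPairRDD<Text, PeekingIterator<Map.Entry<Key, Value>>> pairRDD =
    sparkContext.newAPIHadoopRDD(job.getConfiguration(),
        (Class) AccumuloRowInputFormat.class,
        Text.class,
        (Class) PeekingIterator.class);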

Accumulo 1.6.1, Spark 1.3.1, JDK 7u71.

Stopping short of this, can anyone think of a good way to use AccumuloInputFormat to get what I'm getting from the Row version in a performant way? It doesn't necessarily have to be an iterator approach, but I'd need all my values with the key in one consuming function. I'm looking into ways to do it in Spark functions, but I'm trying to avoid any major performance hits.
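
The direction I've been sketching looks like this (untested, and Java 7 so no lambdas; it assumes the shuffle can handle Accumulo's Writable types, e.g. via Kryo, and groupByKey pulls a whole row into memory, so wide rows would cost):

import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

JavaPairRDD<Key, Value> pairRDD = sparkContext.newAPIHadoopRDD(job.getConfiguration(),
    AccumuloInputFormat.class, Key.class, Value.class);

// Re-key each entry by its row id, copying first because Hadoop
// RecordReaders reuse their Key/Value instances.
JavaPairRDD<String, Tuple2<Key, Value>> byRow = pairRDD.mapToPair(
    new PairFunction<Tuple2<Key, Value>, String, Tuple2<Key, Value>>() {
      @Override
      public Tuple2<String, Tuple2<Key, Value>> call(Tuple2<Key, Value> e) {
        Key k = new Key(e._1());
        Value v = new Value(e._2().get());
        return new Tuple2<String, Tuple2<Key, Value>>(k.getRow().toString(),
            new Tuple2<Key, Value>(k, v));
      }
    });

// Each group now carries every column of one row for a single consuming function.
JavaPairRDD<String, Iterable<Tuple2<Key, Value>>> rows = byRow.groupByKey();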

Thanks,
Marc

P.S. The summit was absolutely great, thank you all for having it!

