From: Marc Reichman <mreichman@pixelforensics.com>
To: user@accumulo.apache.org
Date: Mon, 4 May 2015 11:51:37 -0500
Subject: Re: spark with AccumuloRowInputFormat?

Hi Russ,

How exactly would this work regarding column qualifiers, etc., as those are part of the key? I apologize, but I'm not as familiar with the WholeRowIterator use model; does it consolidate based on the row key and then return some Key+Value "value" which has all the original information serialized?

My rows aren't gigantic, but they can occasionally get into the tens of MB.
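
Something like this is what I'm imagining for the consuming side (an untested sketch on my part; decodeRow is the static helper on WholeRowIterator):

import java.io.IOException;
import java.util.Map;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.user.WholeRowIterator;

// One scan entry = one encoded row; decodeRow() unpacks it back into the
// original cells, with family, qualifier, visibility and timestamp intact.
static void dumpRow(Key rowKey, Value rowValue) throws IOException {
  for (Map.Entry<Key, Value> cell : WholeRowIterator.decodeRow(rowKey, rowValue).entrySet()) {
    System.out.println(cell.getKey() + " -> " + cell.getValue());
  }
}

If that's right, it would answer my qualifier question too, since they come back inside the decoded Keys.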

On Mon, May 4, 2015 at 11:22 AM, Russ Weeks <rweeks@newbrightidea.com> wrote:
Hi, Marc,

If your rows are small, you can use the WholeRowIterator to get all the values with the key in one consuming function. If your rows are big but you know up front that you'll only need a small part of each row, you could put a filter in front of the WholeRowIterator.
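
The job setup might look something like this (a rough sketch; the priority and name are arbitrary, and MyColumnFilter is just a stand-in for whatever filter class you'd use):

import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.iterators.user.WholeRowIterator;

// Optional: a filter at a lower priority runs before the row gets bundled up.
// AccumuloInputFormat.addIterator(job, new IteratorSetting(20, "myFilter", MyColumnFilter.class)); // hypothetical filter class

// Bundle each row into a single Key/Value pair for the scan.
AccumuloInputFormat.addIterator(job, new IteratorSetting(30, "wholeRow", WholeRowIterator.class));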

I expect there's a performance hit (I haven't done any benchmarks myself) because of the extra serialization/deserialization, but it's a very convenient way of working with rows in Spark.

Regards,
-Russ

On Mon, May 4, 2015 at 8:46 AM, Marc Reichman <mreichman@pixelforensics.com> wrote:

Has anyone done any testing with Spark and AccumuloRowInputFormat? I have no problem doing this for AccumuloInputFormat:

JavaPairRDD<Key, Value> pairRDD =3D sparkContext.newAPIHadoopRDD(job= .getConfiguration(),
AccumuloInputFormat.class,
Key.class, Value.class);But I run into a snag t= rying to do a similar thing:
JavaPairRDD<Text, PeekingIterator&=
lt;Map.Entry<Key, Value>>> pairRDD =3D sparkContext.newAPIHadoo=
pRDD(job.getConfiguration(),
AccumuloRowInputFormat.class,
Text.class, PeekingIterat= or.class);

The compilation error is (big, sorry):

Error:(141, 97) java: method newAPIHadoopRDD in class org.apache.spark.api.java.JavaSparkContext cannot be applied to given types;
  required: org.apache.hadoop.conf.Configuration,java.lang.Class<F>,java.lang.Class<K>,java.lang.Class<V>
  found: org.apache.hadoop.conf.Configuration,java.lang.Class<org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat>,java.lang.Class<org.apache.hadoop.io.Text>,java.lang.Class<org.apache.accumulo.core.util.PeekingIterator>
  reason: inferred type does not conform to declared bound(s)
    inferred: org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat
    bound(s): org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.Text,org.apache.accumulo.core.util.PeekingIterator>

I've tried a few things; the signature of the function is:

public <K, V, F extends org.apache.hadoop.mapreduce.InputFormat<K, V>> JavaPairRDD<K, V> newAPIHadoopRDD(Configuration conf, Class<F> fClass, Class<K> kClass, Class<V> vClass)

I guess it's having trouble with the format extending InputFormatBase with its own additional generic parameters (the Map.Entry inside PeekingIterator).
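
If I'm reading the 1.6 source right, the declaration is roughly:

public class AccumuloRowInputFormat
    extends InputFormatBase<Text, PeekingIterator<Map.Entry<Key, Value>>> { /* ... */ }

so the raw PeekingIterator.class literal can't tell the compiler about the Map.Entry type argument, and the inferred V never conforms to the InputFormat<K, V> bound.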

This may be an issue to chase with Spark rather than Accumulo, unless something can be tweaked on the Accumulo side or I could wrap the InputFormat with my own somehow.
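
One thing I may try in the meantime is forcing it through with raw types and unchecked casts, along these lines (untested, and admittedly ugly):

// Raw Class literals defeat the bound check; the call then compiles with
// unchecked warnings, and we assert the real types on the result.
@SuppressWarnings({"unchecked", "rawtypes"})
JavaPairRDD<Text, PeekingIterator<Map.Entry<Key, Value>>> pairRDD =
    sparkContext.newAPIHadoopRDD(job.getConfiguration(),
        (Class) AccumuloRowInputFormat.class,
        Text.class,
        (Class) PeekingIterator.class);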

Accumulo 1.6.1, Spark 1.3.1, JDK 7u71.

Stopping short of this, can anyone think of a good way to use AccumuloInputFormat to get what I'm getting from the Row version in a performant way? It doesn't necessarily have to be an iterator approach, but I'd need all my values with the key in one consuming function. I'm looking into ways to do it in Spark functions, but I'm trying to avoid any major performance hits.
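
The direction I've been sketching looks like this (untested, and Java 7 so no lambdas; it assumes the shuffle can handle Accumulo's Writable types, e.g. via Kryo, and groupByKey pulls a whole row into memory, so wide rows would cost):

import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

JavaPairRDD<Key, Value> pairRDD = sparkContext.newAPIHadoopRDD(job.getConfiguration(),
    AccumuloInputFormat.class, Key.class, Value.class);

// Re-key each entry by its row id, copying first because Hadoop
// RecordReaders reuse their Key/Value instances.
JavaPairRDD<String, Tuple2<Key, Value>> byRow = pairRDD.mapToPair(
    new PairFunction<Tuple2<Key, Value>, String, Tuple2<Key, Value>>() {
      @Override
      public Tuple2<String, Tuple2<Key, Value>> call(Tuple2<Key, Value> e) {
        Key k = new Key(e._1());
        Value v = new Value(e._2().get());
        return new Tuple2<String, Tuple2<Key, Value>>(k.getRow().toString(),
            new Tuple2<Key, Value>(k, v));
      }
    });

// Each group now carries every column of one row for a single consuming function.
JavaPairRDD<String, Iterable<Tuple2<Key, Value>>> rows = byRow.groupByKey();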

Thanks,
Marc

P.S. The summit was absolutely great, thank you all for having it!

