accumulo-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Russ Weeks <rwe...@newbrightidea.com>
Subject Re: spark with AccumuloRowInputFormat?
Date Mon, 04 May 2015 16:22:17 GMT
Hi, Marc,

If your rows are small you can use the WholeRowIterator to get all the
values with the key in one consuming function. If your rows are big but you
know up-front that you'll only need a small part of each row, you could put
a filter in front of the WholeRowIterator.

I expect there's a performance hit (I haven't done any benchmarks myself)
because of the extra serialization/deserialization but it's a very
convenient way of working with Rows in Spark.

Regards,
-Russ

On Mon, May 4, 2015 at 8:46 AM, Marc Reichman <mreichman@pixelforensics.com>
wrote:

> Has anyone done any testing with Spark and AccumuloRowInputFormat? I have
> no problem doing this for AccumuloInputFormat:
>
> JavaPairRDD<Key, Value> pairRDD = sparkContext.newAPIHadoopRDD(job.getConfiguration(),
>         AccumuloInputFormat.class,
>         Key.class, Value.class);
>
> But I run into a snag trying to do a similar thing:
>
> JavaPairRDD<Text, PeekingIterator<Map.Entry<Key, Value>>> pairRDD =
sparkContext.newAPIHadoopRDD(job.getConfiguration(),
>         AccumuloRowInputFormat.class,
>         Text.class, PeekingIterator.class);
>
> The compilation error is (big, sorry):
>
> Error:(141, 97) java: method newAPIHadoopRDD in class org.apache.spark.api.java.JavaSparkContext
cannot be applied to given types;
>   required: org.apache.hadoop.conf.Configuration,java.lang.Class<F>,java.lang.Class<K>,java.lang.Class<V>
>   found: org.apache.hadoop.conf.Configuration,java.lang.Class<org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat>,java.lang.Class<org.apache.hadoop.io.Text>,java.lang.Class<org.apache.accumulo.core.util.PeekingIterator>
>   reason: inferred type does not conform to declared bound(s)
>     inferred: org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat
>     bound(s): org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.Text,org.apache.accumulo.core.util.PeekingIterator>
>
> I've tried a few things, the signature of the function is:
>
> public <K, V, F extends org.apache.hadoop.mapreduce.InputFormat<K, V>> JavaPairRDD<K,
V> newAPIHadoopRDD(Configuration conf, Class<F> fClass, Class<K> kClass, Class<V>
vClass)
>
> I guess it's having trouble with the format extending InputFormatBase with
> its own additional generic parameters (the Map.Entry inside
> PeekingIterator).
>
> This may be an issue to chase with Spark vs Accumulo, unless something can
> be tweaked on the Accumulo side or I could wrap the InputFormat with my own
> somehow.
>
> Accumulo 1.6.1, Spark 1.3.1, JDK 7u71.
>
> Stopping short of this, can anyone think of a good way to use
> AccumuloInputFormat to get what I'm getting from the Row version in a
> performant way? It doesn't necessarily have to be an iterator approach, but
> I'd need all my values with the key in one consuming function. I'm looking
> into ways to do it in spark functions but trying to avoid any major
> performance hits.
>
> Thanks,
>
> Marc
>
> p.s. The summit was absolutely great, thank you all for having it!
>
>

Mime
View raw message