spark-issues mailing list archives

From "Jyoti Misra (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store
Date Thu, 05 May 2016 08:06:13 GMT

    [ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272048#comment-15272048
] 

Jyoti Misra edited comment on SPARK-2365 at 5/5/16 8:05 AM:
------------------------------------------------------------

We have migrated our application to Spark, and all the use cases work very well except updating
RDDs.
Ankur's IndexedRDD is a ray of hope for us to improve the performance of this use case as
well.

However, we have not been able to use it from Java, and the examples we have found online
are all in Scala.

When we try to convert a Java RDD to an IndexedRDD (https://github.com/amplab/spark-indexedrdd),
we get a ClassCastException.
Is there any way to do this conversion?

Below is the code snippet:

JavaPairRDD<String, String> mappedRDD = lines.flatMapToPair(
    new PairFlatMapFunction<String, String, String>() {
        @Override
        public Iterable<Tuple2<String, String>> call(String line) throws Exception {
            // Split each line into at most two parts: key and value.
            String[] arr = line.split(" ", 2);
            System.out.println("length " + arr.length);
            List<Tuple2<String, String>> results = new ArrayList<Tuple2<String, String>>();
            // Emit a (key, value) pair only when the line has both parts.
            if (arr.length == 2) {
                results.add(new Tuple2<String, String>(arr[0], arr[1]));
            }
            return results;
        }
    });
// collectAsMap() returns a java.util.Map, not an RDD.
IndexedRDD<String, String> test = (IndexedRDD<String, String>) mappedRDD.collectAsMap();

The cast above throws a ClassCastException.

We also tried the following code:
IndexedRDD<String,String> test = new IndexedRDD<String,String>(mappedRDD.rdd());
This line gives a compile-time error: The constructor IndexedRDD<String,String>(JavaPairRDD<String,String>)
is undefined

We are using Spark version 1.4.1:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.4.1</version>
</dependency>

We would appreciate any help on this.

We have also posted this question at http://stackoverflow.com/questions/32137484/spark-rdd-to-update



> Add IndexedRDD, an efficient updatable key-value store
> ------------------------------------------------------
>
>                 Key: SPARK-2365
>                 URL: https://issues.apache.org/jira/browse/SPARK-2365
>             Project: Spark
>          Issue Type: New Feature
>          Components: GraphX, Spark Core
>            Reporter: Ankur Dave
>            Assignee: Ankur Dave
>         Attachments: 2014-07-07-IndexedRDD-design-review.pdf
>
>
> RDDs currently provide a bulk-updatable, iterator-based interface. This imposes minimal
> requirements on the storage layer, which only needs to support sequential access, enabling
> on-disk and serialized storage.
> However, many applications would benefit from a richer interface. Efficient support for
> point lookups would enable serving data out of RDDs, but it currently requires iterating over
> an entire partition to find the desired element. Point updates similarly require copying an
> entire iterator. Joins are also expensive, requiring a shuffle and local hash joins.
> To address these problems, we propose IndexedRDD, an efficient key-value store built
> on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key uniqueness and pre-indexing
> the entries for efficient joins and point lookups, updates, and deletions.
> It would be implemented by (1) hash-partitioning the entries by key, (2) maintaining
> a hash index within each partition, and (3) using purely functional (immutable and efficiently
> updatable) data structures to enable efficient modifications and deletions.
> GraphX would be the first user of IndexedRDD, since it currently implements a limited
> form of this functionality in VertexRDD. We envision a variety of other uses for IndexedRDD,
> including streaming updates to RDDs, direct serving from RDDs, and as an execution strategy
> for Spark SQL.
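
The three implementation steps in the description above can be illustrated with a toy,
single-JVM sketch in plain Java. This is not the spark-indexedrdd code and does not use
Spark at all: the class name ToyIndexedStore and its methods are hypothetical, and the
copy-on-write HashMap is a simple stand-in for the efficiently updatable persistent
structures the proposal calls for (each update here copies one whole partition).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy class for illustration only; not part of Spark or spark-indexedrdd.
final class ToyIndexedStore<V> {
    // One hash index per partition. Maps are never mutated after being stored.
    private final List<Map<Long, V>> partitions;

    private ToyIndexedStore(List<Map<Long, V>> partitions) {
        this.partitions = partitions;
    }

    // Creates an empty store with a fixed number of partitions.
    static <V> ToyIndexedStore<V> empty(int numPartitions) {
        List<Map<Long, V>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(Collections.<Long, V>emptyMap());
        }
        return new ToyIndexedStore<>(parts);
    }

    // (1) Hash-partitioning: each key belongs to exactly one partition.
    private int partitionOf(long key) {
        return Math.floorMod(Long.hashCode(key), partitions.size());
    }

    // (2) Point lookup through one partition's hash index, without scanning it.
    V get(long key) {
        return partitions.get(partitionOf(key)).get(key);
    }

    // (3) Functional update: returns a new store and leaves this one unchanged.
    // Using a map also enforces key uniqueness: a second put replaces the value.
    ToyIndexedStore<V> put(long key, V value) {
        int p = partitionOf(key);
        Map<Long, V> updated = new HashMap<>(partitions.get(p));
        updated.put(key, value);
        return withPartition(p, updated);
    }

    // Functional delete, with the same sharing behaviour as put.
    ToyIndexedStore<V> delete(long key) {
        int p = partitionOf(key);
        Map<Long, V> updated = new HashMap<>(partitions.get(p));
        updated.remove(key);
        return withPartition(p, updated);
    }

    // Replaces one partition; all other partitions are shared with the old store.
    private ToyIndexedStore<V> withPartition(int p, Map<Long, V> updated) {
        List<Map<Long, V>> parts = new ArrayList<>(partitions);
        parts.set(p, Collections.unmodifiableMap(updated));
        return new ToyIndexedStore<>(parts);
    }

    public static void main(String[] args) {
        ToyIndexedStore<String> v1 =
            ToyIndexedStore.<String>empty(4).put(1L, "a").put(2L, "b");
        ToyIndexedStore<String> v2 = v1.put(1L, "z");
        System.out.println(v1.get(1L)); // prints "a": the old version is untouched
        System.out.println(v2.get(1L)); // prints "z"
    }
}
```

Because put and delete return a new store instead of mutating the old one, earlier
versions stay readable, which mirrors how RDD transformations leave their parents intact.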



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

