spark-dev mailing list archives

From Anchit Choudhry <anchit.choud...@gmail.com>
Subject Re: How to get the HDFS path for each RDD
Date Fri, 25 Sep 2015 03:25:37 GMT
Hi Fengdong,

Thanks for your question.

Spark's SparkContext already has a method called wholeTextFiles
which can help you with that:

Python

If you have the following files:

hdfs://a-hdfs-path/part-00000
hdfs://a-hdfs-path/part-00001
...
hdfs://a-hdfs-path/part-nnnnn

then

rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")

returns an RDD of (path, content) pairs:

(a-hdfs-path/part-00000, its content)
(a-hdfs-path/part-00001, its content)
...
(a-hdfs-path/part-nnnnn, its content)

More info: http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=wholetext#pyspark.SparkContext.wholeTextFiles
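To answer the original question: the (path, content) pairs from wholeTextFiles can be split into lines and each line tagged with its source path before saving. Here is a minimal sketch of the per-record transformation in plain Python (the function name tag_lines and the fake record are my own; in Spark you would pass tag_lines to rdd.flatMap and then call saveAsTextFile):

```python
import json

def tag_lines(record):
    """Given a (path, content) pair as returned by wholeTextFiles,
    yield each JSON line with a "source" field added."""
    path, content = record
    for line in content.splitlines():
        if not line.strip():
            continue  # skip blank lines
        obj = json.loads(line)
        obj["source"] = path  # tag the record with its HDFS location
        yield json.dumps(obj)

# In Spark this would be roughly:
#   sc.wholeTextFiles("/data/*/*").flatMap(tag_lines).saveAsTextFile(target)
# Local illustration with a fake (path, content) pair:
record = ("hdfs://a-hdfs-path/part-00000", '{"a": 1}\n{"a": 2}')
for tagged in tag_lines(record):
    print(tagged)
```

Note that wholeTextFiles reads each file as a single value, so it is best suited to files small enough to fit in memory per record.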

------------

Scala

val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")

More info: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD[(String,String)]

Let us know if this helps or if you need anything else.

Thanks,
Anchit Choudhry

On 24 September 2015 at 23:12, Fengdong Yu <fengdongy@everstring.com> wrote:

> Hi,
>
> I have multiple files in JSON format, such as:
>
> /data/test1_data/sub100/test.data
> /data/test2_data/sub200/test.data
>
>
> I can do sc.textFile("/data/*/*"),
>
> but I want to add {"source": "HDFS_LOCATION"} to each line, then save
> it to one target HDFS location.
>
> How can I do this? Thanks.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>
