From: Anchit Choudhry <anchit.choudhry@gmail.com>
Date: Fri, 25 Sep 2015 01:12:43 -0400
Subject: Re: How to get the HDFS path for each RDD
To: Fengdong Yu <fengdongy@everstring.com>
Cc: dev@spark.apache.org

Hi Fengdong,

So I created two files in HDFS under a test folder.
test/dt=20100101.json
{ "key1" : "value1" }

test/dt=20100102.json
{ "key2" : "value2" }

Then, inside the PySpark shell:

rdd = sc.wholeTextFiles('./test/*')
rdd.collect()
[(u'hdfs://localhost:9000/user/hduser/test/dt=20100101.json', u'{ "key1" : "value1" }'),
 (u'hdfs://localhost:9000/user/hduser/test/dt=20100102.json', u'{ "key2" : "value2" }')]

import json
def editMe(y, x):
    j = json.loads(y)
    j['source'] = x
    return j

rdd.map(lambda (x, y): editMe(y, x)).collect()
[{'source': u'hdfs://localhost:9000/user/hduser/test/dt=20100101.json', u'key1': u'value1'},
 {u'key2': u'value2', 'source': u'hdfs://localhost:9000/user/hduser/test/dt=20100102.json'}]

Similarly, you could modify the function to derive 'source' and 'date' from the path with some string manipulation, per your requirements (a rough sketch follows the quoted thread below).

Let me know if this helps.

Thanks,
Anchit

On 24 September 2015 at 23:55, Fengdong Yu <fengdongy@everstring.com> wrote:

> Yes. For example, I have two data sets:
>
> data set A: /data/test1/dt=20100101
> data set B: /data/test2/dt=20100202
>
> All the data has the same JSON format, such as:
> {"key1" : "value1", "key2" : "value2"}
>
> My expected output:
> {"key1" : "value1", "key2" : "value2", "source" : "test1", "date" : "20100101"}
> {"key1" : "value1", "key2" : "value2", "source" : "test2", "date" : "20100202"}
>
> On Sep 25, 2015, at 11:52, Anchit Choudhry <anchit.choudhry@gmail.com> wrote:
>
> Sure. May I ask for a sample input (could be just a few lines) and the output you are expecting, to bring clarity to my thoughts?
>
> On Thu, Sep 24, 2015, 23:44 Fengdong Yu <fengdongy@everstring.com> wrote:
>
>> Hi Anchit,
>>
>> Thanks for the quick answer.
>>
>> My exact question is: I want to add the HDFS location into each line of my JSON data.
>>
>> On Sep 25, 2015, at 11:25, Anchit Choudhry <anchit.choudhry@gmail.com> wrote:
>>
>> Hi Fengdong,
>>
>> Thanks for your question.
>>
>> Spark already has a function called wholeTextFiles on the SparkContext which can help you with that.
>>
>> Python
>>
>> Given files such as:
>>
>> hdfs://a-hdfs-path/part-00000
>> hdfs://a-hdfs-path/part-00001
>> ...
>> hdfs://a-hdfs-path/part-nnnnn
>>
>> rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>>
>> produces a pair RDD of (path, content):
>>
>> (a-hdfs-path/part-00000, its content)
>> (a-hdfs-path/part-00001, its content)
>> ...
>> (a-hdfs-path/part-nnnnn, its content)
>>
>> More info: http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=wholetext#pyspark.SparkContext.wholeTextFiles
>>
>> ------------
>>
>> Scala
>>
>> val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>>
>> More info: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD[(String,String)]
>>
>> Let us know if this helps or you need more help.
>>
>> Thanks,
>> Anchit Choudhry
>>
>> On 24 September 2015 at 23:12, Fengdong Yu <fengdongy@everstring.com> wrote:
>>
>>> Hi,
>>>
>>> I have multiple files in JSON format, such as:
>>>
>>> /data/test1_data/sub100/test.data
>>> /data/test2_data/sub200/test.data
>>>
>>> I can sc.textFile("/data/*/*"),
>>>
>>> but I want to add {"source" : "HDFS_LOCATION"} to each line and then save it to one target HDFS location.
>>>
>>> How do I do that? Thanks.
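[Editor's note] To make the 'source' and 'date' suggestion above concrete, here is a minimal PySpark sketch against Fengdong's /data/<source>/dt=<date> layout. The path-parsing rules and the /data/merged output location are assumptions for illustration, not part of the original thread:

import json

def tag_record(path, content):
    # Assumed layout: .../<source>/dt=<date>[/<file>] -- hypothetical.
    segments = path.rstrip('/').split('/')
    date_seg = next(s for s in segments if s.startswith('dt='))
    source = segments[segments.index(date_seg) - 1]    # e.g. 'test1'
    date = date_seg.split('=', 1)[1].split('.')[0]     # e.g. '20100101'
    tagged = []
    for line in content.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)                      # one JSON object per line
        record['source'] = source
        record['date'] = date
        tagged.append(json.dumps(record))
    return tagged

# wholeTextFiles yields one (path, whole-file content) pair per file,
# here spanning both data sets via a glob.
rdd = sc.wholeTextFiles('/data/*/dt=*')
rdd.flatMap(lambda kv: tag_record(kv[0], kv[1])) \
   .saveAsTextFile('/data/merged')                     # assumed single target location

Because wholeTextFiles loads each file's full content into a single record, this approach suits many small files; for large files a different strategy would be needed.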