From: Hyukjin Kwon
Date: Wed, 7 Dec 2016 19:19:05 +0900
Subject: Re: get corrupted rows using columnNameOfCorruptRecord
To: Yehuda Finkelstein
Cc: Michael Armbrust, user@spark.apache.org

Let me extend the suggestion a bit more verbosely. I think you could try something like this:

val jsonDF = spark.read
  .option("columnNameOfCorruptRecord", "xxx")
  .option("mode", "PERMISSIVE")
  .schema(StructType(schema.fields :+ StructField("xxx", StringType, true)))
  .json(corruptRecords)

val malformed = jsonDF.filter("xxx is not null").select("xxx")
malformed.show()

This prints something like the output below:

+------------+
|         xxx|
+------------+
|           {|
|{"a":1, b:2}|
|{"a":{, b:3}|
|           ]|
+------------+

If the schema is not specified, the inferred schema includes the malformed column automatically, but when the schema is specified explicitly, I believe this column has to be added manually.
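To connect the two halves of this thread, here is a minimal sketch of the same idea applied to the setup quoted below (schema pulled from an existing table, JSON lines read from /tmp/x, corrupt-record column named "xxx"); the shortened column list and the schemaWithCorrupt helper are illustrative assumptions, not code from the thread:

import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Schema of the target table (illustrative column names); the thread builds
// this from a "select ... limit 1" query against an existing table.
val df_schema = spark.sqlContext.sql("select c1, c2 from t1 limit 1")

// Append a nullable string field so PERMISSIVE mode has a place to store
// records that could not be parsed.
val schemaWithCorrupt =
  StructType(df_schema.schema.fields :+ StructField("xxx", StringType, nullable = true))

// Read the raw JSON lines and parse them with the extended schema.
val f = sc.textFile("/tmp/x")
val df = spark.sqlContext.read
  .option("columnNameOfCorruptRecord", "xxx")
  .option("mode", "PERMISSIVE")
  .schema(schemaWithCorrupt)
  .json(f)

// Corrupted rows are exactly those where the extra column is populated.
df.filter("xxx is not null").select("xxx").show(false)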
2016-12-07 18:06 GMT+09:00 Yehuda Finkelstein <yehuda@veracity-group.com>:

> Hi
>
> I tried it already but it says that this column doesn't exist.
>
> scala> var df = spark.sqlContext.read.
>      | option("columnNameOfCorruptRecord","xxx").
>      | option("mode","PERMISSIVE").
>      | schema(df_schema.schema).json(f)
> df: org.apache.spark.sql.DataFrame = [auctionid: string, timestamp: string ... 37 more fields]
>
> scala> df.select
> select   selectExpr
>
> scala> df.select("xxx").show
> org.apache.spark.sql.AnalysisException: cannot resolve '`xxx`' given input columns: […];;
>
>   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:308)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:308)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:269)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:279)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:283)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:283)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$8.apply(QueryPlan.scala:288)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:288)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
>   at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2603)
>   at org.apache.spark.sql.Dataset.select(Dataset.scala:969)
>   at org.apache.spark.sql.Dataset.select(Dataset.scala:987)
>   ... 48 elided
>
> scala>
>
> From: Michael Armbrust [mailto:michael@databricks.com]
> Sent: Tuesday, December 06, 2016 10:26 PM
> To: Yehuda Finkelstein
> Cc: user
> Subject: Re: get corrupted rows using columnNameOfCorruptRecord
>
> .where("xxx IS NOT NULL") will give you the rows that couldn't be parsed.
>
> On Tue, Dec 6, 2016 at 6:31 AM, Yehuda Finkelstein <yehuda@veracity-group.com> wrote:
>
> Hi all
>
> I'm trying to parse JSON using an existing schema and got rows with NULLs.
>
> // get schema
> val df_schema = spark.sqlContext.sql("select c1,c2,…cn t1 limit 1")
>
> // read json file
> val f = sc.textFile("/tmp/x")
>
> // load json into data frame using schema
> var df = spark.sqlContext.read.option("columnNameOfCorruptRecord","xxx").option("mode","PERMISSIVE").schema(df_schema.schema).json(f)
>
> In the documentation it says that you can query the corrupted rows via this column -> columnNameOfCorruptRecord:
>
> o "columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord): allows renaming the new field having malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord."
>
> The question is how to fetch those corrupted rows?
>
> Thanks
>
> Yehuda
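As a counterpart to the explicit-schema variant above, here is a short sketch of the inference path described in the reply at the top of this message (no schema given, so the renamed corrupt-record column appears on its own when malformed lines are present); the file path and names are again illustrative, not from the thread:

// With no explicit schema, PERMISSIVE mode adds the corrupt-record column
// (renamed to "xxx" here) to the inferred schema whenever malformed lines
// are present, so it can be queried directly.
val inferred = spark.read
  .option("columnNameOfCorruptRecord", "xxx")
  .option("mode", "PERMISSIVE")
  .json("/tmp/x")

inferred.printSchema()  // "xxx" appears among the fields if corrupt lines exist

// Split the data: rows that parsed vs. the raw text of rows that did not,
// following the .where("xxx IS NOT NULL") suggestion from the thread.
val parsed    = inferred.where("xxx IS NULL").drop("xxx")
val corrupted = inferred.where("xxx IS NOT NULL").select("xxx")
corrupted.show(false)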