spark-dev mailing list archives

From Felix Cheung <felixcheun...@hotmail.com>
Subject Re: Question about SPARK-11374 (skip.header.line.count)
Date Sat, 10 Dec 2016 19:00:47 GMT
+1. I think it's useful to always have a pure SQL way to skip the header for the plain text / CSV
files that lots of companies have.


________________________________
From: Dongjoon Hyun <dongjoon@apache.org>
Sent: Friday, December 9, 2016 9:42:58 AM
To: Dongjin Lee; dev@spark.apache.org
Subject: Re: Question about SPARK-11374 (skip.header.line.count)

Thank you for the opinion, Dongjin!


On Thu, Dec 8, 2016 at 21:56 Dongjin Lee <dongjin@apache.org> wrote:
+1 For this idea. I need it also.

Regards,
Dongjin

On Fri, Dec 9, 2016 at 8:59 AM, Dongjoon Hyun <dongjoon@apache.org> wrote:
Hi, All.

Could you give me your opinions?

There is an old Spark issue, SPARK-11374, about removing header lines from text files.
Currently, Spark supports removing CSV header lines in the following way.

```
scala> spark.read.option("header","true").csv("/data").show
+---+---+
| c1| c2|
+---+---+
|  1|  a|
|  2|  b|
+---+---+
```

In the SQL world, we could support the same thing the Hive way, with the `skip.header.line.count` table property.

```
scala> sql("CREATE TABLE t1 (id INT, value VARCHAR(10)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/data' TBLPROPERTIES('skip.header.line.count'='1')")

scala> sql("SELECT * FROM t1").show
+---+-----+
| id|value|
+---+-----+
|  1|    a|
|  2|    b|
+---+-----+
```

Although I made a PR for this based on the JIRA issue, I want to know whether this is a really needed
feature.
Is it needed for your use cases? Or is it enough for you to remove the header lines in a preprocessing stage?
If this is too old and no longer appropriate these days, I'll close the PR and the JIRA issue as WON'T
FIX.
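
For comparison, the preprocessing alternative mentioned above could be sketched roughly as follows. This is only an illustration in plain Scala with no Spark dependency: `skipHeaderLines` is a hypothetical helper name, not an existing Spark or Hive API, and the sample lines are assumed from the `/data` example earlier in this mail.

```scala
// Hypothetical helper (not part of Spark): drop the first `count` lines
// of a single file's contents, similar in spirit to skip.header.line.count.
def skipHeaderLines(lines: Iterator[String], count: Int): Iterator[String] = {
  require(count >= 0, "count must be non-negative")
  lines.drop(count)
}

// Assumed sample contents: one header line followed by two data rows.
val sample = Seq("c1,c2", "1,a", "2,b")

// Skip one header line and keep the remaining rows.
val rows = skipHeaderLines(sample.iterator, 1).toList
println(rows.mkString("\n"))
```

The helper has to be applied once per file, because every file under the location carries its own header. In a Spark job that could mean going through something like `SparkContext.wholeTextFiles`, which reads each file whole, so this is a sketch of the idea rather than a recommendation.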


Thank you all in advance!

Best,
Dongjoon.

--
Dongjin Lee

Software developer in Line+.
So interested in massive-scale machine learning.

facebook: www.facebook.com/dongjin.lee.kr
linkedin: kr.linkedin.com/in/dongjinleekr
github: github.com/dongjinleekr
twitter: www.twitter.com/dongjinleekr