spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-14726) Support for sampling when inferring schema in CSV data source
Date Tue, 04 Apr 2017 01:21:41 GMT

     [ https://issues.apache.org/jira/browse/SPARK-14726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-14726.
----------------------------------
    Resolution: Won't Fix

On reflection, it seems we do not need this for now unless many users request it.

A workaround is shown below:

{code}
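import spark.implicits._  // needed for .toDS
// Placeholder CSV lines, sampled without replacement at a fraction of 0.7.
val ds = Seq("a", "b", "c", "d").toDS.sample(false, 0.7)
// Infer the schema from the sampled lines only.
val sampledSchema = spark.read.option("inferSchema", true).csv(ds).schema
// Read the full data with the pre-inferred schema, skipping a second inference pass.
spark.read.schema(sampledSchema).csv("/tmp/path")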
{code}

In fact, this approach allows more flexible options, e.g., sampling with or without replacement,
filtering, or even just taking the first 100 lines with limit.
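
As a rough sketch of those variants (not a definitive implementation), the following reads the
file as raw text lines, narrows them down in several ways, and infers the schema from one of the
subsets. It assumes a Spark version in which csv(Dataset[String]) is available (2.2.0 or later,
to my knowledge) and a local SparkSession; the object name SampledCsvSchema, the helper inferFrom,
and the 0.1 fraction are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.types.StructType

object SampledCsvSchema {
  // Hypothetical helper: infer a CSV schema from a caller-chosen subset of lines.
  def inferFrom(spark: SparkSession, lines: Dataset[String]): StructType =
    spark.read.option("inferSchema", true).csv(lines).schema

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sampled-csv-schema")
      .getOrCreate()

    val path = "/tmp/path" // same placeholder path as in the workaround above

    // Read the file as raw text lines so it can be narrowed down before inference.
    val lines: Dataset[String] = spark.read.textFile(path)

    // Any of these subsets can feed schema inference:
    val withoutReplacement = lines.sample(false, 0.1)                  // sample without replacement
    val withReplacement    = lines.sample(true, 0.1)                   // sample with replacement
    val nonEmpty           = lines.filter(line => line.trim.nonEmpty)  // arbitrary filtering
    val firstHundred       = lines.limit(100)                          // just the first 100 lines

    // Infer from the chosen subset, then read the full data with that schema.
    val schema = inferFrom(spark, firstHundred)
    val df = spark.read.schema(schema).csv(path)
    df.printSchema()

    spark.stop()
  }
}
```

Note that a schema inferred from a subset may be narrower than what the full data requires (for
example, an integer column that holds a string further down the file), so values that do not fit
are handled according to the reader's parse mode when the full data is read.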

I will keep an eye on similar issues and reopen this if it seems many users want it.

Please reopen this if you, or anyone else, feel strongly that this should be supported as an
option.



> Support for sampling when inferring schema in CSV data source
> -------------------------------------------------------------
>
>                 Key: SPARK-14726
>                 URL: https://issues.apache.org/jira/browse/SPARK-14726
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Bomi Kim
>
> Currently, I am using CSV data source and trying to get used to Spark 2.0 because it
> has built-in CSV data source.
> I realized that CSV data source infers schema with all the data. JSON data source supports
> sampling ratio option.
> It would be great if CSV data source has this option too (or is this supported already?).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


