spark-issues mailing list archives

From "Varsha Chandrashekar (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-24065) Issue with the property IgnoreLeadingWhiteSpace
Date Tue, 24 Apr 2018 09:03:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-24065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varsha Chandrashekar updated SPARK-24065:
-----------------------------------------
    Description: 
The "ignoreLeadingWhiteSpace" property does not work properly in a corner case. Consider the
data below:
||"Col1"||"Col2"||"Col3"||
|   "A"  |   "Mark"   |   "US"   |
|   "B"   |   "Luke"   |   "UK"   |

Each cell contains leading and trailing white space. When I load the dataset with
"ignoreTrailingWhiteSpace" set to true, the trailing spaces are trimmed, which is correct.
But when I pass "ignoreLeadingWhiteSpace" as true, the leading spaces are not trimmed.
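
The expectation described above can be sketched in plain Scala. This is only an illustration of
the outcome the report expects for a single raw cell; `TrimSketch` and `applyTrim` are
hypothetical names, not Spark APIs, and the sketch deliberately ignores quote handling, which
the real CSV parser also performs.

```scala
// Hypothetical helper, not part of Spark: models the result the report expects
// from the two options when applied to one raw cell. Quotes are left untouched.
object TrimSketch {
  def applyTrim(cell: String, ignoreLeading: Boolean, ignoreTrailing: Boolean): String = {
    // Drop whitespace at the start of the cell only when requested.
    val afterLeading = if (ignoreLeading) cell.dropWhile(_.isWhitespace) else cell
    // Drop whitespace at the end of the cell only when requested.
    if (ignoreTrailing) afterLeading.reverse.dropWhile(_.isWhitespace).reverse
    else afterLeading
  }

  def main(args: Array[String]): Unit = {
    val raw = "   \"Mark\"   " // modelled on a cell from the sample data
    // Brackets make any remaining whitespace visible.
    println("[" + applyTrim(raw, ignoreLeading = false, ignoreTrailing = false) + "]")
    println("[" + applyTrim(raw, ignoreLeading = true, ignoreTrailing = false) + "]")
    println("[" + applyTrim(raw, ignoreLeading = false, ignoreTrailing = true) + "]")
  }
}
```

Under this model, enabling only the leading option would leave the trailing spaces in place, and
enabling only the trailing option would leave the leading spaces in place.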

The scenario was tested in spark-shell. Refer to the results below.

case 1: scala> var df=spark.read.format("com.databricks.spark.csv").option("delimiter",",").option("qualifier","\"").option("escape","\\").option("header","true").option("inferSchema","true").option("ignoreLeadingWhiteSpace",false).option("ignoreTrailingWhiteSpace",false).load("C:\\Users\\vachandrashekar\\Desktop\\lds1.txt")
 df: org.apache.spark.sql.DataFrame = [col1: string, Col2: string ... 1 more field]

scala> df.show()
+---------+----------+------------+
|Col1|Col2|Col3|
+---------+----------+------------+
| "A" | "Mark" | "US" |
| "B" | "Luke" | "UK" |
+---------+----------+------------+

case 2: scala> var df=spark.read.format("com.databricks.spark.csv").option("delimiter",",").option("qualifier","\"").option("escape","\\").option("header","true").option("inferSchema","true").option("ignoreLeadingWhiteSpace",true).option("ignoreTrailingWhiteSpace",false).load("C:\\Users\\vachandrashekar\\Desktop\\lds1.txt")
 df: org.apache.spark.sql.DataFrame = [col1: string, Col2: string ... 1 more field]

scala> df.show()
+----+-----+-----+
|Col1|Col2|Col3|
+----+-----+-----+
|   A|Mark|US|
|   B|  Luke|  UK|
+----+-----+-----+

case 3: scala> var df=spark.read.format("com.databricks.spark.csv").option("delimiter",",").option("qualifier","\"").option("escape","\\").option("header","true").option("inferSchema","true").option("ignoreLeadingWhiteSpace",false).option("ignoreTrailingWhiteSpace",true).load("C:\\Users\\vachandrashekar\\Desktop\\lds1.txt")
 df: org.apache.spark.sql.DataFrame = [col1: string, Col2: string ... 1 more field]

scala> df.show()
+------+---------+---------+
|col1|Col2|Col3|
+------+---------+---------+
|  "A"|  "Mark"|  "US"|
|  "B"|  "Luke"|  "UK"|
+------+---------+---------+

 

Analysis:

Case 1: Works fine. With both "ignoreLeadingWhiteSpace" and "ignoreTrailingWhiteSpace" set to
false, the data is displayed as it appears in the file.

 

Case 2: Not working. With "ignoreLeadingWhiteSpace" as true and "ignoreTrailingWhiteSpace"
as false, the trailing white space is trimmed and the leading white space is retained.

Leading white space is trimmed only for the second and third columns of the first row; the
first column of that row keeps it.
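
When reading the case 2 output, it may help to separate white space that belongs to a value
from padding added by `show()`, which by default right-aligns each cell to its column width.
The sketch below is plain Scala, not Spark code; `ShowPaddingSketch`, `renderCell` and
`bracket` are hypothetical helpers that only mimic that alignment so the two situations can be
compared.

```scala
// Hypothetical helpers, not Spark APIs: they imitate right-aligned cell rendering
// so that display padding can be told apart from whitespace stored in the value.
object ShowPaddingSketch {
  // Left-pad the value with spaces up to the column width, then add cell borders.
  def renderCell(value: String, width: Int): String =
    "|" + (" " * math.max(0, width - value.length)) + value + "|"

  // Wrap a value in brackets so whitespace inside it becomes visible.
  def bracket(value: String): String = "[" + value + "]"

  def main(args: Array[String]): Unit = {
    val width = "Col1".length        // assume the column is as wide as its header
    println(renderCell("A", width))  // prints "|   A|" although the value is just "A"
    println(bracket("A") + " length=" + "A".length)
  }
}
```

In spark-shell, a comparable check could select `length(col("Col1"))` (both functions are in
`org.apache.spark.sql.functions`) to see whether the stored value has length 1 or 4.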

 

Case 3: Works fine. With "ignoreLeadingWhiteSpace" as false and "ignoreTrailingWhiteSpace"
as true, only trailing white space is trimmed and leading white space is retained.

  was:
The "ignoreLeadingWhiteSpace" property does not work properly in a corner case. Consider the
data below:
||"Col1"||"Col2"||"Col3"||
|   "A"  |   "Mark"   |   "US"   |
|   "B"   |   "Luke"   |   "UK"   |

Each cell contains leading and trailing white space. When I load the dataset with
"ignoreTrailingWhiteSpace" set to true, the trailing spaces are trimmed, which is correct.
But when I pass "ignoreLeadingWhiteSpace" as true, the leading spaces are not trimmed.

The scenario was tested in spark-shell. Refer to the results below.

case 1: scala> var df=spark.read.format("com.databricks.spark.csv").option("delimiter",",").option("qualifier","\"").option("escape","\\").option("header","true").option("inferSchema","true").option("ignoreLeadingWhiteSpace",false).option("ignoreTrailingWhiteSpace",false).load("C:\\Users\\vachandrashekar\\Desktop\\lds1.txt")
df: org.apache.spark.sql.DataFrame = [col1: string, Col2: string ... 1 more field]

scala> df.show()
+---------+----------+------------+
| Col1| Col2| Col3|
+---------+----------+------------+
| "A" | "Mark" | "US" |
| "B" | "Luke" | "UK" |
+---------+----------+------------+

case 2: scala> var df=spark.read.format("com.databricks.spark.csv").option("delimiter",",").option("qualifier","\"").option("escape","\\").option("header","true").option("inferSchema","true").option("ignoreLeadingWhiteSpace",true).option("ignoreTrailingWhiteSpace",false).load("C:\\Users\\vachandrashekar\\Desktop\\lds1.txt")
df: org.apache.spark.sql.DataFrame = [col1: string, Col2: string ... 1 more field]

scala> df.show()
+----+-----+-----+
|Col1| Col2| Col3|
+----+-----+-----+
| A|Mark|US|
| B| Luke| UK|
+----+-----+-----+

case 3: scala> var df=spark.read.format("com.databricks.spark.csv").option("delimiter",",").option("qualifier","\"").option("escape","\\").option("header","true").option("inferSchema","true").option("ignoreLeadingWhiteSpace",false).option("ignoreTrailingWhiteSpace",true).load("C:\\Users\\vachandrashekar\\Desktop\\lds1.txt")
df: org.apache.spark.sql.DataFrame = [col1: string, Col2: string ... 1 more field]

scala> df.show()
+------+---------+---------+
| col1| Col2| Col3|
+------+---------+---------+
| "A"| "Mark"| "US"|
| "B"| "Luke"| "UK"|
+------+---------+---------+

 

Analysis:

Case 1: Works fine. With both "ignoreLeadingWhiteSpace" and "ignoreTrailingWhiteSpace" set to
false, the data is displayed as it appears in the file.

 

Case 2: Not working. With "ignoreLeadingWhiteSpace" as true and "ignoreTrailingWhiteSpace"
as false, the trailing white space is trimmed and the leading white space is retained.

Leading white space is trimmed only for the second and third columns of the first row; the
first column of that row keeps it.

 

Case 3: Works fine. With "ignoreLeadingWhiteSpace" as false and "ignoreTrailingWhiteSpace"
as true, only trailing white space is trimmed and leading white space is retained.


> Issue with the property IgnoreLeadingWhiteSpace
> -----------------------------------------------
>
>                 Key: SPARK-24065
>                 URL: https://issues.apache.org/jira/browse/SPARK-24065
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Varsha Chandrashekar
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

