spark-user mailing list archives

From "Mich Talebzadeh" <m...@peridale.co.uk>
Subject RE: Checking for null values when mapping
Date Sat, 20 Feb 2016 14:31:18 GMT
Yes, I did that as well, but no joy. My shell converts Windows files automatically.

 

Thanks, 

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This message is for the
designated recipient only; if you are not the intended recipient, you should destroy it immediately.
Any information in this message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility
of the recipient to ensure that this email is virus free; therefore neither Peridale Technology
Ltd, its subsidiaries nor their employees accept any responsibility.

 

 

From: Chandeep Singh [mailto:cs@chandeep.com] 
Sent: 20 February 2016 14:27
To: Mich Talebzadeh <mich@peridale.co.uk>
Cc: user @spark <user@spark.apache.org>
Subject: Re: Checking for null values when mapping

 

Also, have you looked into Dos2Unix (http://dos2unix.sourceforge.net/)?

 

It has helped me in the past to deal with special characters when using Windows-based CSVs
in Linux. (Might not be the solution here; just an FYI :))
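
Another thought: I believe spark-csv also has a charset option (it defaults to UTF-8), so if the £ is getting mangled on read, you could try forcing the encoding at the source instead of stripping the ? afterwards. A rough sketch, assuming the file is actually Latin-1 (untested):

val df = HiveContext.read.format("com.databricks.spark.csv")
  .option("charset", "ISO-8859-1") // assumption: the source file is Latin-1 encoded
  .option("inferSchema", "true")
  .option("header", "true")
  .load("/data/stg/table2")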

 

On Feb 20, 2016, at 2:17 PM, Chandeep Singh <cs@chandeep.com> wrote:

 

Understood. In that case Ted’s suggestion to check the length should solve the problem.
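
Something along these lines should do it (the helper name and the 0.0 default are just placeholders, untested). Note a bare "".toDouble would still throw a NumberFormatException, so the guard covers that too:

// Parse "?1,187.50"-style cells; fall back to 0.0 for empty/short cells
def toAmount(s: String): Double =
  if (s != null && s.length > 1) s.substring(1).replace(",", "").toDouble
  else 0.0

val a = df.map(x => (x.getString(0), x.getString(1),
  toAmount(x.getString(2)), toAmount(x.getString(3)), toAmount(x.getString(4))))

And since that is a plain Scala function applied inside map, there is no UDF involved.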

 

On Feb 20, 2016, at 2:09 PM, Mich Talebzadeh <mich@peridale.co.uk> wrote:

 

Hi,

 

That is a good question.

 

When the data is exported from CSV to Linux, any character that cannot be converted is replaced
by ?. That question mark is not actually the expected “?” :)

 

So the only way I can get rid of it is by dropping the first character using substring(1).
I checked, and I did the same in Hive SQL.

 

The actual field in the CSV is “£2,500.00”, which translates into “?2,500.00”.

 

HTH

 

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 


 

 

From: Chandeep Singh [mailto:cs@chandeep.com] 
Sent: 20 February 2016 13:47
To: Mich Talebzadeh <mich@peridale.co.uk>
Cc: user @spark <user@spark.apache.org>
Subject: Re: Checking for null values when mapping

 

Looks like you’re using substring just to get rid of the ‘?’. Why not use replace for
that as well? Then you wouldn’t run into index out of bounds issues.

 

val a = "?1,187.50"
val b = ""

println(a.substring(1).replace(",", ""))
--> 1187.50

println(a.replace("?", "").replace(",", ""))
--> 1187.50

println(b.replace("?", "").replace(",", ""))
--> No error / output since neither '?' nor ',' exists.

 

 

On Feb 20, 2016, at 8:24 AM, Mich Talebzadeh <mich@peridale.co.uk> wrote:

 

 

I have a DF like below, reading a CSV file:

 

 

val df = HiveContext.read.format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("/data/stg/table2")

 

val a = df.map(x => (x.getString(0), x.getString(1),
  x.getString(2).substring(1).replace(",", "").toDouble,
  x.getString(3).substring(1).replace(",", "").toDouble,
  x.getString(4).substring(1).replace(",", "").toDouble))

 

 

For most rows read from the CSV file, the above mapping works fine. However, at the bottom
of the CSV there are a couple of rows with empty columns, as below:

 

[421,02/10/2015,?1,187.50,?237.50,?1,425.00]

[,,,,]

[Net income,,?182,531.25,?14,606.25,?197,137.50]

[,,,,]

[year 2014,,?113,500.00,?0.00,?113,500.00]

[Year 2015,,?69,031.25,?14,606.25,?83,637.50]

 

However, I get 

 

a.collect.foreach(println)

16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0 (TID 161)

java.lang.StringIndexOutOfBoundsException: String index out of range: -1

 

I suspect the cause is the substring operation, say x.getString(2).substring(1), on empty values,
which according to the web will throw this type of error.
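
For example, in the Scala shell:

scala> "".substring(1)
java.lang.StringIndexOutOfBoundsException: String index out of range: -1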

 

 

The easiest solution seems to be to check whether x above is not null and only then do the
substring operation. Can this be done without using a UDF?

 

Thanks

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 


 

 

