spark-user mailing list archives

From SLiZn Liu <sliznmail...@gmail.com>
Subject Re: Imported CSV file content isn't identical to the original file
Date Mon, 08 Feb 2016 12:15:49 GMT
I’ve found the trigger for my issue: if I start spark-shell, or submit a job
with spark-submit, using --conf
spark.serializer=org.apache.spark.serializer.KryoSerializer, the DataFrame
content comes out wrong, as I described earlier.
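
For anyone who wants to compare on their side, here is a rough sketch of the check, not a confirmed diagnosis. It assumes a spark-shell session where sc and sqlContext are already defined, the spark-csv package is on the classpath, and the sample file sits at hdfs:///tmp/1.csv as in my original mail. The name df is only for illustration.

```scala
// Print the serializer this session is using. If spark.serializer was not
// set, the Java serializer class name is returned as the fallback value.
val serializer = sc.getConf.get(
  "spark.serializer", "org.apache.spark.serializer.JavaSerializer")
println(s"spark.serializer = $serializer")

// Same read as in the original report (trailing dots keep the chained
// calls in one statement when pasted into the REPL line by line).
val df = sqlContext.read.format("com.databricks.spark.csv").
  option("header", "false").
  option("inferSchema", "false").
  option("delimiter", " ").
  load("hdfs:///tmp/1.csv")

// Print the raw rows instead of calling show, so that show's default
// truncation of long cell values does not get in the way of the comparison.
df.collect().foreach(println)
```

Running this once in a shell started with the --conf above and once without it should make any difference between the two serializers easy to spot.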

On Mon, Feb 8, 2016 at 5:42 PM SLiZn Liu <sliznmailbox@gmail.com> wrote:

> Thanks Luciano, now it looks like I’m the only one who has this issue. My
> options have narrowed down to upgrading Spark to 1.6.0 to see if the issue
> goes away.
>
> —
> Cheers,
> Todd Leo
>
>
> On Mon, Feb 8, 2016 at 2:12 PM Luciano Resende <luckbr1975@gmail.com>
> wrote:
>
>> I tried 1.5.0, 1.6.0, and 2.0.0 trunk with
>> com.databricks:spark-csv_2.10:1.3.0 and got the expected results: the
>> columns seem to be read properly.
>>
>> +----------+----------------------+
>> |C0        |C1                    |
>> +----------+----------------------+
>> |1446566430|2015-11-04<SP>00:00:30|
>> |1446566430|2015-11-04<SP>00:00:30|
>> |1446566430|2015-11-04<SP>00:00:30|
>> |1446566430|2015-11-04<SP>00:00:30|
>> |1446566430|2015-11-04<SP>00:00:30|
>> |1446566431|2015-11-04<SP>00:00:31|
>> |1446566431|2015-11-04<SP>00:00:31|
>> |1446566431|2015-11-04<SP>00:00:31|
>> |1446566431|2015-11-04<SP>00:00:31|
>> |1446566431|2015-11-04<SP>00:00:31|
>> +----------+----------------------+
>>
>>
>>
>>
>> On Sat, Feb 6, 2016 at 11:44 PM, SLiZn Liu <sliznmailbox@gmail.com>
>> wrote:
>>
>>> Hi Spark Users Group,
>>>
>>> I have a CSV file to analyze with Spark, but I’m having trouble
>>> importing it as a DataFrame.
>>>
>>> Here’s a minimal reproducible example. Suppose I have a
>>> *10(rows)x2(cols)* *space-delimited csv* file, as shown below:
>>>
>>> 1446566430 2015-11-04<SP>00:00:30
>>> 1446566430 2015-11-04<SP>00:00:30
>>> 1446566430 2015-11-04<SP>00:00:30
>>> 1446566430 2015-11-04<SP>00:00:30
>>> 1446566430 2015-11-04<SP>00:00:30
>>> 1446566431 2015-11-04<SP>00:00:31
>>> 1446566431 2015-11-04<SP>00:00:31
>>> 1446566431 2015-11-04<SP>00:00:31
>>> 1446566431 2015-11-04<SP>00:00:31
>>> 1446566431 2015-11-04<SP>00:00:31
>>>
>>> The <SP> in column 2 represents a sub-delimiter within that column. The
>>> file is stored on HDFS; let’s say the path is hdfs:///tmp/1.csv
>>>
>>> I’m using *spark-csv* to import this file as a Spark *DataFrame*:
>>>
>>> sqlContext.read.format("com.databricks.spark.csv")
>>>         .option("header", "false") // The file has no header line
>>>         .option("inferSchema", "false") // Do not infer data types; read every column as a string
>>>         .option("delimiter", " ")
>>>         .load("hdfs:///tmp/1.csv")
>>>         .show
>>>
>>> Oddly, the output shows only a part of each column:
>>>
>>> [image: Screenshot from 2016-02-07 15-27-51.png]
>>>
>>> and even the table borders weren’t drawn correctly. I also tried
>>> another way of reading the csv file, with sc.textFile(...).map(_.split(" "))
>>> and sqlContext.createDataFrame, and the result was the same. Can someone
>>> point out what I did wrong?
>>>
>>> —
>>> BR,
>>> Todd Leo
>>>
>>
>>
>>
>> --
>> Luciano Resende
>> http://people.apache.org/~lresende
>> http://twitter.com/lresende1975
>> http://lresende.blogspot.com/
>>
>
