spark-issues mailing list archives

From "Harry Brundage (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-6917) Broken data returned to PySpark dataframe if any large numbers used in Scala land
Date Thu, 23 Apr 2015 00:48:38 GMT

     [ https://issues.apache.org/jira/browse/SPARK-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harry Brundage updated SPARK-6917:
----------------------------------
    Description: 
When reading data from a Parquet file with an INT96 column (that is, a TimestampType() column encoded for Impala), including the INT96 column in the fetched data causes other, smaller numeric types to come back broken.

{code}
In [1]: sql.parquetFile("/Users/hornairs/Downloads/part-r-00001.parquet").select('int_col', 'long_col').first()
Out[1]: Row(int_col=Decimal('1'), long_col=Decimal('10'))

In [2]: sql.parquetFile("/Users/hornairs/Downloads/part-r-00001.parquet").first()
Out[2]: Row(long_col={u'__class__': u'scala.runtime.BoxedUnit'}, str_col=u'Hello!', int_col={u'__class__': u'scala.runtime.BoxedUnit'}, date_col=datetime.datetime(1, 12, 31, 19, 0, tzinfo=<DstTzInfo 'America/Toronto' EDT-1 day, 19:00:00 DST>))
{code}

Note the {{\{u'__class__': u'scala.runtime.BoxedUnit'}}} values returned for the {{int_col}} and {{long_col}} columns in the second query above. This only happens if I select the {{date_col}} column, which is stored as {{INT96}}.

I don't know much about Scala boxing, but I assume that including numeric columns wider than a machine word triggers some different, slower execution path that boxes the values and causes this problem.
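As a plain-Python illustration (no Spark required), the corrupted values can at least be detected on the driver side by checking collected rows for the placeholder dict shown in the output above. The {{broken_fields}} helper and the sample row below are hypothetical, written only to mirror the {{Out[2]}} result:

```python
# The placeholder dict that appears in place of real numeric values,
# as seen in the Out[2] output above.
BOXED_UNIT = {u'__class__': u'scala.runtime.BoxedUnit'}

def broken_fields(row):
    """Return the sorted names of fields whose value is the BoxedUnit placeholder."""
    return sorted(name for name, value in row.items() if value == BOXED_UNIT)

# Hypothetical row mirroring the broken Out[2] result (date_col omitted).
row = {
    u'int_col': {u'__class__': u'scala.runtime.BoxedUnit'},
    u'long_col': {u'__class__': u'scala.runtime.BoxedUnit'},
    u'str_col': u'Hello!',
}
print(broken_fields(row))  # ['int_col', 'long_col']
```

This is only a detection sketch for debugging; it does not recover the original values.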

If anyone could give me any pointers on where to get started fixing this I'd be happy to dive
in!

  was:
When trying to access data stored in a Parquet file with an INT96 column (read: TimestampType()
encoded for Impala), if the INT96 column is included in the fetched data, other, smaller numeric
types come back broken.

{code}
In [1]: sql.sql.parquetFile("/Users/hornairs/Downloads/part-r-00001.parquet").select('int_col', 'long_col').first()
Out[1]: Row(int_col=Decimal('1'), long_col=Decimal('10'))

In [2]: sql.parquetFile("/Users/hornairs/Downloads/part-r-00001.parquet").first()
Out[2]: Row(long_col={u'__class__': u'scala.runtime.BoxedUnit'}, str_col=u'Hello!', int_col={u'__class__': u'scala.runtime.BoxedUnit'}, date_col=datetime.datetime(1, 12, 31, 19, 0, tzinfo=<DstTzInfo 'America/Toronto' EDT-1 day, 19:00:00 DST>))
{code}

Note the {{\{u'__class__': u'scala.runtime.BoxedUnit'}}} values being returned for the {{int_col}} and {{long_col}} columns in the second query above. This only happens if I select the {{date_col}} which is stored as {{INT96}}.

I don't know much about Scala boxing, but I assume that somehow by including numeric columns that are bigger than a machine word I trigger some different, slower execution path somewhere that boxes stuff and causes this problem.

If anyone could give me any pointers on where to get started fixing this I'd be happy to dive
in!


> Broken data returned to PySpark dataframe if any large numbers used in Scala land
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-6917
>                 URL: https://issues.apache.org/jira/browse/SPARK-6917
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.3.0
>         Environment: Spark 1.3, Python 2.7.6, Scala 2.10
>            Reporter: Harry Brundage
>         Attachments: part-r-00001.parquet
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

