spark-issues mailing list archives

From "Maciej Szymkiewicz (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-10467) Vector is converted to tuple when extracted from Row using __getitem__
Date Mon, 07 Sep 2015 02:36:45 GMT

     [ https://issues.apache.org/jira/browse/SPARK-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Szymkiewicz updated SPARK-10467:
---------------------------------------
    Description: 
If we take a row from a data frame and try to extract a vector element by index, it is converted
to a tuple:

{code}
from pyspark.ml.feature import HashingTF

df = sqlContext.createDataFrame([(["foo", "bar"], )], ("keys", ))
transformer = HashingTF(inputCol="keys", outputCol="vec", numFeatures=5)
transformed = transformer.transform(df)
row = transformed.first()

row.vec # As expected
## SparseVector(5, {4: 2.0})

row[1]  # Returns tuple
## (0, 5, [4], [2.0]) 
{code}

The problem cannot be reproduced if we create and access a Row directly:

{code}
from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import Row

row = Row(vec=Vectors.sparse(3, [(0, 1)]))

row.vec
## SparseVector(3, {0: 1.0})

row[0]
## SparseVector(3, {0: 1.0})
{code}

but if we create a data frame from the row above, the problem reappears:

{code}
df = sqlContext.createDataFrame([row], ("vec", ))

df.first()[0]
## (0, 3, [0], [1.0])  
{code}
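Until this is fixed, accessing the column by field name ({{row.vec}}) works as shown above. Alternatively, the raw tuple can be decoded by hand. A minimal pure-Python sketch, assuming the {{(flag, size, indices, values)}} layout observed in the outputs above (the leading {{0}} appears to mark a sparse vector); the helper name is hypothetical and not part of PySpark:

```python
# Hypothetical helper, not part of PySpark: rebuilds a {index: value}
# mapping from the raw tuple returned by Row.__getitem__, assuming the
# (flag, size, indices, values) layout seen above, where flag 0 seems
# to mark a sparse vector.
def sparse_tuple_to_entries(raw):
    flag, size, indices, values = raw
    if flag != 0:
        raise ValueError("expected sparse marker 0, got %r" % flag)
    return size, dict(zip(indices, values))

size, entries = sparse_tuple_to_entries((0, 5, [4], [2.0]))
## size == 5, entries == {4: 2.0}, matching SparseVector(5, {4: 2.0})
```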


> Vector is converted to tuple when extracted from Row using __getitem__
> ----------------------------------------------------------------------
>
>                 Key: SPARK-10467
>                 URL: https://issues.apache.org/jira/browse/SPARK-10467
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark, SQL
>    Affects Versions: 1.4.1
>            Reporter: Maciej Szymkiewicz
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

