spark-issues mailing list archives

From Ignacio Gómez (JIRA) <j...@apache.org>
Subject [jira] [Updated] (SPARK-25996) Aggregations do not return the correct values for rows with equal timestamps
Date Fri, 09 Nov 2018 21:38:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-25996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ignacio Gómez updated SPARK-25996:
----------------------------------
    Description: 
Hi all,

Using PySpark, I count, for each row, the records whose timestamp falls within the 5 seconds
preceding the current row's timestamp, including the current row itself, with the corresponding query:

query = """
 select *, count(*) over (partition by ACCOUNTID
 order by TS
 range between interval 5000 milliseconds preceding and current row) as total_count
 from df3
 """
 df3 = sqlContext.sql(query)

which returns the following:

 
|ACCOUNTID|AMOUNT|TS|total_count|
|1|100|2018-01-01 00:00:01|1|
|1|1000|2018-01-01 10:00:01|1|
|1|25|2018-01-01 10:00:02|2|
|1|500|2018-01-01 10:00:03|3|
|1|100|2018-01-01 10:00:04|4|
|1|80|2018-01-01 10:00:05|5|
|1|700|2018-01-01 11:00:04|1|
|1|205|2018-01-02 10:00:02|1|
|1|500|2018-01-02 10:00:03|2|
|3|80|2018-01-02 10:00:05|1|

 

As you can see, in the third row total_count should be 3 instead of 2, because there
are 2 previous records, not 1. The error then propagates through the following rows.
The same happens with the other aggregation operations.

Even though the first rows share the same date, that does not mean only one of them exists;
both rows should be counted, rather than treating the last row with a given timestamp as the
only one that exists.
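For reference, the frame in the query above can be sketched in pure Python, with no Spark needed: for each row, count the rows in the same ACCOUNTID partition whose TS falls within the 5 seconds up to and including the current row's TS. The data is copied from the table in this report; this is an illustrative sketch of the frame semantics, not Spark's implementation.

```python
from datetime import datetime, timedelta

# (ACCOUNTID, AMOUNT, TS) rows, copied from the table above.
rows = [
    (1, 100,  "2018-01-01 00:00:01"),
    (1, 1000, "2018-01-01 10:00:01"),
    (1, 25,   "2018-01-01 10:00:02"),
    (1, 500,  "2018-01-01 10:00:03"),
    (1, 100,  "2018-01-01 10:00:04"),
    (1, 80,   "2018-01-01 10:00:05"),
    (1, 700,  "2018-01-01 11:00:04"),
    (1, 205,  "2018-01-02 10:00:02"),
    (1, 500,  "2018-01-02 10:00:03"),
    (3, 80,   "2018-01-02 10:00:05"),
]

def window_counts(rows, window=timedelta(milliseconds=5000)):
    """COUNT(*) OVER (PARTITION BY accountid ORDER BY ts
    RANGE BETWEEN INTERVAL 5000 MILLISECONDS PRECEDING AND CURRENT ROW),
    computed naively: count rows of the same account with ts in [ts - 5s, ts]."""
    parsed = [(acc, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"))
              for acc, _, ts in rows]
    return [sum(1 for a2, t2 in parsed if a2 == acc and ts - window <= t2 <= ts)
            for acc, ts in parsed]

print(window_counts(rows))
```

Running this against the exact timestamps printed in the table gives a baseline to compare Spark's total_count column against row by row.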

 

Could you help me?

Thank you

  was:
Hi,

Using PySpark, I count, for each row, the records whose timestamp falls within the 5 seconds
preceding the current row's timestamp, including the current row itself, with the corresponding query:

query = """
 select *, count(*) over (partition by ACCOUNTID
 order by TS
 range between interval 5000 milliseconds preceding and current row) as total_count
 from df3
 """
 df3 = sqlContext.sql(query)

which returns the following:

 
|ACCOUNTID|AMOUNT|TS|total_count|
|1|100|2018-01-01 00:00:01|1|
|1|1000|2018-01-01 10:00:01|1|
|1|25|2018-01-01 10:00:02|2|
|1|500|2018-01-01 10:00:03|3|
|1|100|2018-01-01 10:00:04|4|
|1|80|2018-01-01 10:00:05|5|
|1|700|2018-01-01 11:00:04|1|
|1|205|2018-01-02 10:00:02|1|
|1|500|2018-01-02 10:00:03|2|
|3|80|2018-01-02 10:00:05|1|

As you can see, in the third row total_count should be 3 instead of 2, because there
are 2 previous records, not 1. The error then propagates through the following rows.
 The same happens with the other aggregation operations.

Even though the first rows share the same date, that does not mean only one of them exists;
both rows should be counted, rather than treating the last row with a given timestamp as the
only one that exists.

 

Could you help me?

Thank you very much


> Aggregations do not return the correct values for rows with equal timestamps
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-25996
>                 URL: https://issues.apache.org/jira/browse/SPARK-25996
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1, 2.4.0
>         Environment: Windows 10
> PyCharm 2018.2.2
> Python 3.6
>  
>            Reporter: Ignacio Gómez
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

