spark-dev mailing list archives

From Wenchen Fan <cloud0...@gmail.com>
Subject Re: Distinct on Map data type -- SPARK-19893
Date Sat, 13 Jan 2018 03:36:09 GMT
Actually, Spark 2.1.0 doesn't work for your case; it may give you a wrong
result.
We are still working on adding this feature, but until then we should
fail early instead of returning a wrong result.
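
A plausible reading of "wrong result" (this is an illustrative sketch, not
Spark internals): if rows are deduplicated by an order-sensitive encoding of
the map, two logically equal maps whose entries sit in different order can
survive as two "distinct" rows. Notice that the output quoted below returns
three rows, including {"chaka":103}, which was never inserted. The plain
Python below mimics only the order-sensitivity problem:

```python
# Sketch (plain Python, not Spark code): why order-sensitive dedup on
# maps can be wrong. Two dicts with the same entries are logically equal,
# but an encoding that preserves entry order tells them apart.

m1 = {"chaka": 102, "mangaa": 103}
m2 = {"mangaa": 103, "chaka": 102}  # same entries, different order

assert m1 == m2  # logically equal

# Stand-in for an order-sensitive binary encoding of the map:
enc1 = tuple(m1.items())
enc2 = tuple(m2.items())
assert enc1 != enc2  # an encoding-based "distinct" keeps both rows

# Deduplicating on the encoding yields 2 "distinct" values for 1 map:
assert len({enc1, enc2}) == 2
```

Failing fast, as the reply suggests, avoids silently returning such results.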

On Sat, Jan 13, 2018 at 11:02 AM, ckhari4u <ckhari4u@gmail.com> wrote:

> I see SPARK-19893 is backported to Spark 2.1 and 2.0.1 as well. I do not
> see a clear justification for why SPARK-19893 is important and needed. I
> have a sample table which works fine with an earlier build of Spark 2.1.0.
> Now that the latest build has the backport of SPARK-19893, it's failing
> with this error:
>
> Error in query: Cannot have map type columns in DataFrame which calls set
> operations(intersect, except, etc.), but the type of column metrics is
> map<string,int>;;
> Distinct
>
>
> *In Old Build of Spark 2.1.0, I tried the below:*
>
>
> create TABLE map_demo2
> (
> country_id BIGINT,
> metrics MAP <STRING, int>
> );
>
> insert into table map_demo2 select 2,map("chaka",102) ;
> insert into table map_demo2 select 3,map("chaka",102) ;
> insert into table map_demo2 select 4,map("mangaa",103) ;
>
>
> spark-sql> select distinct metrics from map_demo2;
> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8501 milliseconds to
> create the Initialization Vector used by CryptoStream
> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8503 milliseconds to
> create the Initialization Vector used by CryptoStream
> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8497 milliseconds to
> create the Initialization Vector used by CryptoStream
> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8496 milliseconds to
> create the Initialization Vector used by CryptoStream
> {"mangaa":103}
> {"chaka":102}
> {"chaka":103}
> Time taken: 15.331 seconds, Fetched 3 row(s)
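
One workaround pattern (hypothetical, not from the thread: canonicalize each
map to a sorted list of entries, deduplicate on that, then rebuild), sketched
here in plain Python rather than Spark SQL:

```python
# Hypothetical workaround sketch in plain Python: make each map
# order-insensitive by sorting its (key, value) pairs, deduplicate on
# the sorted form, then rebuild the maps.

rows = [
    {"chaka": 102},
    {"chaka": 102},   # duplicate of the first row
    {"mangaa": 103},
]

# Sorted tuples are hashable and order-insensitive, so they can key a set:
canonical = {tuple(sorted(m.items())) for m in rows}
distinct = [dict(entries) for entries in canonical]

assert len(distinct) == 2
```

The SQL analogue would be to project the map into a sorted array of entries
(or some other canonical, comparable form) before applying DISTINCT.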
>
> Here the simple DISTINCT query works fine in Spark. Any thoughts on why
> the DISTINCT/EXCEPT/INTERSECT operators are not supported on map data
> types? The PR says:
>
> // TODO: although map type is not orderable, technically map type should
> // be able to be used in equality comparison, remove this type check once
> // we support it.
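
The quoted TODO separates two capabilities: equality comparison (which maps
could support) and ordering (which they do not), and sort-based operators
need the latter. Python dicts make the same distinction, as this small
sketch shows:

```python
# Dicts, like SQL maps, support equality but define no total order,
# so any operator that must sort its input cannot handle them.

a = {"chaka": 102}
b = {"mangaa": 103}

assert a != b            # equality comparison works

try:
    a < b                # ordering comparison is not defined for dicts
    ordered = True
except TypeError:
    ordered = False

assert not ordered       # comparing dicts with < raises TypeError
```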
>
> I could not figure out what issue is caused by using the aforementioned
> operators on map columns.
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>
