beam-user mailing list archives

From Ismaël Mejía <ieme...@gmail.com>
Subject Re: Issue with GroupByKey in BeamSql using SparkRunner
Date Wed, 10 Oct 2018 09:16:49 GMT
Are you trying this on a particular Spark distribution or just locally?
I ask this because there was a data corruption issue with Spark 2.3.1
(previous version used by Beam)
https://issues.apache.org/jira/browse/SPARK-23243

Current Beam master (and next release) moves Spark to version 2.3.2
and that should fix some of the data correctness issues (maybe yours
too).
Can you give it a try and report back whether this fixes your issue?


On Tue, Oct 9, 2018 at 6:45 PM Vishwas Bm <bmvishwas@gmail.com> wrote:
>
> Hi Kenn,
>
> We are using Beam 2.6 and submitting jobs with spark-submit to a Spark 2.2 cluster on Kubernetes.
>
>
> On Tue, Oct 9, 2018, 9:29 PM Kenneth Knowles <kenn@apache.org> wrote:
>>
>> Thanks for the report! I filed https://issues.apache.org/jira/browse/BEAM-5690 to track the issue.
>>
>> Can you share what version of Beam you are using?
>>
>> Kenn
>>
>> On Tue, Oct 9, 2018 at 3:18 AM Vishwas Bm <bmvishwas@gmail.com> wrote:
>>>
>>> We are trying to set up a pipeline using BeamSql. The trigger used is the default (AfterWatermark, firing once the watermark passes the end of the window).
>>> Below is the pipeline:
>>>
>>>    KafkaSource (KafkaIO) ---> Windowing (FixedWindow 1min) ---> BeamSql ---> KafkaSink (KafkaIO)
>>>
>>> We are using Spark Runner for this.
>>> The BeamSql query is:
>>>              select Col3, count(*) as count_col1 from PCOLLECTION GROUP BY Col3
>>>
>>> We are grouping by Col3, which is a string. It can hold the values string[0-9].
>>>
>>> The records are emitted to the Kafka sink every 1 min, but the output records in Kafka are not as expected.
>>> Below is the output observed (WST and WET indicate the window start time and window end time):
>>>
>>> {"count_col1":1,"Col3":"string5","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>> {"count_col1":3,"Col3":"string7","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>> {"count_col1":2,"Col3":"string8","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>> {"count_col1":1,"Col3":"string2","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>> {"count_col1":1,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
>>>
>>> We ran the same pipeline using the direct and Flink runners and we don't see 0 entries for count_col1.
>>>
>>> As per the Beam capability matrix page (https://beam.apache.org/documentation/runners/capability-matrix/#cap-summary-what), GroupBy is not fully supported. Is this one of those cases?
>>> Thanks & Regards,
>>> Vishwas
>>>
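
For anyone retesting against a newer Spark version, a small script may help check the Kafka output for the symptom reported above. This is only a sketch under stated assumptions: it assumes each record is one JSON object per line with the fields shown in the report (count_col1, Col3, WST, WET), and that with the default trigger each (window, Col3) group should appear once with a count of at least 1. The names SAMPLE and find_anomalies are made up for illustration and are not part of Beam or Kafka; SAMPLE is a subset of the records quoted above.

```python
import json
from collections import defaultdict

# Subset of the records from the report, one JSON object per line.
SAMPLE = """\
{"count_col1":1,"Col3":"string5","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
{"count_col1":3,"Col3":"string7","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
{"count_col1":1,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
{"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
{"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00 0000  +0000","WET":"2018-10-09  09-56-00 0000  +0000"}
"""


def find_anomalies(lines):
    """Group output rows by (WST, WET, Col3) and flag suspicious groups.

    Returns two dicts keyed by (WST, WET, Col3):
      zeros      -> number of rows in the group whose count is 0
      duplicates -> list of counts for groups emitted more than once
    """
    counts_by_group = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        group = (rec["WST"], rec["WET"], rec["Col3"])
        counts_by_group[group].append(rec["count_col1"])

    # A GROUP BY count over non-empty groups should never be 0.
    zeros = {g: c.count(0) for g, c in counts_by_group.items() if 0 in c}
    # Assumption: one output row per group and window with the default trigger.
    duplicates = {g: c for g, c in counts_by_group.items() if len(c) > 1}
    return zeros, duplicates


if __name__ == "__main__":
    zeros, duplicates = find_anomalies(SAMPLE.splitlines())
    for group, n in zeros.items():
        print("zero-count rows:", n, group)
    for group, counts in duplicates.items():
        print("repeated group:", group, counts)
```

Running it over the output of the direct or Flink runner should report nothing if those runners behave as described in the report.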
