spark-issues mailing list archives

From "zhaP524 (JIRA)" <>
Subject [jira] [Closed] (SPARK-21948) How to use Spark Streaming to process two tables from one Kafka topic?
Date Fri, 08 Sep 2017 03:24:00 GMT


zhaP524 closed SPARK-21948.

> How to use Spark Streaming to process two tables from one Kafka topic?
> ----------------------------------------------------------------------
>                 Key: SPARK-21948
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Spark Submit, Structured Streaming
>    Affects Versions: 2.1.1
>         Environment: kafka:0.10.0
> Spark:2.1.1
>            Reporter: zhaP524
>         Attachments: QQ图片20170908080946.png
>   Original Estimate: 12h
>  Remaining Estimate: 12h
> Now I have the following requirement: I want to receive data from a single Kafka topic with a single consumer group. The resulting DirectStream contains rows for two tables, A (master) and B (slave). My ultimate goal is to join the two tables on certain fields and write the results to a database.
> What I have tried so far:
> 1. Process each batch directly on the DirectStream: filter out the rows of A and B, build a DataFrame for each, join them with Spark SQL, and write the results to the database. The problem is that A and B have a 1:N relationship, so in a given batch all of A's rows may arrive while only part of B's rows do, and the join result is incomplete. By the time the next batch arrives, A's rows have already been consumed, so B's remaining rows have nothing to join against and data is lost. I wonder whether I could buffer table A's rows (in a queue or similar) until all of table B's matching rows have been loaded and processed, and only then produce the complete result; in essence this resembles a Spark Streaming window operation.
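The buffering idea in (1) can be sketched without Spark at all. The following plain-Python simulation (hypothetical row shapes, keyed by the join field) shows how keeping unmatched master rows in cross-batch state lets slave rows that arrive in later micro-batches still find their parent; in Spark Streaming itself, this kind of per-key state across batches is what `updateStateByKey`/`mapWithState` provide.

```python
from collections import defaultdict

class BufferedJoiner:
    """Buffer master (A) rows by join key so that slave (B) rows
    arriving in later micro-batches can still be joined."""
    def __init__(self):
        self.masters = {}                          # join_key -> master row
        self.pending_slaves = defaultdict(list)    # slaves seen before their master

    def process_batch(self, a_rows, b_rows):
        joined = []
        for key, a in a_rows:
            self.masters[key] = a
            # Flush any slave rows that arrived before this master.
            for b in self.pending_slaves.pop(key, []):
                joined.append((key, a, b))
        for key, b in b_rows:
            if key in self.masters:
                joined.append((key, self.masters[key], b))
            else:
                self.pending_slaves[key].append(b)
        return joined

# Batch 1: master k1 and one of its slaves arrive together.
j = BufferedJoiner()
out1 = j.process_batch(a_rows=[("k1", "A1")], b_rows=[("k1", "b1")])
# Batch 2: a later slave arrives after A's rows were already consumed;
# the buffered master still matches it, so nothing is lost.
out2 = j.process_batch(a_rows=[], b_rows=[("k1", "b2")])
```

A real implementation would also need an eviction policy (e.g. a state timeout) so that masters without further slaves do not accumulate forever; `mapWithState` supports exactly such timeouts.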
> 2. Use Spark Streaming window operations: I can separate the A and B rows from the DirectStream, but the join has to be carried out inside the window, and transformations such as map on the windowed stream fail because org.apache.kafka.clients.consumer.ConsumerRecord is not serializable, so the job cannot proceed:
> Serialization stack:
> 	- object not serializable (class: org.apache.kafka.clients.consumer.ConsumerRecord, value: ConsumerRecord(topic = doc, partition = 0,...))
> I don't know what I should do.
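The serialization failure in (2) is the well-known ConsumerRecord pitfall: the record object itself cannot be shipped across the cluster, so the usual workaround is to map each record to plain key/value data before any window or shuffle operation (e.g. `stream.map(record => (record.key, record.value))` in Scala). A small Spark-free Python sketch of the same principle, using a hypothetical stand-in record class:

```python
import pickle
import threading

class FakeConsumerRecord:
    """Stand-in for Kafka's ConsumerRecord: it carries a non-picklable
    handle, the way the real class holds non-serializable internals."""
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self._lock = threading.Lock()   # thread locks cannot be pickled

rec = FakeConsumerRecord("k1", "row-data")

# Shipping the record object itself across the cluster fails...
try:
    pickle.dumps(rec)
    serializable = True
except TypeError:
    serializable = False

# ...but extracting plain (key, value) pairs first works fine,
# which is why the map-to-values step must come before the window.
payload = pickle.dumps((rec.key, rec.value))
```

The Spark + Kafka 0.10 integration guide makes the same point: extract the fields you need from each record as early as possible, before any operation that requires serializing the stream's elements.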
> So, I was wondering if there was a problem with my scene?Or do I have a technical problem?Is
there a solution to my business scenario without introducing other components?Trouble to give

This message was sent by Atlassian JIRA

