flink-user mailing list archives

From Stephan Ewen <se...@apache.org>
Subject Re: Left join with unbalanced dataset
Date Sun, 31 Jan 2016 19:57:17 GMT
Hi!

YARN killing the application seems strange. The memory use that YARN sees
should not change even when one node gets a lot of data.

Can you share what version of Flink (plus commit hash) you are using and
whether you use off-heap memory or not?

Thanks,
Stephan


On Sun, Jan 31, 2016 at 10:47 AM, Till Rohrmann <trohrmann@apache.org>
wrote:

> Hi Arnaud,
>
> the unmatched elements of A will only end up on the same worker node if
> they all share the same key. Otherwise, they will be evenly spread out
> across your cluster. However, I would also recommend using Flink's
> leftOuterJoin.
>
> Cheers,
> Till
>
> On Sun, Jan 31, 2016 at 5:27 AM, Chiwan Park <chiwanpark@apache.org>
> wrote:
>
>> Hi Arnaud,
>>
>> To join two datasets, the community recommends using the join operation
>> rather than the cogroup operation. For a left join, you can use the
>> leftOuterJoin method. Flink’s optimizer decides the distributed join
>> execution strategy using statistics about the datasets, such as their
>> size. Additionally, you can set a join hint to help the optimizer decide
>> on the strategy.
>>
>> The transformations section [1] of the Flink documentation describes the
>> outer join operation in detail.
>>
>> I hope this helps.
>>
>> [1]:
>> https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/programming_guide.html#transformations
>>
>> Regards,
>> Chiwan Park
>>
>> > On Jan 30, 2016, at 6:43 PM, LINZ, Arnaud <ALINZ@bouyguestelecom.fr>
>> wrote:
>> >
>> > Hello,
>> >
>> > I have a very big dataset A to left join with a dataset B that is half
>> its size. That is to say, half of A's records will be matched with one
>> record of B, and the other half with null values.
>> >
>> > I used a CoGroup for that, but my batch fails because YARN kills the
>> container due to memory problems.
>> >
>> > I guess that’s because one worker will get half of dataset A (the
>> unmatched records), and that’s too much for a single JVM.
>> >
>> > Am I right in my diagnosis? Is there a better way to left join
>> unbalanced datasets?
>> >
>> > Best regards,
>> >
>> > Arnaud
>> >
>> >
>> >
>> > The integrity of this message cannot be guaranteed on the Internet. The
>> company that sent this message cannot therefore be held liable for its
>> content nor attachments. Any unauthorized use or dissemination is
>> prohibited. If you are not the intended recipient of this message, then
>> please delete it and notify the sender.
>>
>>
>
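
To make the two points in the thread concrete (unmatched records only pile up on one worker if they share a key, and a left outer join keeps every record of A with null for a missing match in B), here is a minimal sketch in plain Java. It does not use Flink at all. The class LeftJoinSketch, its methods, and the modulo-hash assignment of keys to workers are hypothetical illustrations and should not be read as Flink's actual partitioner or join implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class LeftJoinSketch {

    // Hypothetical stand-in for hash partitioning: maps a key to a worker index.
    static int workerFor(int key, int workers) {
        return Math.floorMod(Integer.hashCode(key), workers);
    }

    // Counts how many records each worker would receive for the given keys.
    static int[] distribution(int[] keys, int workers) {
        int[] counts = new int[workers];
        for (int key : keys) {
            counts[workerFor(key, workers)]++;
        }
        return counts;
    }

    // Left outer join on key: one output per record of A,
    // holding B's value, or null when the key has no match in B.
    static String[] leftOuterJoin(int[] aKeys, Map<Integer, String> b) {
        String[] joined = new String[aKeys.length];
        for (int i = 0; i < aKeys.length; i++) {
            joined[i] = b.get(aKeys[i]);
        }
        return joined;
    }

    // Test data: n records with keys 0..n-1.
    static int[] distinctKeys(int n) {
        int[] keys = new int[n];
        for (int i = 0; i < n; i++) {
            keys[i] = i;
        }
        return keys;
    }

    // Test data: n records that all carry the same key.
    static int[] sameKey(int key, int n) {
        int[] keys = new int[n];
        Arrays.fill(keys, key);
        return keys;
    }

    public static void main(String[] args) {
        // Distinct keys: the records are spread over all four workers.
        System.out.println(Arrays.toString(distribution(distinctKeys(10_000), 4)));
        // One shared key: every record lands on a single worker.
        System.out.println(Arrays.toString(distribution(sameKey(42, 10_000), 4)));

        // B holds only the even keys, so half of A stays unmatched.
        Map<Integer, String> b = new HashMap<>();
        for (int key = 0; key < 10; key += 2) {
            b.put(key, "b" + key);
        }
        System.out.println(Arrays.toString(leftOuterJoin(distinctKeys(10), b)));
    }
}
```

In this sketch the unmatched half of A still carries its own distinct keys, so it is distributed like any other records. Only the sameKey case concentrates everything on one worker.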
