flink-user mailing list archives

From Gábor Gévay <gga...@gmail.com>
Subject Re: Reading multiple datasets with one read operation
Date Thu, 22 Oct 2015 09:49:12 GMT
Hello!

> I have thought about a workaround where the InputFormat would return
> Tuple2s and the first field is the name of the dataset to which a record
> belongs. This would however require me to filter the read data once for
> each dataset or to do a groupReduce, which is some overhead I'm
> looking to prevent.

I think those two filters will probably add less overhead than you
expect, because of two optimizations Flink does under the hood:
- The dataset of Tuple2s won't be materialized, but instead will be
streamed directly to the two filter operators.
- The input format and the two filters will probably end up on the
same machine, because of chaining, so there won't be
serialization/deserialization between them.
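
To make the data flow concrete, here is a minimal plain-Java sketch of
the tag-and-filter idea. It is only a model and does not use Flink: every
name in it (TaggedSplitSketch, Tagged, readOnce, filterFor, split) is
made up for illustration. In a real job the tagged records would be the
Tuple2s produced by your InputFormat, and each consumer would be a filter
on that dataset. The sketch assumes the tag is a plain string and that
each record belongs to exactly one dataset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Plain-Java model of the tagged-record workaround. Nothing here is Flink API;
// every name is hypothetical and only illustrates the data flow.
class TaggedSplitSketch {

    // Stand-in for a Tuple2<String, String>: (dataset name, record value).
    static final class Tagged {
        final String dataset;
        final String value;

        Tagged(String dataset, String value) {
            this.dataset = dataset;
            this.value = value;
        }
    }

    // Stand-in for the input format: scans the input once and hands each
    // record to every downstream operator, without buffering the whole input.
    // Returns the number of records read.
    static int readOnce(List<Tagged> input, List<Consumer<Tagged>> downstream) {
        int read = 0;
        for (Tagged record : input) {
            read++;
            for (Consumer<Tagged> operator : downstream) {
                operator.accept(record);
            }
        }
        return read;
    }

    // Stand-in for one filter operator: keeps only records tagged with
    // the given dataset name and appends their values to the sink.
    static Consumer<Tagged> filterFor(String datasetName, List<String> sink) {
        return record -> {
            if (record.dataset.equals(datasetName)) {
                sink.add(record.value);
            }
        };
    }

    // Builds one filter per dataset name, then performs a single scan.
    static Map<String, List<String>> split(List<Tagged> input, String... datasetNames) {
        Map<String, List<String>> results = new LinkedHashMap<>();
        List<Consumer<Tagged>> filters = new ArrayList<>();
        for (String name : datasetNames) {
            List<String> sink = new ArrayList<>();
            results.put(name, sink);
            filters.add(filterFor(name, sink));
        }
        readOnce(input, filters);
        return results;
    }

    public static void main(String[] args) {
        // Made-up records tagged with made-up projection names.
        List<Tagged> input = Arrays.asList(
                new Tagged("titles", "t1"),
                new Tagged("authors", "a1"),
                new Tagged("titles", "t2"));
        System.out.println(split(input, "titles", "authors"));
    }
}
```

The per-record cost of each extra dataset in this model is one string
comparison, and the input is scanned a single time no matter how many
filters are attached.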

Best,
Gabor



2015-10-22 11:38 GMT+02:00 Pieter Hameete <phameete@gmail.com>:
> Good morning!
>
> I have the following use case:
>
> My program reads nested data (in this specific case XML) based on
> projections (path expressions) of this data. Often multiple paths are
> projected onto the same input. I would like each path to result in its own
> dataset.
>
> Is it possible to generate more than 1 dataset using a readFile operation to
> prevent reading the input twice?
>
> I have thought about a workaround where the InputFormat would return Tuple2s
> and the first field is the name of the dataset to which a record belongs.
> This would however require me to filter the read data once for each dataset
> or to do a groupReduce, which is some overhead I'm looking to prevent.
>
> Is there a better (less overhead) workaround for doing this? Or is there
> some mechanism in Flink that would allow me to do this?
>
> Cheers!
>
> - Pieter
