flink-user mailing list archives

From Till Rohrmann <trohrm...@apache.org>
Subject Re: Reading multiple datasets with one read operation
Date Thu, 22 Oct 2015 09:48:24 GMT
Hi Pieter,

At the moment, Flink does not support partitioning a `DataSet` into multiple
subsets in a single pass over it. If you really want a distinct data set for
each path, then as far as I know you have to apply one filter per path.
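
The filter-per-path approach can be sketched in plain Java as follows. This is
only an illustration under assumptions: a `List` stands in for the `DataSet`,
and the `TaggedSplit` class, the `Tagged` record type, and the `select` helper
are hypothetical names, not part of Flink. In a Flink program, `select` would
roughly correspond to a `filter` followed by a `map` on a `DataSet` of tagged
tuples.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TaggedSplit {

    // Hypothetical record shape: (path, value), mirroring the Tuple2 idea
    // from the quoted message, where the first field names the target set.
    static final class Tagged {
        final String path;
        final String value;

        Tagged(String path, String value) {
            this.path = path;
            this.value = value;
        }
    }

    // One filter per path: keep the records tagged with `path`,
    // then drop the tag so only the values remain.
    static List<String> select(List<Tagged> records, String path) {
        return records.stream()
                .filter(r -> r.path.equals(path))
                .map(r -> r.value)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The input is produced once; each path then gets its own filter.
        List<Tagged> records = Arrays.asList(
                new Tagged("a", "1"),
                new Tagged("b", "2"),
                new Tagged("a", "3"));

        System.out.println(select(records, "a"));
        System.out.println(select(records, "b"));
    }
}
```

Each call to `select` scans all tagged records again, which is the per-path
filtering cost mentioned in the quoted message.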

Cheers,
Till

On Thu, Oct 22, 2015 at 11:38 AM, Pieter Hameete <phameete@gmail.com> wrote:

> Good morning!
>
> I have the following use case:
>
> My program reads nested data (in this specific case XML) based on
> projections (path expressions) of this data. Often multiple paths are
> projected onto the same input. I would like each path to result in its own
> dataset.
>
> Is it possible to generate more than 1 dataset using a readFile operation
> to prevent reading the input twice?
>
> I have thought about a workaround where the InputFormat would return
> Tuple2s and the first field is the name of the dataset to which a record
> belongs. This would, however, require me to filter the read data once for
> each dataset or to do a groupReduce, which is overhead I'm looking to
> avoid.
>
> Is there a better (less overhead) workaround for doing this? Or is there
> some mechanism in Flink that would allow me to do this?
>
> Cheers!
>
> - Pieter
>
