flink-dev mailing list archives

From Martin Neumann <mneum...@spotify.com>
Subject Re: how load/group with large csv files
Date Tue, 21 Oct 2014 13:02:09 GMT
I will go with that workaround, however I would have preferred if I could
have done that directly with the API instead of doing Map/Reduce like
Key/Value tuples again :-)

By the way, is there a simple function to count the number of items in a
reduce group? It feels stupid to write a GroupReduce that just iterates and
increments a counter.

cheers Martin
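
A minimal sketch of the counting question above, in plain Java rather than the Flink API. It assumes records are already split into String[] rows and counts per key by adding 1 for each record, so no group is iterated only to increment a counter. The class name GroupCountSketch and the method countPerKey are hypothetical. In a dataflow API the same idea is usually expressed by emitting (key, 1) pairs and summing them per group; whether a dedicated count function existed at the time is not stated in this thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Flink: counts records per key.
class GroupCountSketch {

    // Adds 1 to the running total of each record's key field.
    // A TreeMap keeps the keys in sorted order for stable output.
    static Map<String, Long> countPerKey(List<String[]> records, int keyField) {
        Map<String, Long> counts = new TreeMap<>();
        for (String[] record : records) {
            counts.merge(record[keyField], 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
            new String[] {"x", "a"},
            new String[] {"y", "b"},
            new String[] {"z", "a"});
        System.out.println(countPerKey(records, 1)); // prints {a=2, b=1}
    }
}
```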

On Tue, Oct 21, 2014 at 2:54 PM, Robert Metzger <rmetzger@apache.org> wrote:

> Yes, for sorted groups, you need to use Pojos or Tuples.
> I think you have to split the input lines manually, with a mapper.
> How about using a TupleN<...> with only the fields you need? (returned by
> the mapper)
>
> If you need all fields, you could also use a Tuple2<String, String[]> where
> the first position is the sort key?
>
>
>
> On Tue, Oct 21, 2014 at 2:20 PM, Gyula Fora <gyfora@apache.org> wrote:
>
> > I am not sure how you should go about that, let’s wait for some feedback
> > from the others.
> >
> > Until then you can always map the array to (array, keyfield) and use
> > groupBy(1).
> >
> >
> > > On 21 Oct 2014, at 14:17, Martin Neumann <mneumann@spotify.com> wrote:
> > >
> > > Hej,
> > >
> > > Unfortunately .sort() cannot take a key extractor, would I have to do the
> > > sort myself then?
> > >
> > > cheers Martin
> > >
> > > On Tue, Oct 21, 2014 at 2:08 PM, Gyula Fora <gyfora@apache.org> wrote:
> > >
> > >> Hey,
> > >>
> > >> Using arrays is probably a convenient way to do so.
> > >>
> > >> I think the way you described the groupBy only works for tuples now. To do
> > >> the grouping on the array field, you would need to create a key extractor
> > >> for this and pass that to groupBy.
> > >>
> > >> Actually we have some use-cases like this for streaming so we are thinking
> > >> of writing a wrapper for the array types that would behave as you described.
> > >>
> > >> Regards,
> > >> Gyula
> > >>
> > >>> On 21 Oct 2014, at 14:03, Martin Neumann <mneumann@spotify.com> wrote:
> > >>>
> > >>> Hej,
> > >>>
> > >>> I have a CSV file with 54 columns, each of them a string (for now). I need
> > >>> to group and sort them on field 15.
> > >>>
> > >>> What's the best way to load the data into Flink?
> > >>> There is no Tuple54 (and the <> would look awful anyway with 54 times
> > >>> String in it).
> > >>> My current idea is to write a Mapper and split the string into arrays of
> > >>> Strings. Would grouping and sorting work on this?
> > >>>
> > >>> So can I do something like this or does that only work on tuples:
> > >>> DataSet<String[]> ds;
> > >>> ds.groupBy(15).sort(20, ANY)
> > >>>
> > >>> cheers Martin
> > >>
> > >>
> >
> >
>
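
The workaround discussed in this thread (split each line in a mapper, then keep the sort key next to the full String[]) can be sketched in plain Java as follows. This is an illustration of the logic only and does not use the Flink API. The class CsvGroupSortSketch, the pair type KeyedRow (standing in for a Tuple2<String, String[]>) and the method names are hypothetical, and the comma split is naive: it assumes no quoted or escaped commas in the data.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not Flink code: group CSV rows on one field and
// sort each group on another, keeping all fields available.
class CsvGroupSortSketch {

    // Stands in for a Tuple2<String, String[]>: sort key plus the full row.
    static final class KeyedRow {
        final String sortKey;
        final String[] fields;

        KeyedRow(String sortKey, String[] fields) {
            this.sortKey = sortKey;
            this.fields = fields;
        }
    }

    // "Mapper" step: split the line manually and pull out the sort key.
    // The limit -1 keeps trailing empty columns.
    static KeyedRow toKeyedRow(String line, int sortField) {
        String[] fields = line.split(",", -1);
        return new KeyedRow(fields[sortField], fields);
    }

    // Groups rows on groupField, then sorts every group by its sort key.
    static Map<String, List<KeyedRow>> groupAndSort(
            List<String> lines, int groupField, int sortField) {
        Map<String, List<KeyedRow>> groups = new TreeMap<>();
        for (String line : lines) {
            KeyedRow row = toKeyedRow(line, sortField);
            groups.computeIfAbsent(row.fields[groupField], k -> new ArrayList<>())
                  .add(row);
        }
        for (List<KeyedRow> group : groups.values()) {
            group.sort(Comparator.comparing((KeyedRow r) -> r.sortKey));
        }
        return groups;
    }
}
```

For the file described in the thread, the call would be groupAndSort(lines, 15, 20), assuming zero-based field indexes.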
