flink-user mailing list archives

From Robert Metzger <rmetz...@apache.org>
Subject Re: Bigpetstore - Flink integration
Date Wed, 02 Sep 2015 13:33:58 GMT
Okay, I see.

As I said before, I was not able to reproduce the serialization issue
you've reported.
Can you maybe post the exception you are seeing?

On Wed, Sep 2, 2015 at 3:32 PM, jay vyas <jayunit100.apache@gmail.com>
wrote:

> Hey, thanks!
> Those are just seeds, the files aren't large.
> The scale out data is the transactions.
> The seed data needs to be the same, shipped to ALL nodes, and then
> the nodes generate transactions.
> On Wed, Sep 2, 2015 at 9:21 AM, Robert Metzger <rmetzger@apache.org>
> wrote:
>> I'm starting a new discussion thread for the bigpetstore-flink
>> integration ...
>> I took a closer look into the code you've posted.
>> It seems to me that you are generating a lot of data locally on the
>> client, before you actually submit a job to Flink. (Both "customers" and
>> "stores" are generated locally)
>> Is that only some "seed" data?
>> I would actually try to generate as much data as possible in the cluster,
>> making the generator very scalable.
>> I don't think that you need to register a Kryo serializer for the Product
>> and Transaction type.
>> I was able to run the code without the serializer registration.
>> ---------- Forwarded message ----------
>> From: jay vyas <jayunit100.apache@gmail.com>
>> Date: Wed, Sep 2, 2015 at 2:56 PM
>> Subject: Re: Hardware requirements and learning resources
>> To: user@flink.apache.org
>> We're also working on a bigpetstore implementation of flink which will
>> help onboard spark/mapreduce folks.
>> I have prototypical code here that runs a simple job in memory,
>> contributions welcome,
>> right now there is a serialization error
>> https://github.com/bigpetstore/bigpetstore-flink .
>> On Wed, Sep 2, 2015 at 8:50 AM, Robert Metzger <rmetzger@apache.org>
>> wrote:
>>> Hi Juan,
>>> I think the recommendations in the Spark guide are quite good, and are
>>> similar to what I would recommend for Flink as well.
>>> Depending on the workloads you are interested in running, you can certainly
>>> use Flink with less than 8 GB per machine. I think you can start Flink
>>> TaskManagers with 500 MB of heap space and they'll still be able to process
>>> some GB of data.
>>> Everything above 2 GB is probably good enough for some initial
>>> experimentation (again depending on your workloads, network, disk speed
>>> etc.)
>>> On Wed, Sep 2, 2015 at 2:30 PM, Kostas Tzoumas <ktzoumas@apache.org>
>>> wrote:
>>>> Hi Juan,
>>>> Flink is quite nimble with hardware requirements; people have run it on
>>>> old-ish laptops and also on the largest instances available from cloud
>>>> providers. I will let others chime in with more details.
>>>> I am not aware of something along the lines of a cheatsheet that you
>>>> mention. If you actually try to do this, I would love to see it, and it
>>>> might be useful to others as well. Both use similar abstractions at the API
>>>> level (i.e., parallel collections), so if you stay true to the functional
>>>> paradigm and do not try to "abuse" the system by exploiting knowledge of its
>>>> internals, things should be straightforward. These apply to the batch APIs;
>>>> the streaming API in Flink follows a true streaming paradigm, where you get
>>>> an unbounded stream of records and operators on these streams.
>>>> Funny that you ask about a video for the DataStream slides. There is a
>>>> Flink training happening as we speak, and a video is being recorded right
>>>> now :-) Hopefully it will be made available soon.
>>>> Best,
>>>> Kostas
>>>> On Wed, Sep 2, 2015 at 1:13 PM, Juan Rodríguez Hortalá <
>>>> juan.rodriguez.hortala@gmail.com> wrote:
>>>>> Answering myself, I have found some nice training material at
>>>>> http://dataartisans.github.io/flink-training. There are even videos
>>>>> on YouTube for some of the slides:
>>>>>   - http://dataartisans.github.io/flink-training/overview/intro.html
>>>>>     https://www.youtube.com/watch?v=XgC6c4Wiqvs
>>>>>   - http://dataartisans.github.io/flink-training/dataSetBasics/intro.html
>>>>>     https://www.youtube.com/watch?v=0EARqW15dDk
>>>>> The third lecture
>>>>> http://dataartisans.github.io/flink-training/dataSetAdvanced/intro.html
>>>>> more or less corresponds to
>>>>> https://www.youtube.com/watch?v=1yWKZ26NQeU but not exactly, and
>>>>> there are more lessons at http://dataartisans.github.io/flink-training,
>>>>> for stream processing and the table API for which I haven't found a
>>>>> video. Does anyone have pointers to the missing videos?
>>>>> Greetings,
>>>>> Juan
>>>>> 2015-09-02 12:50 GMT+02:00 Juan Rodríguez Hortalá <
>>>>> juan.rodriguez.hortala@gmail.com>:
>>>>>> Hi list,
>>>>>> I'm new to Flink, and I find this project very interesting. I have
>>>>>> experience with Apache Spark, and from what I've seen so far I find that
>>>>>> Flink provides an API at a similar abstraction level but based on
>>>>>> single-record processing instead of batch processing. I've read in Quora that Flink
>>>>>> extends stream processing to batch processing, while Spark extends batch
>>>>>> processing to streaming. Therefore I find Flink especially attractive for
>>>>>> low latency stream processing. Anyway, I would appreciate it if someone could
>>>>>> give some indication about where I could find a list of hardware
>>>>>> requirements for the slave nodes in a Flink cluster. Something along the
>>>>>> lines of
>>>>>> https://spark.apache.org/docs/latest/hardware-provisioning.html.
>>>>>> Spark is known for having quite high minimal memory requirements
>>>>>> and 8 cores minimum), and I was wondering if it is also the case
>>>>>> for Flink.
>>>>>> Lower memory requirements would be very interesting for building
>>>>>> Flink clusters for educational purposes, or for small projects.
>>>>>> Apart from that, I wonder if there is some blog post by the community
>>>>>> about transitioning from Spark to Flink. I think it could be interesting,
>>>>>> as there are some similarities in the APIs, but also deep differences in
>>>>>> the underlying approaches. I was thinking of something like Breeze's
>>>>>> cheatsheet comparing its matrix operations with those available in Matlab
>>>>>> and Numpy
>>>>>> https://github.com/scalanlp/breeze/wiki/Linear-Algebra-Cheat-Sheet,
>>>>>> or like http://rosettacode.org/wiki/Factorial. Just an idea anyway.
>>>>>> Also, any pointer to some online course, book or training for Flink
>>>>>> besides the official programming guides would be much appreciated.
>>>>>> Thanks in advance for your help.
>>>>>> Greetings,
>>>>>> Juan
>> --
>> jay vyas
> --
> jay vyas
