flink-dev mailing list archives

From Krzysztof Pasierbinski <Krzysztof.Pasierbin...@dfki.de>
Subject Re: Cluster execution of an example program ("Word count") and a problem related to the modified example
Date Sun, 29 Jun 2014 14:26:57 GMT
Hi Fabian,
I have copied the input file to the second node and it worked. Indeed, I have to use HDFS.
However, I still don't understand why the size of the input file matters in this case.
Once again, thank you very much for your help!

-----Original Message-----
From: Krzysztof Pasierbinski [mailto:Krzysztof.Pasierbinski@dfki.de]
Sent: Sunday, 29 June 2014 16:17
To: dev@flink.incubator.apache.org
Subject: Re: Cluster execution of an example program ("Word count") and a problem related
to the modified example

Hi Fabian,
thank you for the explanation. I was aware that there is no input file on the worker
node, and for small files it worked fine. I assumed that the relevant part of the input file
would be replicated automatically. The result path is available on both machines, as I set up
the same file system on each node. I see that I have to use HDFS for my use case.

-----Original Message-----
From: Fabian Hueske [mailto:fhueske@gmail.com]
Sent: Sunday, 29 June 2014 16:05
To: dev@flink.incubator.apache.org
Subject: Re: Cluster execution of an example program ("Word count") and a problem related
to the modified example

Hi Krzysztof,

reading and writing data from the local file system in a distributed setup is always a bit
tricky.
For Flink, the input files must be available in the local file system of each worker node (and,
I think, also on the master node), i.e., the data needs to be copied (replicated) to each machine,
or the directory must be shared, for example via NFS (note that using a shared directory might
cause very poor I/O performance).
For output files, the result path must be available on each machine.
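
For example, here is a minimal sketch of a job that uses only the local FS (the
org.apache.flink package names and the paths below are illustrative; during
incubation the packages were still being renamed, so check your version):

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class LocalFsExample {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            // A "file://" path is resolved locally on *every* node that runs
            // a subtask, so this exact path must exist on each machine.
            DataSet<String> text = env.readTextFile("file:///data/input.txt");
            // The same holds for the sink: each parallel writer creates its
            // part of the result under this local path.
            text.writeAsText("file:///data/output");
            env.execute("Local FS example");
        }
    }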

As Aljoscha said, the preferred way to go in a distributed scenario is to use a distributed
file system (HDFS).
Nonetheless, using the local FS in a distributed setup should work.

Have you checked if your input data
(/home/krzysztof/stratosphere05/generatedFrequencies.txt) is available on each machine (workers
+ master)?

Cheers, Fabian



2014-06-29 15:06 GMT+02:00 Krzysztof Pasierbinski <
Krzysztof.Pasierbinski@dfki.de>:

> Hi all,
> thank you all for the prompt replies. It is great to know that there is
> such strong community support. Yes indeed, I don't use Hadoop yet. I
> wanted to try out the Flink framework and then integrate it with Hadoop. I
> have read somewhere that Hadoop is not obligatory.
> I wonder why the same program with the same configuration works fine
> for small files and this error appears only for bigger ones. The
> example program "Word count" always works fine, so I suppose the
> mistake is somewhere on my side.
>
>
> -----Original Message-----
> From: Aljoscha Krettek [mailto:aljoscha@apache.org]
> Sent: Sunday, 29 June 2014 09:24
> To: dev@flink.incubator.apache.org
> Subject: Re: Cluster execution of an example program ("Word count")
> and a problem related to the modified example
>
> Hi Krzysztof,
> for the file access problem: from the path, it looks like you are
> accessing the files as local files rather than as files in a distributed
> file system (HDFS is the default here). So one of the nodes can access
> the file because it is actually on the machine where its code is
> running, while the other code executes on a machine where the file is
> not available. This page explains how to set up Hadoop with HDFS:
> http://hadoop.apache.org/docs/r1.2.1/cluster_setup.html . You only
> need to start HDFS, though, with "bin/start-dfs.sh". For accessing
> files inside HDFS from Flink you would use a path such as "hdfs:///foo/bar".
>
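> For example (a minimal sketch; "hdfs:///foo/bar" assumes that the default
> file system in your Hadoop configuration points to the namenode):
>
>     ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>     // With an hdfs:// URI every worker reads from the distributed file
>     // system, so it does not matter on which machine a subtask runs.
>     DataSet<String> text = env.readTextFile("hdfs:///foo/bar");
>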
> Please write again if you need more help.
>
> Aljoscha
>
>
> On Sat, Jun 28, 2014 at 10:57 PM, Ufuk Celebi <u.celebi@fu-berlin.de>
> wrote:
>
> >
> > > On 28 Jun 2014, at 22:52, Stephan Ewen <sewen@apache.org> wrote:
> > >
> > > Hey!
> > >
> > > You can always get the result in a single file by setting the
> > > parallelism of the sink task to one, for example:
> > > "result.writeAsText(path).parallelism(1)".
> >
> > Oh sure. I realized this after sending the mail. Thanks for pointing 
> > it out. :)
> >
>