flink-dev mailing list archives

From Robert Metzger <rmetz...@apache.org>
Subject Re: KMeans job gets stuck and never completes
Date Sun, 22 Jun 2014 13:14:50 GMT
Thank you. What degree of parallelism are you using when submitting the job?
You can set it either with the "-p" argument or via
env.setDegreeOfParallelism().
How much heap space do you assign to the TaskManagers?
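
In case it helps, here is a rough sketch of how both can be set. The class and
package names are taken from the Stratosphere 0.5-era Java API and the paths
are placeholders, so adjust them to the version you are running:

    // Sketch only: sets the job-wide degree of parallelism programmatically.
    // The same value can also be passed with "-p <n>" when submitting the job.
    import eu.stratosphere.api.java.DataSet;
    import eu.stratosphere.api.java.ExecutionEnvironment;

    public class ParallelismSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            // Ideally not larger than the total number of task slots in the cluster.
            env.setDegreeOfParallelism(4);
            DataSet<String> points = env.readTextFile("file:///tmp/points.txt");
            points.writeAsText("file:///tmp/points-out");
            env.execute("parallelism sketch");
        }
    }

The TaskManager heap is usually set in conf/stratosphere-conf.yaml; the value
below is only an example:

    taskmanager.heap.mb: 512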



On Sun, Jun 22, 2014 at 3:07 PM, José Luis López Pino <jllopezpino@gmail.com> wrote:

> Hi,
>
> I'm using two VPS instances with the following input for the program:
> - Iterations: 2
> - Dimensions: 2 (3 for the scala example program)
> - Number of centers (k): 10
>
> This is my current configuration for the network buffers (I think these are
> the default values):
> # Number of network buffers (used by each TaskManager)
> taskmanager.network.numberOfBuffers: 2048
> # Size of network buffers
> taskmanager.network.bufferSizeInBytes: 32768
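>
> I guess raising them would look roughly like this in the configuration
> (the numbers are just an illustration, I have not tested these values):
>
>   # 8192 buffers of 32 KB each reserve about 256 MB per TaskManager
>   taskmanager.network.numberOfBuffers: 8192
>   taskmanager.network.bufferSizeInBytes: 32768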
>
> Regards // Saludos // Mit Freundlichen Grüßen // Bien cordialement,
> Pino
>
>
> On 22 June 2014 14:19, Robert Metzger <rmetzger@apache.org> wrote:
>
>> Workers waiting in "LocalBufferPool.requestBuffer()" are usually a sign of
>> a distributed deadlock.
>> Can you send me some instructions on how to get the same input data you
>> have (download URL? generator settings?) and what configuration parameters
>> you are using (max iteration limit, k, ...) when calling the K-Means
>> example.
>> I would like to try it on our cluster.
>>
>> Just out of curiosity, what hardware are you using? Is it the IBM Power
>> cluster at TU Berlin?
>>
>> Robert
>>
>>
>> On Sun, Jun 22, 2014 at 1:53 PM, Sebastian Schelter <ssc.open@googlemail.com> wrote:
>>
>> > You could try to increase the number of buffers available to the network
>> > stack. That solved similar problems for me in the past.
>> >
>> > -s
>> > On 22.06.2014 at 13:48, "José Luis López Pino" <jllopezpino@gmail.com> wrote:
>> >
>> > > It seems like the thread reading the points file is blocked waiting for a
>> > > buffer from the global buffer pool that never arrives. What could be
>> > > causing this?
>> > >
>> > >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>> > >      at java.lang.Object.wait(Native Method)
>> > >      - waiting on <0x6b985888> (a java.util.ArrayDeque)
>> > >      at eu.stratosphere.runtime.io.network.bufferprovider.LocalBufferPool.requestBuffer(LocalBufferPool.java:160)
>> > >      - locked <0x6b985888> (a java.util.ArrayDeque)
>> > >      at eu.stratosphere.runtime.io.network.bufferprovider.LocalBufferPool.requestBufferBlocking(LocalBufferPool.java:101)
>> > >      at eu.stratosphere.runtime.io.gates.InputGate.requestBufferBlocking(InputGate.java:333)
>> > >      at eu.stratosphere.runtime.io.channels.InputChannel.requestBufferBlocking(InputChannel.java:426)
>> > >      at eu.stratosphere.runtime.io.network.ChannelManager.dispatchFromOutputChannel(ChannelManager.java:441)
>> > >      at eu.stratosphere.runtime.io.channels.OutputChannel.sendBuffer(OutputChannel.java:74)
>> > >      at eu.stratosphere.runtime.io.gates.OutputGate.sendBuffer(OutputGate.java:49)
>> > >      at eu.stratosphere.runtime.io.api.BufferWriter.sendBuffer(BufferWriter.java:35)
>> > >      at eu.stratosphere.runtime.io.api.RecordWriter.emit(RecordWriter.java:96)
>> > >      at eu.stratosphere.pact.runtime.shipping.OutputCollector.collect(OutputCollector.java:82)
>> > >      at eu.stratosphere.pact.runtime.task.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:71)
>> > >      at eu.stratosphere.pact.runtime.task.DataSourceTask.invoke(DataSourceTask.java:228)
>> > >      at eu.stratosphere.nephele.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:284)
>> > >      at java.lang.Thread.run(Thread.java:744)
>> > >
>> > >
>> > > Thanks for your help, Sebastian.
>> > >
>> > > Regards // Saludos // Mit Freundlichen Grüßen // Bien cordialement,
>> > > Pino
>> > >
>> > >
>> > > On 22 June 2014 13:38, Sebastian Schelter <ssc.open@googlemail.com> wrote:
>> > >
>> > > > Have you looked at a jstack dump on one of the workers? That typically
>> > > > helps find out where the processes are stuck.
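>> > > > For example, with the standard JDK tools (nothing Stratosphere-specific;
>> > > > the process id is a placeholder):
>> > > >
>> > > >   jps -l                            # find the TaskManager's process id
>> > > >   jstack -l <taskmanager-pid> > taskmanager-threads.txt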
>> > > >
>> > > > -s
>> > > > On 22.06.2014 at 13:32, "José Luis López Pino" <jllopezpino@gmail.com> wrote:
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > > I'm running the KMeans Java and Scala examples on two nodes. It works
>> > > > > fine with very small files (3 MB), but when I try with files of 30 MB or
>> > > > > bigger, the process never ends. After several hours, the DataChain process
>> > > > > that is reading the input points is still working.
>> > > > >
>> > > > > I have tried before with much bigger files in the same environment and I
>> > > > > had no issue. I have already tried:
>> > > > > - Checking that the process is not locked up consuming all the CPU time.
>> > > > > - Formatting the datanodes.
>> > > > > - Compiling the latest version available on GitHub.
>> > > > > - Enabling the debug log mode, which doesn't give any additional information.
>> > > > >
>> > > > > Could someone give me a hint about where to look? Thanks for your help!
>> > > > >
>> > > > > Regards // Saludos // Mit Freundlichen Grüßen // Bien cordialement,
>> > > > > Pino
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
