From: Stefano Bortoli <s.bortoli@gmail.com>
Date: Thu, 4 Dec 2014 12:48:53 +0100
Subject: Re: No Space Left on Device
To: user@flink.incubator.apache.org

JAXB and serialization are necessary for my business logic. I store data as
byte[], which is a plain serialization of an XML String. At every read I have
to rebuild the objects using JAXB. Kryo in Flink will make it easier to manage
user-defined objects, I guess.

saluti,
Stefano

2014-12-04 12:41 GMT+01:00 Stephan Ewen <sewen@apache.org>:

> Hi Stefano!
>
> Good to hear that it is working for you!
>
> Just a heads up: Flink is not using JAXB or any other Java serialization
> for its data exchange, only to deploy functions into the cluster (which is
> usually very fast). When we send records around, we have a special
> serialization stack that is absolutely competitive with Kryo on
> serialization speed. We are thinking of using Kryo, though, to deploy
> functions into the cluster in the future, to work around some of the
> constraints that Java serialization has.
>
> Greetings,
> Stephan
>
>
> On Thu, Dec 4, 2014 at 8:48 AM, Stefano Bortoli <s.bortoli@gmail.com>
> wrote:
>
>> The process was completed in about 6h45m, much less than the previous
>> one. The longest time is still taken by the 'blocking part'. I guess we
>> could just increase the redundancy of the SolrCloud indexes, and we could
>> reach amazing performance.
>> Furthermore, we did not apply any 'key transformation' (reversing or
>> generating a Long as ID), so we have further margin for improvement.
>> Also, I have the feeling that relying on Kryo serialization to build the
>> POJOs, rather than old-school JAXB marshalling/unmarshalling, would give
>> quite a boost, as we repeat the operation at least 250M times. :-)
>>
>> Thanks a lot to everyone. Flink is making effective deduplication
>> possible on a very heterogeneous dataset of about 10M entries within
>> hours, on a cluster of 6 cheap hardware nodes. :-)
>>
>> saluti,
>> Stefano
>>
>> 2014-12-03 18:31 GMT+01:00 Stefano Bortoli <s.bortoli@gmail.com>:
>>
>>> Hi all,
>>>
>>> thanks for the feedback. For the moment, I hope I resolved the problem
>>> by compressing the string into a byte[] using a custom implementation
>>> of the Value interface and the LZ4 algorithm. I have a little overhead
>>> on the processing of some steps, but it should reduce network traffic
>>> and the required temporary space on disk.
>>>
>>> I think the problem is due to the two joins moving around quite a bit
>>> of data. Essentially, I join twice something like 230 million tuples
>>> with a dataset of 9.2 million entries (~80GB). Compression seems to be
>>> working fine so far, even though I have not reached the critical point
>>> yet. I'll keep you posted to let you know whether this workaround
>>> solved the problem.
>>>
>>> I applied the double join as an alternative to repeating 230M*2 single
>>> gets on HBase; even so, it allowed the process to complete in about 11h.
>>>
>>> thanks a lot to everyone again.
>>>
>>> saluti,
>>> Stefano
>>>
>>> 2014-12-03 18:02 GMT+01:00 Flavio Pompermaier <pompermaier@okkam.it>:
>>>
>>>> I think I can answer on behalf of Stefano, who is busy right now. The
>>>> job failed because on the job manager (which is also a task manager)
>>>> the temp folder was full.
>>>> We would like to understand how big the temp directory should be.
>>>> Which parameters should we consider to make that computation?
>>>>
>>>> On Wed, Dec 3, 2014 at 5:22 PM, Ufuk Celebi <uce@apache.org> wrote:
>>>>
>>>>> The task managers log the temporary directories at start up. Can you
>>>>> have a look there and verify that you configured the temporary
>>>>> directories correctly?
>>>>>
>>>>> On Wed, Dec 3, 2014 at 5:17 PM, Stephan Ewen <sewen@apache.org> wrote:
>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> That exception means that one of the directories is full. If you
>>>>>> have several temp directories on different disks, you can add them
>>>>>> all to the config and the temp files will be rotated across the
>>>>>> disks.
>>>>>>
>>>>>> The exception may come once the first temp directory is full. For
>>>>>> example, if you have 4 temp dirs (where 1 is rather full while the
>>>>>> others have a lot of space), it may be that one temp file on the
>>>>>> full directory grows large and exceeds the space, while the other
>>>>>> directories have plenty of space.
>>>>>>
>>>>>> Greetings,
>>>>>> Stephan
>>>>>>
>>>>>> On Wed, Dec 3, 2014 at 4:40 PM, Robert Metzger <rmetzger@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I think Flink is deleting its temporary files.
>>>>>>>
>>>>>>> Is the temp path set to the SSD on each machine?
>>>>>>> What is the size of the two data sets you are joining? Your cluster
>>>>>>> has 6*256GB = 1.5 TB of temporary disk space.
>>>>>>> Maybe only the temp directory of one node is full?
>>>>>>>
>>>>>>> On Wed, Dec 3, 2014 at 3:52 PM, Ufuk Celebi <uce@apache.org> wrote:
>>>>>>>
>>>>>>>> Hey Stefano,
>>>>>>>>
>>>>>>>> I would wait for Stephan's take on this, but with caught
>>>>>>>> IOExceptions the hash table should properly clean up after itself
>>>>>>>> and delete the file.
>>>>>>>>
>>>>>>>> Can you still reproduce this problem for your use case?
>>>>>>>>
>>>>>>>> – Ufuk
>>>>>>>>
>>>>>>>> On Tue, Dec 2, 2014 at 7:07 PM, Stefano Bortoli <
>>>>>>>> s.bortoli@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi guys,
>>>>>>>>>
>>>>>>>>> a quite long process failed due to this No Space Left on Device
>>>>>>>>> exception, but the machine disk is not full at all.
>>>>>>>>>
>>>>>>>>> okkam@okkam-nano-2:/opt/flink-0.8$ df
>>>>>>>>> Filesystem      1K-blocks      Used  Available Use% Mounted on
>>>>>>>>> /dev/sdb2       223302236  22819504  189116588  11% /
>>>>>>>>> none                    4         0          4   0% /sys/fs/cgroup
>>>>>>>>> udev              8156864         4    8156860   1% /dev
>>>>>>>>> tmpfs             1633520       524    1632996   1% /run
>>>>>>>>> none                 5120         0       5120   0% /run/lock
>>>>>>>>> none              8167584         0    8167584   0% /run/shm
>>>>>>>>> none               102400         0     102400   0% /run/user
>>>>>>>>> /dev/sdb1          523248      3428     519820   1% /boot/efi
>>>>>>>>> /dev/sda1       961302560   2218352  910229748   1% /media/data
>>>>>>>>> cm_processes      8167584     12116    8155468   1% /run/cloudera-scm-agent/process
>>>>>>>>>
>>>>>>>>> Is it possible that the temporary files were deleted 'after the
>>>>>>>>> problem'? I read so, but there was no confirmation. However, it
>>>>>>>>> is a 256GB SSD disk. Each of the 6 nodes has one.
>>>>>>>>>
>>>>>>>>> Here is the stack trace:
>>>>>>>>>
>>>>>>>>> 16:37:59,581 ERROR org.apache.flink.runtime.operators.RegularPactTask - Error in task code: CHAIN Join (org.okkam.flink.maintenance.deduplication.consolidate.Join2ToGetCandidates) -> Filter (org.okkam.flink.maintenance.deduplication.match.SingleMatchFilterFunctionWithFlagMatch) -> Map (org.okkam.flink.maintenance.deduplication.match.MapToTuple3MapFunction) -> Combine(org.apache.flink.api.java.operators.DistinctOperator$DistinctFunction) (4/28)
>>>>>>>>> java.io.IOException: The channel is erroneous.
>>>>>>>>>     at org.apache.flink.runtime.io.disk.iomanager.ChannelAccess.checkErroneous(ChannelAccess.java:132)
>>>>>>>>>     at org.apache.flink.runtime.io.disk.iomanager.BlockChannelWriter.writeBlock(BlockChannelWriter.java:73)
>>>>>>>>>     at org.apache.flink.runtime.io.disk.iomanager.ChannelWriterOutputView.writeSegment(ChannelWriterOutputView.java:218)
>>>>>>>>>     at org.apache.flink.runtime.io.disk.iomanager.ChannelWriterOutputView.nextSegment(ChannelWriterOutputView.java:204)
>>>>>>>>>     at org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.advance(AbstractPagedOutputView.java:140)
>>>>>>>>>     at org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeByte(AbstractPagedOutputView.java:223)
>>>>>>>>>     at org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.write(AbstractPagedOutputView.java:173)
>>>>>>>>>     at org.apache.flink.types.StringValue.writeString(StringValue.java:808)
>>>>>>>>>     at org.apache.flink.api.common.typeutils.base.StringSerializer.serialize(StringSerializer.java:68)
>>>>>>>>>     at org.apache.flink.api.common.typeutils.base.StringSerializer.serialize(StringSerializer.java:28)
>>>>>>>>>     at org.apache.flink.api.java.typeutils.runtime.TupleSerializer.serialize(TupleSerializer.java:95)
>>>>>>>>>     at org.apache.flink.api.java.typeutils.runtime.TupleSerializer.serialize(TupleSerializer.java:30)
>>>>>>>>>     at org.apache.flink.runtime.operators.hash.HashPartition.insertIntoProbeBuffer(HashPartition.java:269)
>>>>>>>>>     at org.apache.flink.runtime.operators.hash.MutableHashTable.processProbeIter(MutableHashTable.java:474)
>>>>>>>>>     at org.apache.flink.runtime.operators.hash.MutableHashTable.nextRecord(MutableHashTable.java:537)
>>>>>>>>>     at org.apache.flink.runtime.operators.hash.BuildSecondHashMatchIterator.callWithNextKey(BuildSecondHashMatchIterator.java:106)
>>>>>>>>>     at org.apache.flink.runtime.operators.MatchDriver.run(MatchDriver.java:148)
>>>>>>>>>     at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:484)
>>>>>>>>>     at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:359)
>>>>>>>>>     at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:246)
>>>>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>>>>> Caused by: java.io.IOException: No space left on device
>>>>>>>>>     at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>>>>>>>>>     at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
>>>>>>>>>     at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>>>>>>>>>     at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>>>>>>>>>     at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:205)
>>>>>>>>>     at org.apache.flink.runtime.io.disk.iomanager.SegmentWriteRequest.write(BlockChannelAccess.java:259)
>>>>>>>>>     at org.apache.flink.runtime.io.disk.iomanager.IOManager$WriterThread.run(IOManager.java:636)
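
[Editor's note: Stephan's advice above, spreading temporary files over several disks by listing multiple temp directories in the config, corresponds to the task manager temp-directory setting in flink-conf.yaml; in the Flink 0.x releases this was the `taskmanager.tmp.dirs` key, which accepts a colon-separated list of paths. The paths below are illustrative, not taken from the thread:]

```yaml
# flink-conf.yaml — one temp directory per physical disk (example paths);
# Flink rotates spill files across all listed directories.
taskmanager.tmp.dirs: /disk1/flink-tmp:/disk2/flink-tmp:/media/data/flink-tmp
```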
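
[Editor's note: as an illustration of the workaround Stefano describes, compressing the XML string into a byte[] behind a custom Value so only the compressed form is shipped and spilled, here is a minimal stand-alone sketch. The thread used LZ4 behind an implementation of Flink's Value interface; the class name `CompressedStringValue` is hypothetical, and `java.util.zip`'s Deflater/Inflater stand in for an LZ4 library purely to keep the example dependency-free:]

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch of compress-on-write / decompress-on-read for a large XML string.
// A real Flink version would implement org.apache.flink.types.Value and
// (de)serialize `compressed` in its read/write methods.
public class CompressedStringValue {
    private byte[] compressed;

    public void set(String value) {
        byte[] raw = value.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf)); // append each compressed chunk
        }
        deflater.end();
        this.compressed = out.toByteArray();
    }

    public String get() throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf)); // rebuild the original bytes
        }
        inflater.end();
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append("<entry id=\"42\"><name>duplicate candidate</name></entry>");
        }
        String xml = sb.toString();
        CompressedStringValue v = new CompressedStringValue();
        v.set(xml);
        if (!v.get().equals(xml)) throw new AssertionError("round-trip failed");
        System.out.println("round-trip OK: " + xml.length()
                + " chars stored as " + v.compressed.length + " bytes");
    }
}
```

The pay-off is exactly the trade Stefano mentions: a little CPU per record in exchange for less data crossing the network and spilling to the temp directories during the two large joins.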