Return-Path: X-Original-To: apmail-incubator-chukwa-user-archive@www.apache.org Delivered-To: apmail-incubator-chukwa-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id D1EE36794 for ; Wed, 27 Jul 2011 05:56:22 +0000 (UTC) Received: (qmail 87666 invoked by uid 500); 27 Jul 2011 05:55:51 -0000 Delivered-To: apmail-incubator-chukwa-user-archive@incubator.apache.org Received: (qmail 86757 invoked by uid 500); 27 Jul 2011 05:55:37 -0000 Mailing-List: contact chukwa-user-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: chukwa-user@incubator.apache.org Delivered-To: mailing list chukwa-user@incubator.apache.org Received: (qmail 86679 invoked by uid 99); 27 Jul 2011 05:55:33 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 27 Jul 2011 05:55:33 +0000 X-ASF-Spam-Status: No, hits=1.6 required=5.0 tests=FREEMAIL_ENVFROM_END_DIGIT,FREEMAIL_FROM,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL,WEIRD_PORT X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of ivytang0812@gmail.com designates 209.85.210.175 as permitted sender) Received: from [209.85.210.175] (HELO mail-iy0-f175.google.com) (209.85.210.175) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 27 Jul 2011 05:55:25 +0000 Received: by iyj12 with SMTP id 12so1578558iyj.6 for ; Tue, 26 Jul 2011 22:55:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type; bh=M8Oh2lg0dteZ9sqA9jtrqO0AAGw9hUy6Tl5dCpfIoRA=; b=wGxU5d2b1erpX1cdrlpex2DHWda8zZszWNCzdqu6YeS7ByJZbL/hSqQZKEeZjs2EGM Mb2xko8vjrKN+w9OzdW8qeHU71hSf2MZuXvHJEzb7UZAeWtjHcl3XdpyNnJpl37x9kCP 1B0kDDzKyLOIGtna4NdaByg4LC+yZYVvbepf4= Received: by 10.231.121.30 with SMTP id f30mr6344955ibr.109.1311746104064; Tue, 26 Jul 2011 22:55:04 -0700 (PDT) MIME-Version: 1.0 Received: by 10.231.190.4 with HTTP; Tue, 26 Jul 2011 22:54:44 -0700 (PDT) In-Reply-To: References: <3805815A-B641-4E71-A1B6-6D66E92FB055@gmail.com> <2F10CD65-C922-4F1A-BD8B-ABFEA6B84FE3@gmail.com> From: Ying Tang Date: Wed, 27 Jul 2011 13:54:44 +0800 Message-ID: Subject: Re: chukwa agent doesn't collect the log suddenly , and after several days ,the agent crashes. To: chukwa-user@incubator.apache.org Content-Type: multipart/alternative; boundary=0016e64ec93e917a5804a906b01b X-Virus-Checked: Checked by ClamAV on apache.org --0016e64ec93e917a5804a906b01b Content-Type: text/plain; charset=ISO-8859-1 And after i restart the chukwa agent , the telnet adaptor last number is still 10487067. On Wed, Jul 27, 2011 at 1:52 PM, Ying Tang wrote: > Does this mean the adaptor still transfer the previous log ? The current > log is missing? > > > On Wed, Jul 27, 2011 at 1:08 PM, Eric Yang wrote: > >> This looks like a bug, the last number should be in sync with the >> current file's size, but the UTF adaptor is still tailing the previous >> file (which rotated at 10487067) >> It means there is a bug in handling the file rotation, but the adaptor >> did not pick up the change. >> >> Please open a jira. Thanks >> >> regards, >> Eric >> >> On Tue, Jul 26, 2011 at 8:05 PM, Ying Tang wrote: >> > The log didn't rotate very rapidly. >> > >> > Now i can't rebuild the scenario . But when the chukwa agent log looks >> ok, >> > >> > 2011-07-27 10:57:38,967 INFO Timer-0 ChukwaAgent - writing checkpoint >> > 1307083 >> > 2011-07-27 10:57:42,571 INFO HTTP post thread ChukwaHttpSender - >> collected 1 >> > chunks for post_745 >> > 2011-07-27 10:57:42,571 INFO HTTP post thread ChukwaHttpSender - >>>>>> >> HTTP >> > post_745 to http://chukwacollector1.xingcloud.com:9095/ length = 1837 >> > 2011-07-27 10:57:42,574 INFO HTTP post thread ChukwaHttpSender - >>>>>> >> HTTP >> > Got success back from http://chukwacollector1.xingcloud.com:9095/chukwa >> ; >> > response length 43 >> > 2011-07-27 10:57:42,574 INFO HTTP post thread ChukwaHttpSender - >> post_745 >> > sent 0 chunks, got back 1 acks >> > >> > The list in telnet agent 9093 is: >> > adaptor_2963225a90653a309cf779d4a1d815a3) >> > >> org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.CharFileTailingAdaptorUTF8 >> > Gamelog 0 /var/log/gamelog 10487067 >> > After several minites , the list is still >> > adaptor_2963225a90653a309cf779d4a1d815a3) >> > >> org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.CharFileTailingAdaptorUTF8 >> > Gamelog 0 /var/log/gamelog 10487067 >> > >> > Is the 10487067 the offset number ?The number didn't changed , and the >> log >> > file's size is from 0 to 10M .And now the log file's size is 1150872. >> > >> > On Wed, Jul 27, 2011 at 12:26 AM, Eric Yang wrote: >> >> >> >> CharFileTailingAdaptorUTF should handle log rotation gracefully. Is >> the >> >> log rotating rapidly? >> >> Run those command on chukwa agent: >> >> telnet localhost 9093 >> >> list >> >> This should show a list of tailing files, and check the offset number >> of >> >> the tailing log file. The most right number should be smaller than the >> size >> >> of your log file. If it is bigger and not changing, it is most likely >> there >> >> is a bug that we haven't seen before. It might be useful to turn on >> debug >> >> on chukwa agent and see if this can be reproduced to nail down the root >> >> cause. Thanks >> >> regards, >> >> Eric >> >> On Jul 26, 2011, at 6:13 AM, Ying Tang wrote: >> >> >> >> Is there the possibility that >> >> when the log file reaches the log4g config file size ,the log4j will >> >> rename this log file and create a new file with this name as the log >> file . >> >> At the time ,the chukwa adaptor doesn't tail the log properly , and >> this >> >> cause the chuwa agent can't collector the log any more. >> >> >> >> On Tue, Jul 26, 2011 at 2:07 PM, Ying Tang >> wrote: >> >>> >> >>> The log file is log4j log file ,and the size is 10M ,the >> maxbackupindex >> >>> is 1. >> >>> >> >>> >> >>> On Tue, Jul 26, 2011 at 1:42 PM, Eric Yang wrote: >> >>>> >> >>>> Can you run "ls -l" to show the size and dateof the log files that >> you >> >>>> are streaming? >> >>>> >> >>>> regards, >> >>>> Eric >> >>>> >> >>>> On Mon, Jul 25, 2011 at 7:36 PM, Ying Tang >> >>>> wrote: >> >>>> > The chukwa version is 0.4.0 and the adaptor is >> >>>> > >> >>>> > >> org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.CharFileTailingAdaptorUTF8 >> >>>> > >> >>>> > On Mon, Jul 25, 2011 at 11:50 PM, Eric Yang >> wrote: >> >>>> >> >> >>>> >> Hi Ivy, >> >>>> >> >> >>>> >> When data is send from agent to collector, collector send >> >>>> >> acknowledgment >> >>>> >> of receiving of the chunks. At 00:03:28, there are 5 chunks >> >>>> >> acknowledged. >> >>>> >> This means communication between collector and agent are working >> at >> >>>> >> that >> >>>> >> point in time. However, there is no activity after 00:04:28. >> This >> >>>> >> looks >> >>>> >> like adaptor did not handle the log rotation properly at close to >> >>>> >> midnight. >> >>>> >> Which version of Chukwa are you using and which adaptor are you >> >>>> >> using? >> >>>> >> >> >>>> >> regards, >> >>>> >> Eric >> >>>> >> >> >>>> >> On Jul 25, 2011, at 12:40 AM, Ying Tang wrote: >> >>>> >> >> >>>> >> > Hi all, >> >>>> >> > >> >>>> >> > In my cluster , i have two chukwa agent and one collector . >> >>>> >> > At a time , both chukwa agents's log : >> >>>> >> > 2011-07-18 00:03:28,688 INFO Timer-1 HttpConnector - # http >> chunks >> >>>> >> > ACK'ed since last report: 5 >> >>>> >> > 2011-07-18 00:04:28,697 INFO Timer-1 HttpConnector - # http >> chunks >> >>>> >> > ACK'ed since last report: 0 >> >>>> >> > 2011-07-18 00:05:28,706 INFO Timer-1 HttpConnector - # http >> chunks >> >>>> >> > ACK'ed since last report: 0 >> >>>> >> > 2011-07-18 00:06:28,714 INFO Timer-1 HttpConnector - # http >> chunks >> >>>> >> > ACK'ed since last report: 0 >> >>>> >> > 2011-07-18 00:07:29,340 INFO Timer-1 HttpConnector - # http >> chunks >> >>>> >> > ACK'ed since last report: 0 >> >>>> >> > >> >>>> >> > And the collector >> >>>> >> > 2011-07-17 11:02:32,155 INFO Timer-3 SeqFileWriter - >> >>>> >> > stat:datacollection.writer.hdfs dataSize=0 dataRate=0 >> >>>> >> > 2011-07-17 11:02:43,074 INFO Timer-1 root - >> >>>> >> > stats:ServletCollector,numberHTTPConnection:0,numberchunks:0 >> >>>> >> > 2011-07-17 11:03:02,162 INFO Timer-3 SeqFileWriter - >> >>>> >> > stat:datacollection.writer.hdfs dataSize=0 dataRate=0 >> >>>> >> > 2011-07-17 11:03:32,168 INFO Timer-3 SeqFileWriter - >> >>>> >> > stat:datacollection.writer.hdfs dataSize=0 dataRate=0 >> >>>> >> > 2011-07-17 11:03:43,085 INFO Timer-1 root - >> >>>> >> > stats:ServletCollector,numberHTTPConnection:0,numberchunks:0 >> >>>> >> > 2011-07-17 11:04:02,174 INFO Timer-3 SeqFileWriter - >> >>>> >> > stat:datacollection.writer.hdfs dataSize=0 dataRate=0 >> >>>> >> > 2011-07-17 11:04:32,180 INFO Timer-3 SeqFileWriter - >> >>>> >> > stat:datacollection.writer.hdfs dataSize=0 dataRate=0 >> >>>> >> > 2011-07-17 11:04:43,096 INFO Timer-1 root - >> >>>> >> > stats:ServletCollector,numberHTTPConnection:0,numberchunks:0 >> >>>> >> > 2011-07-17 11:05:02,185 INFO Timer-3 SeqFileWriter - >> >>>> >> > stat:datacollection.writer.hdfs dataSize=0 dataRate=0 >> >>>> >> > >> >>>> >> > (the collector and agent has different timezone) >> >>>> >> > And the collector didn't collect any log. >> >>>> >> > >> >>>> >> > >> >>>> >> > What dons the "http chunks ACK'ed since last report: 0" means? >> >>>> >> > And from this log "http chunks ACK'ed since last report: 0" >> appears >> >>>> >> > to >> >>>> >> > agent crash, the chukwa port still on , but after several days, >> >>>> >> > both agents >> >>>> >> > crashed without exceptions. >> >>>> >> > >> >>>> >> > >> >>>> >> > -- >> >>>> >> > Best regards, >> >>>> >> > >> >>>> >> > Ivy Tang >> >>>> >> > >> >>>> >> > >> >>>> >> > >> >>>> >> >> >>>> > >> >>>> > >> >>>> > >> >>>> > -- >> >>>> > Best regards, >> >>>> > Ivy Tang >> >>>> > >> >>>> > >> >>>> > >> >>> >> >>> >> >>> >> >>> -- >> >>> Best regards, >> >>> Ivy Tang >> >>> >> >>> >> >> >> >> >> >> >> >> -- >> >> Best regards, >> >> Ivy Tang >> >> >> >> >> >> >> > >> > >> > >> > -- >> > Best regards, >> > Ivy Tang >> > >> > >> > >> > > > > -- > Best regards, > > Ivy Tang > > > > -- Best regards, Ivy Tang --0016e64ec93e917a5804a906b01b Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable And after i restart the chukwa agent , the telnet adaptor last number is st= ill 10487067.

On Wed, Jul 27, 2011 at 1:52 PM, Ying Tang <ivytang0812@gmail= .com> wrote:
Does this mean the adaptor still= transfer the previous log ? The current log is missing?=20


On Wed, Jul 27, 2011 at 1:08 PM, Eric Yang <eri= c818@gmail.com> wrote:
This looks like a bug, the last = number should be in sync with the
current file's size, but the UTF a= daptor is still tailing the previous
file (which rotated at 10487067)
It means there is a bug in handling the= file rotation, but the adaptor
did not pick up the change.

Pleas= e open a jira. =A0Thanks

regards,
Eric

On Tue, Jul 26, 2011 at 8:05 PM, Ying Tang <ivytang0812@gmail.com> wrote= :
> The log didn't rotate very =A0rapidly.
>
> Now i = can't rebuild the=A0scenario . But when the chukwa agent log looks ok,<= br> >
> =A02011-07-27 10:57:38,967 INFO Timer-0 ChukwaAgent - writing = checkpoint
> 1307083
> 2011-07-27 10:57:42,571 INFO HTTP post t= hread ChukwaHttpSender - collected 1
> chunks for post_745
> 20= 11-07-27 10:57:42,571 INFO HTTP post thread ChukwaHttpSender - >>>= >>> HTTP
> post_745 to http://chukwacollector1.xingcloud.com:9095/ length =3D = 1837
> 2011-07-27 10:57:42,574 INFO HTTP post thread ChukwaHttpSender= - >>>>>> HTTP
> Got success back from http://chukwacollector1.xingcloud.com:9095/= chukwa;
> response length 43
> 2011-07-27 10:57:42,574 INFO= HTTP post thread ChukwaHttpSender - post_745
> sent 0 chunks, got back 1 acks
>
> The list in telnet agen= t 9093 is:
> adaptor_2963225a90653a309cf779d4a1d815a3)
> org.ap= ache.hadoop.chukwa.datacollection.adaptor.filetailer.CharFileTailingAdaptor= UTF8
> Gamelog 0 /var/log/gamelog 10487067
> After several minites ,=A0= the list is still
> adaptor_2963225a90653a309cf779d4a1d815a3)
>= ; org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.CharFileTailin= gAdaptorUTF8
> Gamelog 0 /var/log/gamelog 10487067
>
> Is the 10487067 th= e offset number ?The number didn't changed=A0, and the log
> file= 's size is from 0 to 10M .And now the log file's size is 1150872. >
> On Wed, Jul 27, 2011 at 12:26 AM, Eric Yang <eric818@gmail.com> wrote:>>
>> CharFileTailingAdaptorUTF should handle log rotation= gracefully. =A0Is the
>> log rotating rapidly?
>> Run those command on chukwa agen= t:
>> telnet localhost 9093
>> list
>> This shou= ld show a list of tailing files, and check the offset number of
>>= the tailing log file. =A0The most right number should be smaller than the = size
>> of your log file. =A0If it is bigger and not changing, it is most = likely there
>> is a bug that we haven't seen before. =A0It mi= ght be useful to turn on debug
>> on chukwa agent and see if this = can be reproduced to nail down the root
>> cause. =A0Thanks
>> regards,
>> Eric
>>= On Jul 26, 2011, at 6:13 AM, Ying Tang wrote:
>>
>> Is t= here the possibility that
>> when the log file reaches the log4g c= onfig file size ,the log4j will
>> rename this log file and create a new file with this name as the l= og file .
>> At the time ,the chukwa adaptor doesn't tail the = log properly , and this
>> cause the chuwa agent can't collect= or the log any more.
>>
>> On Tue, Jul 26, 2011 at 2:07 PM, Ying Tang <ivytang0812@gmail.com> wrote:
>>>
>>> The log file is log4j log fil= e ,and the size is 10M ,the maxbackupindex
>>> is 1.
>>>
>>>
>>> On Tue, = Jul 26, 2011 at 1:42 PM, Eric Yang <
eric818@gmail.com> wrote:
>>>>
>>>> Can you run "ls -l" to show the size and dateof = the log files that you
>>>> are streaming?
>>>&g= t;
>>>> regards,
>>>> Eric
>>>>= ;
>>>> On Mon, Jul 25, 2011 at 7:36 PM, Ying Tang <ivytang0812@gmail.com&g= t;
>>>> wrote:
>>>> > The chukwa version i= s 0.4.0 and the adaptor is
>>>> >
>>>> > org.apache.hadoop.chukwa.dat= acollection.adaptor.filetailer.CharFileTailingAdaptorUTF8
>>>&g= t; >
>>>> > On Mon, Jul 25, 2011 at 11:50 PM, Eric Yan= g <eric818@gmail.= com> wrote:
>>>> >>
>>>> >> Hi Ivy,
>>&= gt;> >>
>>>> >> When data is send from agent = to collector, collector send
>>>> >> acknowledgment >>>> >> of receiving of the chunks. =A0At 00:03:28, there= are 5 chunks
>>>> >> acknowledged.
>>>>= ; >> =A0This means communication between collector and agent are work= ing at
>>>> >> that
>>>> >> point in time. = =A0However, there is no activity after 00:04:28. =A0This
>>>>= ; >> looks
>>>> >> like adaptor did not handle t= he log rotation properly at close to
>>>> >> midnight.
>>>> >> =A0Which v= ersion of Chukwa are you using and which adaptor are you
>>>>= ; >> using?
>>>> >>
>>>> >>= regards,
>>>> >> Eric
>>>> >>
>>>= > >> On Jul 25, 2011, at 12:40 AM, Ying Tang wrote:
>>>= ;> >>
>>>> >> > Hi all,
>>>>= ; >> >
>>>> >> > In my cluster , i have two chukwa agent and = one collector .
>>>> >> > At a time , =A0both chukw= a agents's log :
>>>> >> > 2011-07-18 00:03:28,= 688 INFO Timer-1 HttpConnector - # http chunks
>>>> >> > ACK'ed since last report: 5
>>&= gt;> >> > 2011-07-18 00:04:28,697 INFO Timer-1 HttpConnector - = # http chunks
>>>> >> > ACK'ed since last repor= t: 0
>>>> >> > 2011-07-18 00:05:28,706 INFO Timer-1 HttpCon= nector - # http chunks
>>>> >> > ACK'ed since l= ast report: 0
>>>> >> > 2011-07-18 00:06:28,714 INF= O Timer-1 HttpConnector - # http chunks
>>>> >> > ACK'ed since last report: 0
>>&= gt;> >> > 2011-07-18 00:07:29,340 INFO Timer-1 HttpConnector - = # http chunks
>>>> >> > ACK'ed since last repor= t: 0
>>>> >> >
>>>> >> > And the co= llector
>>>> >> > 2011-07-17 11:02:32,155 INFO Time= r-3 SeqFileWriter -
>>>> >> > stat:datacollection.w= riter.hdfs dataSize=3D0 dataRate=3D0
>>>> >> > 2011-07-17 11:02:43,074 INFO Timer-1 root -<= br>>>>> >> > stats:ServletCollector,numberHTTPConnecti= on:0,numberchunks:0
>>>> >> > 2011-07-17 11:03:02,1= 62 INFO Timer-3 SeqFileWriter -
>>>> >> > stat:datacollection.writer.hdfs dataSize=3D0= dataRate=3D0
>>>> >> > 2011-07-17 11:03:32,168 INF= O Timer-3 SeqFileWriter -
>>>> >> > stat:datacollec= tion.writer.hdfs dataSize=3D0 dataRate=3D0
>>>> >> > 2011-07-17 11:03:43,085 INFO Timer-1 root -<= br>>>>> >> > stats:ServletCollector,numberHTTPConnecti= on:0,numberchunks:0
>>>> >> > 2011-07-17 11:04:02,1= 74 INFO Timer-3 SeqFileWriter -
>>>> >> > stat:datacollection.writer.hdfs dataSize=3D0= dataRate=3D0
>>>> >> > 2011-07-17 11:04:32,180 INF= O Timer-3 SeqFileWriter -
>>>> >> > stat:datacollec= tion.writer.hdfs dataSize=3D0 dataRate=3D0
>>>> >> > 2011-07-17 11:04:43,096 INFO Timer-1 root -<= br>>>>> >> > stats:ServletCollector,numberHTTPConnecti= on:0,numberchunks:0
>>>> >> > 2011-07-17 11:05:02,1= 85 INFO Timer-3 SeqFileWriter -
>>>> >> > stat:datacollection.writer.hdfs dataSize=3D0= dataRate=3D0
>>>> >> >
>>>> >>= ; > (the collector and agent has =A0different =A0timezone)
>>&g= t;> >> > And the collector didn't collect any log.
>>>> >> >
>>>> >> >
>>= ;>> >> > What dons the "http chunks ACK'ed since la= st report: 0" means?
>>>> >> > And from this l= og "http chunks ACK'ed since last report: 0" appears
>>>> >> > to
>>>> >> > =A0agen= t crash, the chukwa port still on , but after several days,
>>>= > >> > both agents
>>>> >> > crashed wi= thout exceptions.
>>>> >> >
>>>> >> >
>>= ;>> >> > --
>>>> >> > Best regards,<= br>>>>> >> >
>>>> >> > Ivy Tan= g
>>>> >> >
>>>> >> >
>>= ;>> >> >
>>>> >>
>>>> &g= t;
>>>> >
>>>> >
>>>> &g= t; --
>>>> > Best regards,
>>>> > Ivy Tang
&g= t;>>> >
>>>> >
>>>> >
&g= t;>>
>>>
>>>
>>> --
>>&g= t; Best regards,
>>> Ivy Tang
>>>
>>>
>>
>&g= t;
>>
>> --
>> Best regards,
>> Ivy Tan= g
>>
>>
>>
>
>
>
> -- > Best regards,
> Ivy Tang
>
>
>
=



--
Best regards,

Ivy Tang






-- <= br>
Best regards,

Ivy Tang



--0016e64ec93e917a5804a906b01b--