incubator-chukwa-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Eric Yang <eric...@gmail.com>
Subject Re: chukwa agent doesn't collect the log suddenly , and after several days ,the agent crashes.
Date Fri, 29 Jul 2011 16:00:27 GMT
Hi Ivy,

In Http Sender, there is a throttle to sleep 5 seconds, if the data sending frequency is high.
 If the lastPosted time is more recent than the current time, then the value becomes negative.
 This could happen when ntp makes time correction.  When sleeping on negative millisecond,
then it throws exception, and prevents agent to send data to collector.  I filed a JIRA (CHUKWA-595)
for this, thanks for catching this.  :)

regards,
Eric

On Jul 28, 2011, at 11:59 PM, Ying Tang wrote:

> Hi Eric,
>     I meet with an exception.
>  
>   Exception in thread "HTTP post thread" java.lang.IllegalArgumentException: timeout
value is negative
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.chukwa.datacollection.connector.http.HttpConnector.run(HttpConnector.java:187)
>         at java.lang.Thread.run(Thread.java:636)
>  
> And the log " INFO Timer-1 HttpConnector - # http chunks ACK'ed since last report: 0
" appears again.
> On Thu, Jul 28, 2011 at 3:23 AM, Eric Yang <eric818@gmail.com> wrote:
> We need two numbers to track rotated logs properly without sending duplicates.  A number
is the current file offset, and another number is the incremental sequence number.  The checkpoint
file only preserves the incremental sequence number.  The current file offset is not tracked.
 If agent crashes, agent will stream from offset 0 of the current file.  Hence, there maybe
duplicate entries, when system recovers from agent crash.  I tested this, after file rotated,
it is sending the new data from the current log file, but the file current offset is tracked
in memory only.  It is not visible from port 9093.
> 
> This issue is not directly related to the issue that you experienced which the file tailing
adaptor.  From the code, it is still not clear why agent crashed.  However, I would recommend
to use Chukwa trunk with SocketAdaptor with Log4JSocketAppender.  It is more reliable than
UTF8 adaptor.
> 
> See: http://wiki.apache.org/hadoop/Chukwa_Quick_Start
> 
> for installation instructions.
> 
> regards,
> Eric
> 
> On Jul 26, 2011, at 10:54 PM, Ying Tang wrote:
> 
> > And after i restart the chukwa agent , the telnet adaptor last number is still 10487067.
> >
> > On Wed, Jul 27, 2011 at 1:52 PM, Ying Tang <ivytang0812@gmail.com> wrote:
> > Does this mean the adaptor still transfer the previous log ? The current log is
missing?
> >
> >
> > On Wed, Jul 27, 2011 at 1:08 PM, Eric Yang <eric818@gmail.com> wrote:
> > This looks like a bug, the last number should be in sync with the
> > current file's size, but the UTF adaptor is still tailing the previous
> > file (which rotated at 10487067)
> > It means there is a bug in handling the file rotation, but the adaptor
> > did not pick up the change.
> >
> > Please open a jira.  Thanks
> >
> > regards,
> > Eric
> >
> > On Tue, Jul 26, 2011 at 8:05 PM, Ying Tang <ivytang0812@gmail.com> wrote:
> > > The log didn't rotate very  rapidly.
> > >
> > > Now i can't rebuild the scenario . But when the chukwa agent log looks ok,
> > >
> > >  2011-07-27 10:57:38,967 INFO Timer-0 ChukwaAgent - writing checkpoint
> > > 1307083
> > > 2011-07-27 10:57:42,571 INFO HTTP post thread ChukwaHttpSender - collected
1
> > > chunks for post_745
> > > 2011-07-27 10:57:42,571 INFO HTTP post thread ChukwaHttpSender - >>>>>>
HTTP
> > > post_745 to http://chukwacollector1.xingcloud.com:9095/ length = 1837
> > > 2011-07-27 10:57:42,574 INFO HTTP post thread ChukwaHttpSender - >>>>>>
HTTP
> > > Got success back from http://chukwacollector1.xingcloud.com:9095/chukwa;
> > > response length 43
> > > 2011-07-27 10:57:42,574 INFO HTTP post thread ChukwaHttpSender - post_745
> > > sent 0 chunks, got back 1 acks
> > >
> > > The list in telnet agent 9093 is:
> > > adaptor_2963225a90653a309cf779d4a1d815a3)
> > > org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.CharFileTailingAdaptorUTF8
> > > Gamelog 0 /var/log/gamelog 10487067
> > > After several minites ,  the list is still
> > > adaptor_2963225a90653a309cf779d4a1d815a3)
> > > org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.CharFileTailingAdaptorUTF8
> > > Gamelog 0 /var/log/gamelog 10487067
> > >
> > > Is the 10487067 the offset number ?The number didn't changed , and the log
> > > file's size is from 0 to 10M .And now the log file's size is 1150872.
> > >
> > > On Wed, Jul 27, 2011 at 12:26 AM, Eric Yang <eric818@gmail.com> wrote:
> > >>
> > >> CharFileTailingAdaptorUTF should handle log rotation gracefully.  Is the
> > >> log rotating rapidly?
> > >> Run those command on chukwa agent:
> > >> telnet localhost 9093
> > >> list
> > >> This should show a list of tailing files, and check the offset number of
> > >> the tailing log file.  The most right number should be smaller than the
size
> > >> of your log file.  If it is bigger and not changing, it is most likely
there
> > >> is a bug that we haven't seen before.  It might be useful to turn on debug
> > >> on chukwa agent and see if this can be reproduced to nail down the root
> > >> cause.  Thanks
> > >> regards,
> > >> Eric
> > >> On Jul 26, 2011, at 6:13 AM, Ying Tang wrote:
> > >>
> > >> Is there the possibility that
> > >> when the log file reaches the log4g config file size ,the log4j will
> > >> rename this log file and create a new file with this name as the log file
.
> > >> At the time ,the chukwa adaptor doesn't tail the log properly , and this
> > >> cause the chuwa agent can't collector the log any more.
> > >>
> > >> On Tue, Jul 26, 2011 at 2:07 PM, Ying Tang <ivytang0812@gmail.com>
wrote:
> > >>>
> > >>> The log file is log4j log file ,and the size is 10M ,the maxbackupindex
> > >>> is 1.
> > >>>
> > >>>
> > >>> On Tue, Jul 26, 2011 at 1:42 PM, Eric Yang <eric818@gmail.com>
wrote:
> > >>>>
> > >>>> Can you run "ls -l" to show the size and dateof the log files that
you
> > >>>> are streaming?
> > >>>>
> > >>>> regards,
> > >>>> Eric
> > >>>>
> > >>>> On Mon, Jul 25, 2011 at 7:36 PM, Ying Tang <ivytang0812@gmail.com>
> > >>>> wrote:
> > >>>> > The chukwa version is 0.4.0 and the adaptor is
> > >>>> >
> > >>>> > org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.CharFileTailingAdaptorUTF8
> > >>>> >
> > >>>> > On Mon, Jul 25, 2011 at 11:50 PM, Eric Yang <eric818@gmail.com>
wrote:
> > >>>> >>
> > >>>> >> Hi Ivy,
> > >>>> >>
> > >>>> >> When data is send from agent to collector, collector send
> > >>>> >> acknowledgment
> > >>>> >> of receiving of the chunks.  At 00:03:28, there are 5
chunks
> > >>>> >> acknowledged.
> > >>>> >>  This means communication between collector and agent
are working at
> > >>>> >> that
> > >>>> >> point in time.  However, there is no activity after 00:04:28.
 This
> > >>>> >> looks
> > >>>> >> like adaptor did not handle the log rotation properly
at close to
> > >>>> >> midnight.
> > >>>> >>  Which version of Chukwa are you using and which adaptor
are you
> > >>>> >> using?
> > >>>> >>
> > >>>> >> regards,
> > >>>> >> Eric
> > >>>> >>
> > >>>> >> On Jul 25, 2011, at 12:40 AM, Ying Tang wrote:
> > >>>> >>
> > >>>> >> > Hi all,
> > >>>> >> >
> > >>>> >> > In my cluster , i have two chukwa agent and one collector
.
> > >>>> >> > At a time ,  both chukwa agents's log :
> > >>>> >> > 2011-07-18 00:03:28,688 INFO Timer-1 HttpConnector
- # http chunks
> > >>>> >> > ACK'ed since last report: 5
> > >>>> >> > 2011-07-18 00:04:28,697 INFO Timer-1 HttpConnector
- # http chunks
> > >>>> >> > ACK'ed since last report: 0
> > >>>> >> > 2011-07-18 00:05:28,706 INFO Timer-1 HttpConnector
- # http chunks
> > >>>> >> > ACK'ed since last report: 0
> > >>>> >> > 2011-07-18 00:06:28,714 INFO Timer-1 HttpConnector
- # http chunks
> > >>>> >> > ACK'ed since last report: 0
> > >>>> >> > 2011-07-18 00:07:29,340 INFO Timer-1 HttpConnector
- # http chunks
> > >>>> >> > ACK'ed since last report: 0
> > >>>> >> >
> > >>>> >> > And the collector
> > >>>> >> > 2011-07-17 11:02:32,155 INFO Timer-3 SeqFileWriter
-
> > >>>> >> > stat:datacollection.writer.hdfs dataSize=0 dataRate=0
> > >>>> >> > 2011-07-17 11:02:43,074 INFO Timer-1 root -
> > >>>> >> > stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> > >>>> >> > 2011-07-17 11:03:02,162 INFO Timer-3 SeqFileWriter
-
> > >>>> >> > stat:datacollection.writer.hdfs dataSize=0 dataRate=0
> > >>>> >> > 2011-07-17 11:03:32,168 INFO Timer-3 SeqFileWriter
-
> > >>>> >> > stat:datacollection.writer.hdfs dataSize=0 dataRate=0
> > >>>> >> > 2011-07-17 11:03:43,085 INFO Timer-1 root -
> > >>>> >> > stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> > >>>> >> > 2011-07-17 11:04:02,174 INFO Timer-3 SeqFileWriter
-
> > >>>> >> > stat:datacollection.writer.hdfs dataSize=0 dataRate=0
> > >>>> >> > 2011-07-17 11:04:32,180 INFO Timer-3 SeqFileWriter
-
> > >>>> >> > stat:datacollection.writer.hdfs dataSize=0 dataRate=0
> > >>>> >> > 2011-07-17 11:04:43,096 INFO Timer-1 root -
> > >>>> >> > stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> > >>>> >> > 2011-07-17 11:05:02,185 INFO Timer-3 SeqFileWriter
-
> > >>>> >> > stat:datacollection.writer.hdfs dataSize=0 dataRate=0
> > >>>> >> >
> > >>>> >> > (the collector and agent has  different  timezone)
> > >>>> >> > And the collector didn't collect any log.
> > >>>> >> >
> > >>>> >> >
> > >>>> >> > What dons the "http chunks ACK'ed since last report:
0" means?
> > >>>> >> > And from this log "http chunks ACK'ed since last
report: 0" appears
> > >>>> >> > to
> > >>>> >> >  agent crash, the chukwa port still on , but after
several days,
> > >>>> >> > both agents
> > >>>> >> > crashed without exceptions.
> > >>>> >> >
> > >>>> >> >
> > >>>> >> > --
> > >>>> >> > Best regards,
> > >>>> >> >
> > >>>> >> > Ivy Tang
> > >>>> >> >
> > >>>> >> >
> > >>>> >> >
> > >>>> >>
> > >>>> >
> > >>>> >
> > >>>> >
> > >>>> > --
> > >>>> > Best regards,
> > >>>> > Ivy Tang
> > >>>> >
> > >>>> >
> > >>>> >
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Best regards,
> > >>> Ivy Tang
> > >>>
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> Best regards,
> > >> Ivy Tang
> > >>
> > >>
> > >>
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Ivy Tang
> > >
> > >
> > >
> >
> >
> >
> > --
> > Best regards,
> >
> > Ivy Tang
> >
> >
> >
> >
> >
> >
> > --
> > Best regards,
> >
> > Ivy Tang
> >
> >
> >
> 
> 
> 
> 
> -- 
> Best regards,
> 
> Ivy Tang
> 
> 
> 


Mime
View raw message