Subject: Re: Flume loss data when collect online data to hdfs
From: Alex <zyacer@gmail.com>
Date: Thu, 22 Jan 2015 14:18:30 +0800
To: user@flume.apache.org

1: In agent1, there is a "regex_extractor" interceptor for extracting the header "dt":
#interceptors
agent1.sources.src_spooldir.interceptors.i1.type=regex_extractor
agent1.sources.src_spooldir.interceptors.i1.regex=(\\d{4}-\\d{2}-\\d{2}).*
agent1.sources.src_spooldir.interceptors.i1.serializers=s1
agent1.sources.src_spooldir.interceptors.i1.serializers.s1.name=dt
In agent2, the HDFS sink uses this header in its path; this is the configuration:
agent2.sinks.sink1.hdfs.path = hdfs://hnd.hadoop.jsh:8020/data/%{dt}
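
For example, here is a quick self-contained check of what that regex extracts (the class name and sample log line are made up for illustration):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DtRegexCheck {
    public static void main(String[] args) {
        // Same pattern as the interceptor config: in both the .properties
        // file and a Java string literal, "\\d" stands for the regex \d.
        Pattern p = Pattern.compile("(\\d{4}-\\d{2}-\\d{2}).*");
        String line = "2015-01-21 23:59:58 INFO some log message";
        Matcher m = p.matcher(line);
        if (m.matches()) {
            // regex_extractor would put group(1) into the "dt" header, so
            // this event is bucketed under .../data/2015-01-21 based on the
            // date inside the event body, no matter when it reaches agent2.
            System.out.println("dt = " + m.group(1));
        }
    }
}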

2: I misunderstood this property; thank you for the correction.

Thanks,
Alex


On 1/22/2015 12:51 PM, Hari Shreedharan wrote:
1: How do you guarantee that the data from the previous day has not spilled over to the next day? Where are you inserting the timestamp (if you are doing bucketing)?
2: Flume creates transactions for writes. Each batch defaults to 1000 events, which are written and flushed. There is still only one transaction per sink; the pool size is for IO ops.
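
(Schematically, that per-sink pattern looks like the following: a minimal sketch against Flume's public Channel/Transaction API, not the actual HDFS sink code, with an illustrative method name.)

import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.Transaction;

public class BatchDrainSketch {
    // One transaction wraps one whole batch of takes: either every event
    // in the batch is committed out of the channel, or none is.
    static int drainOnce(Channel channel, int batchSize) {
        Transaction tx = channel.getTransaction();
        tx.begin();
        try {
            int n = 0;
            while (n < batchSize) {
                Event e = channel.take();
                if (e == null) break;   // channel is empty for now
                // ... write the event to HDFS here ...
                n++;
            }
            tx.commit();                // flushes the batch as one unit
            return n;
        } catch (RuntimeException ex) {
            tx.rollback();              // the whole batch stays in the channel
            throw ex;
        } finally {
            tx.close();
        }
    }
}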

Thanks, 
Hari


On Wed, Jan 21, 2015 at 7:32 PM, Jay Alexander <zyacer@gmail.com> wrote:

First question: No. I run the query after all the files in HDFS have been closed; in fact, I count the data one day later.

Second question: I hadn't configured anything about transactions. But I saw this item in the HDFS sink documentation: "hdfs.threadsPoolSize (default 10): Number of threads per HDFS sink for HDFS IO ops (open, write, etc.)".
So I assumed there are 10 transactions per sink on the file channel.

Thanks.


2015-01-22 11:04 GMT+08:00 Hari Shreedharan <hshreedharan@cloudera.com>:
Are you accounting for the data still being written but not yet hflushed at the time of the query? Basically, one transaction per sink?
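
(Concretely, "not yet hflushed" means something like this at the HDFS client level; an illustrative sketch with a made-up file path:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream out =
                 fs.create(new Path("/data/2015-01-21/log.tmp"))) {
            out.writeBytes("2015-01-21 10:00:00 some event\n");
            // Until hflush() (or close()) is called, a concurrent reader
            // such as a Hive query may not see these bytes at all, so a
            // count taken while files are still open can come up short.
            out.hflush();
        }
    }
}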

Thanks, 
Hari


On Wed, Jan 21, 2015 at 6:42 PM, Jay Alexander <zyacer@gmail.com> wrote:

I used flume-ng version 1.5 to collect logs.

There are two agents in the data flow, each running on its own host.

The data is sent from agent1 to agent2.

The agents' components are as follows:

agent1: spooling dir source --> file channel --> avro sink
agent2: avro source --> file channel --> hdfs sink

But it seems to lose data, roughly 1 event in 1,000 out of millions. To diagnose the problem, I tried these steps:
  1. Looked through the agents' logs: no errors or exceptions found.
  2. Checked the agents' monitoring metrics: the number of events put into and taken from each channel is always equal.
  3. Counted the records with a Hive query and by reading the HDFS files from a shell script, respectively (see the sketch just below): the two counts are equal, but both are less than the online record count.
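
For the HDFS side of step 3, a minimal sketch of such a counter (the class name is made up, the example path mirrors the sink's bucket layout, and newline-delimited events are assumed, matching the DataStream fileType):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRecordCount {
    public static void main(String[] args) throws Exception {
        // Count newline-delimited events under one day's bucket, e.g.
        //   HdfsRecordCount hdfs://hnd.hadoop.jsh:8020/data/2015-01-21
        Configuration conf = new Configuration();
        Path dir = new Path(args[0]);
        FileSystem fs = dir.getFileSystem(conf);
        long count = 0;
        for (FileStatus st : fs.listStatus(dir)) {
            if (!st.isFile()) continue;  // skip sub-directories
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(fs.open(st.getPath())))) {
                while (r.readLine() != null) count++;
            }
        }
        System.out.println(dir + ": " + count + " events");
    }
}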

These are the two agents configuration:
#agent1
agent1.sources = src_spooldir
agent1.channels = chan_file
agent1.sinks = sink_avro

#source
agent1.sources.src_spooldir.type = spooldir
agent1.sources.src_spooldir.spoolDir = /data/logs/flume-spooldir
agent1.sources.src_spooldir.interceptors=i1

#interceptors
agent1.sources.src_spooldir.interceptors.i1.type=regex_extractor
agent1.sources.src_spooldir.interceptors.i1.regex=(\\d{4}-\\d{2}-\\d{2}).*
agent1.sources.src_spooldir.interceptors.i1.serializers=s1
agent1.sources.src_spooldir.interceptors.i1.serializers.s1.name=dt

#sink
agent1.sinks.sink_avro.type = avro
agent1.sinks.sink_avro.hostname = 10.235.2.212
agent1.sinks.sink_avro.port = 9910

#channel
agent1.channels.chan_file.type = file
agent1.channels.chan_file.checkpointDir = /data/flume/agent1/checkpoint
agent1.channels.chan_file.dataDirs = /data/flume/agent1/data

agent1.sources.src_spooldir.channels = chan_file
agent1.sinks.sink_avro.channel = chan_file



# agent2 
agent2.sources  = source1
agent2.channels = channel1 
agent2.sinks    = sink1 

# source
agent2.sources.source1.type     = avro
agent2.sources.source1.bind     = 10.235.2.212
agent2.sources.source1.port     = 9910

# sink
agent2.sinks.sink1.type = hdfs
agent2.sinks.sink1.hdfs.fileType = DataStream
agent2.sinks.sink1.hdfs.filePrefix = log
agent2.sinks.sink1.hdfs.path = hdfs://hnd.hadoop.jsh:8020/data/%{dt}
agent2.sinks.sink1.hdfs.rollInterval = 600
agent2.sinks.sink1.hdfs.rollSize = 0
agent2.sinks.sink1.hdfs.rollCount = 0
agent2.sinks.sink1.hdfs.idleTimeout = 300
agent2.sinks.sink1.hdfs.round = true
agent2.sinks.sink1.hdfs.roundValue = 10
agent2.sinks.sink1.hdfs.roundUnit = minute

# channel
agent2.channels.channel1.type   = file
agent2.channels.channel1.checkpointDir = /data/flume/agent2/checkpoint
agent2.channels.channel1.dataDirs = /data/flume/agent2/data
agent2.sinks.sink1.channel      = channel1
agent2.sources.source1.channels = channel1

Any suggestions are welcome! 




