Subject: Re: Flume loss data when collect online data to hdfs
From: Alex <zyacer@gmail.com>
Date: Thu, 22 Jan 2015 14:18:30 +0800
To: user@flume.apache.org

1: In agent1, there is a "regex_extractor" interceptor for extracting the header "dt":
#interceptors
agent1.sources.src_spooldir.interceptors.i1.type=regex_extractor
agent1.sources.src_spooldir.interceptors.i1.regex=(\\d{4}-\\d{2}-\\d{2}).*
agent1.sources.src_spooldir.interceptors.i1.serializers=s1
agent1.sources.src_spooldir.interceptors.i1.serializers.s1.name=dt
In agent2, the HDFS sink uses this header in its path; this is the configuration:
agent2.sinks.sink1.hdfs.path = hdfs://hnd.hadoop.jsh:8020/data/%{dt}
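
For example, here is a quick self-contained check of what that regex extracts (the class name and sample log line are made up for illustration):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DtRegexCheck {
    public static void main(String[] args) {
        // Same pattern as the interceptor config: in both the .properties
        // file and a Java string literal, "\\d" stands for the regex \d.
        Pattern p = Pattern.compile("(\\d{4}-\\d{2}-\\d{2}).*");
        String line = "2015-01-21 23:59:58 INFO some log message";
        Matcher m = p.matcher(line);
        if (m.matches()) {
            // regex_extractor would put group(1) into the "dt" header, so
            // this event is bucketed under .../data/2015-01-21 based on the
            // date inside the event body, no matter when it reaches agent2.
            System.out.println("dt = " + m.group(1));
        }
    }
}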

2: I misunderstood this property; thank you for the correction.

Thanks,
Alex


On 1/22/2015 12:51 PM, Hari Shreedharan wrote:
1: How do you guarantee that the data from the previous day has not spilled over to the next day? Where are you inserting the timestamp (if you are doing bucketing)?
2: Flume creates transactions for writes. Each batch defaults to 1000 events, which are written and flushed. There is still only one transaction per sink; the pool size is for IO ops.
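
(Schematically, that per-sink pattern looks like the following: a minimal sketch against Flume's public Channel/Transaction API, not the actual HDFS sink code, with an illustrative method name.)

import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.Transaction;

public class BatchDrainSketch {
    // One transaction wraps one whole batch of takes: either every event
    // in the batch is committed out of the channel, or none is.
    static int drainOnce(Channel channel, int batchSize) {
        Transaction tx = channel.getTransaction();
        tx.begin();
        try {
            int n = 0;
            while (n < batchSize) {
                Event e = channel.take();
                if (e == null) break;   // channel is empty for now
                // ... write the event to HDFS here ...
                n++;
            }
            tx.commit();                // flushes the batch as one unit
            return n;
        } catch (RuntimeException ex) {
            tx.rollback();              // the whole batch stays in the channel
            throw ex;
        } finally {
            tx.close();
        }
    }
}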

Thanks, 
Hari


On Wed, Jan 21, 2015 at 7:32 PM, Jay Alexander <zyacer@gmail.com> wrote:

First question: No. I run the query after all the files in HDFS have been closed; in fact, I count the data one day later.

Second question: I hadn't configured anything about transactions. But I saw this item in the HDFS sink documentation: "hdfs.threadsPoolSize (default 10): Number of threads per HDFS sink for HDFS IO ops (open, write, etc.)".
So I assumed there are 10 transactions per sink on the file channel.

Thanks.


2015-01-22 11:04 GMT+08:00 Hari Shreedharan <hshreedharan@cloudera.com>:
Are you accounting for the data still being written but not yet hflushed at the time of the query? Basically, one transaction per sink?
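
(Concretely, "not yet hflushed" means something like this at the HDFS client level; an illustrative sketch with a made-up file path:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream out =
                 fs.create(new Path("/data/2015-01-21/log.tmp"))) {
            out.writeBytes("2015-01-21 10:00:00 some event\n");
            // Until hflush() (or close()) is called, a concurrent reader
            // such as a Hive query may not see these bytes at all, so a
            // count taken while files are still open can come up short.
            out.hflush();
        }
    }
}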

Thanks, 
Hari


On Wed, Jan 21, 2015 at 6:42 PM, Jay Alexander <zyacer@gmail.com> wrote:

I used flume-ng version 1.5 to collect logs.

There are two agents in the data flow, each running on its own host.

The data is sent from agent1 to agent2.

The agents' components are as follows:

agent1: spooling dir source --> file channel --> avro sink
agent2: avro source --> file channel --> hdfs sink

But it seems to lose data, roughly 1 event in 1,000 out of millions. To diagnose the problem, I tried these steps:
  1. Looked through the agents' logs: no errors or exceptions found.
  2. Checked the agents' monitoring metrics: the number of events put into and taken from each channel is always equal.
  3. Counted the records with a Hive query and by reading the HDFS files from a shell script, respectively (see the sketch just below): the two counts are equal, but both are less than the online record count.
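
For the HDFS side of step 3, a minimal sketch of such a counter (the class name is made up, the example path mirrors the sink's bucket layout, and newline-delimited events are assumed, matching the DataStream fileType):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRecordCount {
    public static void main(String[] args) throws Exception {
        // Count newline-delimited events under one day's bucket, e.g.
        //   HdfsRecordCount hdfs://hnd.hadoop.jsh:8020/data/2015-01-21
        Configuration conf = new Configuration();
        Path dir = new Path(args[0]);
        FileSystem fs = dir.getFileSystem(conf);
        long count = 0;
        for (FileStatus st : fs.listStatus(dir)) {
            if (!st.isFile()) continue;  // skip sub-directories
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(fs.open(st.getPath())))) {
                while (r.readLine() != null) count++;
            }
        }
        System.out.println(dir + ": " + count + " events");
    }
}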

These are the two agents configuration:
#agent1
agent1.sources = src_spooldir
agent1.channels = chan_file
agent1.sinks = sink_avro

#source
agent1.sources.src_spooldir.type = spooldir
agent1.sources.src_spooldir.spoolDir = /data/logs/flume-spooldir
agent1.sources.src_spooldir.interceptors=i1

#interceptors
agent1.sources.src_spooldir.interceptors.i1.type=regex_extractor
agent1.sources.src_spooldir.interceptors.i1.regex=(\\d{4}-\\d{2}-\\d{2}).*
agent1.sources.src_spooldir.interceptors.i1.serializers=s1
agent1.sources.src_spooldir.interceptors.i1.serializers.s1.name=dt

#sink
agent1.sinks.sink_avro.type = avro
agent1.sinks.sink_avro.hostname = 10.235.2.212
agent1.sinks.sink_avro.port = 9910

#channel
agent1.channels.chan_file.type = file
agent1.channels.chan_file.checkpointDir = /data/flume/agent1/checkpoint
agent1.channels.chan_file.dataDirs = /data/flume/agent1/data

agent1.sources.src_spooldir.channels = chan_file
agent1.sinks.sink_avro.channel = chan_file



# agent2 
agent2.sources  = source1
agent2.channels = channel1 
agent2.sinks    = sink1 

# source
agent2.sources.source1.type     = avro
agent2.sources.source1.bind     = 10.235.2.212
agent2.sources.source1.port     = 9910

# sink
agent2.sinks.sink1.type = hdfs
agent2.sinks.sink1.hdfs.fileType = DataStream
agent2.sinks.sink1.hdfs.filePrefix = log
agent2.sinks.sink1.hdfs.path = hdfs://hnd.hadoop.jsh:8020/data/%{dt}
agent2.sinks.sink1.hdfs.rollInterval = 600
agent2.sinks.sink1.hdfs.rollSize = 0
agent2.sinks.sink1.hdfs.rollCount = 0
agent2.sinks.sink1.hdfs.idleTimeout = 300
agent2.sinks.sink1.hdfs.round = true
agent2.sinks.sink1.hdfs.roundValue = 10
agent2.sinks.sink1.hdfs.roundUnit = minute

# channel
agent2.channels.channel1.type   = file
agent2.channels.channel1.checkpointDir = /data/flume/agent2/checkpoint
agent2.channels.channel1.dataDirs = /data/flume/agent2/data
agent2.sinks.sink1.channel      = channel1
agent2.sources.source1.channels = channel1

Any suggestions are welcome! 




