Subject: Re: Import files from a directory on remote machine
From: Otis Gospodnetic <otis.gospodnetic@gmail.com>
Date: Wed, 23 Apr 2014 09:48:18 -0400
To: user@flume.apache.org
Hi Jeff,

On Thu, Apr 17, 2014 at 1:11 PM, Jeff Lord <jlord@cloudera.com> wrote:

> Using the exec source with a tail -f is not considered a production solution.
> It mainly exists for testing purposes.

This statement surprised me. Is that the general consensus among Flume developers or users or at Cloudera?

Is there an alternative recommended for production that provides equivalent functionality?

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
On Thu, Apr 17, 2014 at 7:03 AM, Laurance George <laurance.w.george@gmail.com> wrote:

If you can NFS mount that directory to your local machine with Flume, it sounds like what you've listed out would work well.
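For anyone trying the NFS route, a rough sketch of the mount step; the mount point and options below are made up for illustration, and the remote host has to export the directory first:

# On the machine running the Flume agent, mount the remote log directory read-only
# (assumes machinename exports /var/log/logdir via NFS)
sudo mkdir -p /mnt/remote-logs
sudo mount -t nfs -o ro machinename:/var/log/logdir /mnt/remote-logs

# A tail -F exec source can then point at files under /mnt/remote-logs.
# A read-only mount would not suit the spooling directory source, which
# needs write access to rename files it has finished ingesting.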
On Thu, Apr 17, 2014 at 2:54 AM, Something Something <mailinglists19@gmail.com> wrote:

If I am going to 'rsync' a file from the remote host & copy it to HDFS via Flume, then why use Flume? I can rsync & then just do a 'hadoop fs -put', no? I must be missing something. I guess the only benefit of using Flume is that I can add Interceptors if I want to. Current requirements don't need that. We just want to copy data as is.

Here's the real use case: an application is writing to an xyz.log file. Once this file gets over a certain size it gets rolled over to xyz1.log & so on, kind of like Log4j. What we really want is that as soon as a line gets written to xyz.log, it should go to HDFS via Flume.

Can I do something like this?

1) Share the log directory under Linux.
2) Use

test1.sources.mylog.type = exec
test1.sources.mylog.command = tail -F /home/user1/shares/logs/xyz.log

I believe this will work, but is this the right way? Thanks for your help.
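For reference, a complete agent built around that two-line exec snippet could look roughly like the following. This is only a sketch: the channel and sink names, the HDFS path, and the capacity and roll settings are invented for illustration, and the exec source still carries the caveat Jeff raises above, since events sitting in the memory channel are lost if the agent dies.

# Hypothetical agent "test1": exec source -> memory channel -> HDFS sink
test1.sources = mylog
test1.channels = memch
test1.sinks = hdfssink

# Source from the thread: tail the shared log file (no delivery guarantees)
test1.sources.mylog.type = exec
test1.sources.mylog.command = tail -F /home/user1/shares/logs/xyz.log
test1.sources.mylog.channels = memch

# In-memory channel; capacity numbers are placeholders
test1.channels.memch.type = memory
test1.channels.memch.capacity = 10000
test1.channels.memch.transactionCapacity = 100

# HDFS sink; path and roll interval are placeholders
test1.sinks.hdfssink.type = hdfs
test1.sinks.hdfssink.channel = memch
test1.sinks.hdfssink.hdfs.path = hdfs://namenode:8020/flume/xyz-logs
test1.sinks.hdfssink.hdfs.fileType = DataStream
test1.sinks.hdfssink.hdfs.rollInterval = 300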
On Wed, Apr 16, 2014 at 5:51 PM, Laurance George <laurance.w.george@gmail.com> wrote:

Agreed with Jeff. Rsync + cron (if it needs to be regular) is probably your best bet to ingest files from a remote machine that you only have read access to. But then again, you're sort of stepping outside the use case of Flume at some level here, as rsync is now basically a part of your Flume topology. However, if you just need to back-fill old log data then this is perfect! In fact, it's what I do myself.
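A rough sketch of that rsync-plus-cron idea; every path, schedule, and filename below is invented for illustration. The landing directory could feed a Flume spooling directory source, or, as mentioned above in the thread, skip Flume entirely and go straight to HDFS with 'hadoop fs -put':

#!/bin/sh
# pull-logs.sh (hypothetical): copy the rolled logs from the read-only remote directory
rsync -av username@machinename:/var/log/logdir/ /var/flume/incoming/

# Example crontab entry to run it every five minutes:
#   */5 * * * * /usr/local/bin/pull-logs.sh

# Back-fill alternative without Flume: push the copies straight to HDFS
#   hadoop fs -put /var/flume/incoming/xyz1.log /logs/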
On Wed, Apr 16, 2014 at 8:46 PM, Jeff Lord <jlord@cloudera.com> wrote:

The spooling directory source runs as part of the agent.
The source also needs write access to the files as it renames them upon completion of ingest. Perhaps you could use rsync to copy the files somewhere that you have write access to?
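For later readers, a minimal spooling directory source definition might look like this; the agent name, channel name, and directory are placeholders, and the user guide link Jeff posted below has the full option list:

# Hypothetical agent "a1": watch a local directory that rsync fills
a1.sources = spool
a1.channels = c1

a1.sources.spool.type = spooldir
a1.sources.spool.spoolDir = /var/flume/incoming
a1.sources.spool.fileHeader = true
a1.sources.spool.channels = c1

# Durable file channel rather than memory, since this is for production use
a1.channels.c1.type = file

# Ingested files are renamed with a .COMPLETED suffix by default,
# which is why the source needs write access to the directory.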
On Wed, Apr 16, 2014 at 5:26 PM, Something Something <mailinglists19@gmail.com> wrote:

Thanks Jeff. This is useful. Can the spoolDir be on a different machine? We may have to set up a different process to copy files into 'spoolDir', right? Note: we have 'read only' access to these files. Any recommendations about this?
On Wed, Apr 16, 2014 at 5:16 PM, Jeff Lord <jlord@cloudera.com> wrote:

http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source
On Wed, Apr 16, 2014 at 5:14 PM, Something Something <mailinglists19@gmail.com> wrote:

Hello,

Needless to say I am a newbie to Flume, but I've got a basic flow working in which I am importing a log file from my Linux box to HDFS. I am using

a1.sources.r1.command = tail -F /var/log/xyz.log

which is working like a stream of messages. This is good!

Now what I want to do is copy log files from a directory on a remote machine on a regular basis. For example:

username@machinename:/var/log/logdir/<multiple files>

One way to do it is to simply 'scp' files from the remote directory into my box on a regular basis, but what's the best way to do this in Flume? Please let me know.

Thanks for the help.
--
Laurance George