nifi-users mailing list archives

From Simon Elliston Ball <si...@simonellistonball.com>
Subject Re: Use Case...Please help
Date Mon, 23 May 2016 12:55:11 GMT
Hi Deepak,

It looks like your flow is heading in the right kind of direction, so I suspect there’s something
about the path that isn’t working out. One solution would be to use a mapped drive on your
machine, which makes it a little simpler; however, it would be nice if we could get it working
with the UNC path as well. 
Are you getting any validation messages on the ListFile processor, either on the bulletin
board in NiFi or in the nifi-app.log file? 

Note that you will have to be connected to the drive to ensure you have credentials, or have
your NiFi user be able to connect to that drive with its Windows credentials. There isn’t
currently a means to provide authentication per share in the processor, but NiFi should inherit
the credential context of whichever user is running the NiFi process. 
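
For example, a minimal ListFile configuration for this case might look like the following (the share path is the one from your message; the File Filter shown is just the processor's default pattern):

```
ListFile
  Input Directory:        \\btc7n001\Ongoing-MR\MRI\Deepak
  Recurse Subdirectories: true
  File Filter:            [^\.].*
```

If the UNC path still fails validation, pointing Input Directory at the mapped drive letter instead is a quick way to tell whether the problem is the path syntax or the credentials.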

Hope that helps, 
Simon


> On 22 May 2016, at 18:40, Tripathi, Shiv Deepak <shiv.deepak.tripathi@philips.com>
wrote:
> 
> Hi Mark,
>  
> In order to implement Apache NiFi, I downloaded the Hortonworks sandbox and installed Apache NiFi
on it. It's working fine in the scenario below.
>  
> Scenario 1: My input directory is in the local file system on HDP (screenshot name “listfilelocaldir”)
and the output is on the HDFS file system.
>  
> For all processors in the dataflow, please see the screenshot – “HDP sandbox local to HDFS”
>  
> Scenario 2: Could you please tell me which processors, and in what order, I need to use
if I want to send files from \\btc7n001\Ongoing-MR\MRI\Deepak (a password-enabled network drive
mapped to my machine) to the HDP cluster created in VMware Player?
>  
> It's not recognizing the input directory at all. Please see the screenshot named “Usecaseinputdir.jpeg”.
>  
> Please help me.
>  
> Thanks,
> Deepak
>  
>  
> From: Mark Payne [mailto:markap14@hotmail.com] 
> Sent: Monday, May 16, 2016 6:19 PM
> To: users@nifi.apache.org
> Subject: Re: Use Case...Please help
>  
> Deepak,
>  
> Yes, you should be able to do so.
>  
> Thanks
> -Mark
>  
> On May 16, 2016, at 8:44 AM, Tripathi, Shiv Deepak <shiv.deepak.tripathi@philips.com> wrote:
>  
> Thanks a lot Mark.
>  
> Looking forward to trying it out.
>  
> If I understood correctly, then I can drop the log-copying script and the staging machine and
pull the logs directly from the repository.
>  
> Please confirm.
>  
> Thanks,
> Deepak
>  
> From: Mark Payne [mailto:markap14@hotmail.com] 
> Sent: Monday, May 16, 2016 5:06 PM
> To: users@nifi.apache.org
> Subject: Re: Use Case...Please help
>  
> Deepak,
>  
> Thanks for providing such a detailed description of your use case. I think NiFi would
be an excellent
> tool to help you out here!
>  
> As I mentioned before, you would typically use ListFile -> FetchFile to pull the data
in. Clearly, here,
> though, you want to be more selective about what you pull in. You can accomplish this
by using a
> RouteOnAttribute processor. So you'd have something like: ListFile -> RouteOnAttribute
-> FetchFile.
> The RouteOnAttribute processor is very powerful and allows you to configure how to route
each piece
> of data based on whatever attributes are available. The ListFile Processor adds the following
attributes
> to each piece of data that it pulls in:
>  
> filename (name of file)
> path (relative path of file)
> absolute.path (absolute directory of file)
> fs.owner (owner of the file)
> fs.group (group that the file belongs to)
> fs.lastModified (last modified date)
> fs.length (file length)
> fs.permissions (file permissions, such as rw-rw-r--)
>  
> From these, you can make all sorts of routing decisions, based on name, timestamp, etc.
You can choose
> to terminate data that does not meet your criteria.
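> As a sketch, the routing rules could look like the following (the dynamic property names and the
> regular expressions are illustrative assumptions, based on the log*.zip / XML split you described):
>  
> ```
> RouteOnAttribute
>   Routing Strategy:        Route to Property name
>   logs (dynamic property): ${filename:matches('log.*\.zip')}
>   xml  (dynamic property): ${filename:endsWith('.xml')}
> ```
>  
> Each FlowFile is then transferred to the 'logs' or 'xml' relationship (or to 'unmatched', which
> you can auto-terminate to drop everything else).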
>  
> When you use FetchFile, you have the option of deleting the source file, moving it elsewhere,
or leaving
> it as-is. So you wouldn't need to delete it if you don't want to. This is possible because
ListFile keeps track
> of what has been 'listed'. So it won't ingest duplicate data, but it will pick up new
files (if any existing
> file is modified, it will pick up the new version of the file.)
>  
> You can then use UnpackContent if you want to unzip the data, or you can leave it zipped.
After the FetchFile,
> you can also use a RouteOnAttribute processor to separate out the XML from the log files
and put those to
> different directories in HDFS.
>  
> Does this sound like it will provide you all that you need?
>  
> Thanks
> -Mark
>  
>  
>  
> On May 16, 2016, at 3:06 AM, Tripathi, Shiv Deepak <shiv.deepak.tripathi@philips.com> wrote:
>  
> Hi Mark,
>  
> I am very happy to see the detailed reply; thank you. I'll explain more about my use case below.
>  
> 1-      Screenshot Name → “Stagingdirectory_copiedfiles”
>  
> The log-copying script copies the logs into the staging directory, which in my case is “D:\PaiValidation”,
and maintains multiple folders. These folders are simply device serial numbers. Every serial number has
multiple log files and XML files, and each day one new log file arrives in this directory, as you can see.
>  
> In the log-copy script we define how many days of logs we want. So let's say we pass 360: it will
copy logs from the last 360 days, and since it runs continuously, 10 days after you first pass that
configuration it will have 360 days of logs (counted from when you passed the parameter) + 10 days
= 370, and growing day by day.
>  
> And after pushing the files to the cluster, we rename them or create 0-byte dummy files, as you can
see in the screenshot.
>  
> We also pass one more parameter which specifies the device serial numbers whose logs we want, rather
than taking logs from all devices.
>  
> 2-      The source repository
> Screenshot Name → “Repository files”
>  
> This is the actual repository from which we take the logs and copy them to the staging directory.
These are incoming logs from the devices, and every serial number has multiple types of files, as you
can see in the screenshot. We need only the log files matching the log************.zip pattern and the
XML files; the rest we will not pick up. Also, we can't delete these logs from the repository.
>  
>  
> 3-      HDFS directory
>  
> From the staging directory, Flume moves files to our on-premise HDFS cluster.
> Screenshot Name → HDFS1
>  
> You can see two highlighted folders in this screenshot: one has only log files, the other has XML files.
> If you go back and look at “Stagingdirectory_copiedfiles”, you will find XML and log files under the
same device serial number, which we store separately in the cluster.
>  
> Screenshot Name → hdfs2
>  
> Logs will be stored under the same directory structure as in staging, for both the XML files and the
log files.
>  
>  
>  
> So if I want to accomplish the above goals, would NiFi be the best solution?
>  
> If I use NiFi directly against the repository to pull the logs, will I be able to do these few things:
>  
> 1-      It should not copy duplicate logs, since we will be deleting logs from the destination.
> 2-      It should only copy the logs of the last N days (say the last 20 or 50 days), and if new
logs arrive in the directory each day, it should pick those up too.
> 3-      It should not delete any logs from the source repository.
> 4-      It should copy the specified logs into one directory in HDFS and the XML files into another.
>  
> In that case we could drop the script altogether.
>  
> Hoping for best.
>  
> Thanks,
> Deepak
>  
>  
>  
> From: Mark Payne [mailto:markap14@hotmail.com] 
> Sent: Monday, May 16, 2016 1:25 AM
> To: users@nifi.apache.org
> Subject: Re: Use Case...Please help
>  
> Hi Deepak,
>  
> Certainly, this is something that you could use NiFi for. We often see people using NiFi
to sync data from
> a directory on local disk to a directory in HDFS. This is typically accomplished by using
a flow like:
>  
> ListFile -> FetchFile -> PutHDFS
>  
> You can then create a file in the source directory with the same name by using ReplaceText
to set the content
> to nothing and then PutFile to write out the 0-byte content. So the flow would look like:
>  
> ListFile -> FetchFile -> PutHDFS -> ReplaceText -> PutFile
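>  
> A sketch of the marker-file part of that flow (the property names are taken from the standard
> processors, but the exact names and values here are illustrative and may vary by NiFi version):
>  
> ```
> ReplaceText
>   Replacement Strategy:         Always Replace
>   Replacement Value:            (empty string)
>   Evaluation Mode:              Entire text
> PutFile
>   Directory:                    ${absolute.path}
>   Conflict Resolution Strategy: replace
> ```
>  
> Because FetchFile leaves the ListFile attributes on the FlowFile, ${absolute.path} still points at
> the original source directory, so the 0-byte file lands where the real file was listed.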
>  
> PutHDFS has a "Directory" property. If you set this value to "${path}" it will use the
same directory structure that
> ListFile found the file to be in when it performed the listing. I.e., if you set ListFile
to pull from /data/mydir
> and "Recurse Subdirectories" to true, then any file found in /data/mydir will have a
'path' of './' and anything found in
> /data/mydir/subdir1 will have a path of './subdir1'. If you would rather have the fully
qualified path (/data/mydir/subdir1)
> you would use "${absolute.path}" instead of "${path}".
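>  
> A concrete example of the difference, using the example directories above (illustrative):
>  
> ```
> ListFile     Input Directory: /data/mydir, Recurse Subdirectories: true
> Listed file: /data/mydir/subdir1/a.log
>     path          = ./subdir1
>     absolute.path = /data/mydir/subdir1
> PutHDFS      Directory: /user/hdfs/${path}  ->  writes /user/hdfs/subdir1/a.log
> ```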
>  
> One thing that I find curious about your scenario though is the concept of a 'log copy
script' and then putting back
> a 0-byte file so that the script does not pick up the data again. Why not just use NiFi
to pull directly from the source
> and avoid using a script all together? The ListFile processor will keep track of what
has been pulled in already,
> so it won't copy the data multiple times. But I may not be clear on this point. Is the
"Log repository" that you mention
> just a directory that NiFi could pull from, or is it some other sort of repository?
>  
> Thanks
> -Mark
>  
>  
>  
> On May 15, 2016, at 3:23 PM, Tripathi, Shiv Deepak <shiv.deepak.tripathi@philips.com> wrote:
>  
> Hi 
>  
> Currently I am using Flume for data ingestion, and my use case is as follows:
>  
> Log repository ---- log-copy script ----> Staging directory for copied logs
>  
> Staging directory for copied logs, folder structure:
>  
>     Machine1log ---- a.log
>                 ---- b.log
>     Machine2log ---- a.log
>                 ---- b.log
>  
> Flume will copy these logs and replicate the same structure in the HDFS cluster, beginning with:
>  
>     /user/hdfs/Machine1log ---- a.log
>                            ---- b.log
>     /user/hdfs/Machine2log ---- a.log
>                            ---- b.log
>  
>  
> And it creates a 0-byte dummy file with the same name so that the script won't copy the same log
again, since it finds a 0-byte file already existing in the source directory.
>  
>  
> Can we do the same things with Apache NiFi?
>  
> Keeping in mind two goals: the same folder structure in HDFS, and that after moving a file to HDFS
it should create a 0-byte dummy file in the source directory.
>  
>  
> Please help
>  
> Thanks,
> Deepak
>  
>  
>  
>  
> With Best Regards,
> Deepak Tripathi
> Philips Innovation campus
> Bangalore-560045
> <image001.png>
>  
>  
> The information contained in this message may be confidential and legally protected under
applicable law. The message is intended solely for the addressee(s). If you are not the intended
recipient, you are hereby notified that any use, forwarding, dissemination, or reproduction
of this message is strictly prohibited and may be unlawful. If you are not the intended recipient,
please contact the sender by return e-mail and destroy all copies of the original message.
>  
> <repository files.jpg><Stagingdirectory_copiedfiles.jpg><HDFS1.jpg><hdfs2.jpg>
>  
> <listfilelocaldir.JPG><HDP sandbox local to HDFS.JPG><Usecaseinputdir.JPG>

