nifi-users mailing list archives

From: Joe Witt <joe.w...@gmail.com>
Subject: Re: ExecuteStreamCommand processor for "tail -n +2" not working as expected
Date: Mon, 26 Oct 2015 14:14:05 GMT

Mark,

Ok, understood.  I think that in the ZIP case the I/O ultimately
happens anyway, but if we can avoid writing uninteresting items to our
repositories at all, then great.  Do you mind filing a JIRA for that?

And yes, you are absolutely right that you should be able to expect
consistent behavior between the execute command/script processors.  We
have discussed this before, but I didn't find a JIRA.  Does anyone else
know the status of this?

Thanks
Joe

On Mon, Oct 26, 2015 at 1:23 AM, Mark Petronic <markpetronic@gmail.com> wrote:
> Joe, yes, I wanted to be able to selectively unzip a specific file
> from a zip archive. For example, I have this zip archive and want to
> just pull all files that match *LMTD* from it to standard out as a
> stream to feed into hdfs as a file put. Since there are a bunch of big
> files there, it is really wasteful of network I/O to have to stream
> the whole file just to throw away most of the bits in a later
> filter stage and end up with only some of the bits. I like
> efficiency where it makes sense and there is already a lot of I/O from
> Hadoop - no need to add more unnecessary stuff that could be easily
> avoided. :)
>
> unzip -l /import/nms/prod/stats/Terminal/GW12/ConsolidatedTermStats_20151022021503.zip
> Archive:  /import/nms/prod/stats/Terminal/GW12/ConsolidatedTermStats_20151022021503.zip
>   Length      Date    Time    Name
> ---------  ---------- -----   ----
>  73166261  10-22-2015 02:17   Consolidated_LMTD_001_20151022021503.csv
>  80864628  10-22-2015 02:17   Consolidated_MODC_001_20151022021503.csv
>  14033836  10-22-2015 02:17   Consolidated_SYMC_001_20151022021503.csv
>    120463  10-22-2015 02:17   Consolidated_XPRT_001_20151022021503.csv
> ---------                     -------
> 168185188                     4 files
>
> On Sun, Oct 25, 2015 at 11:56 AM, Joe Witt <joe.witt@gmail.com> wrote:
>> Hello
>>
>> For the unpacking portion are you saying you have a single archive
>> (let's say in zip format) and it contains multiple objects within.
>> You'd like to be able to use UnpackContent but tell it you'd like to
>> skip or include specific items based on a regex or something against
>> the names?
>>
>> That seems reasonable to do but just wanted to make sure I understood.
>> For now you can put a RouteOnAttribute processor after Unpack and just
>> route to throw away unbundled items you don't care about.  You can
>> create a property on that processor called 'stuff-i-dont-want' and the
>> value would be something like
>> ${filename:matches('.*stuff-i-dont-want.*')}.
>>
>> Thanks
>> Joe
>>
>> On Sun, Oct 25, 2015 at 1:12 AM, Adam Lamar <adamonduty@gmail.com> wrote:
>>> Mark,
>>>
>>>> If I configured the command arguments as
>>>> "-n +2" (without the quotes and space between the two parts), the
>>>> command would result in a "tail -n2" behavior.
>>>
>>> If you look at the tooltip for the Command Arguments property in
>>> ExecuteStreamCommand, you'll see that the arguments need to be delimited by
>>> a semicolon. Maybe try "-n;+2" instead? I'm not sure of the exact rules in
>>> NiFi, but I've seen similar behavior with regard to spaces in libraries that
>>> execute processes with command line arguments.
>>>
>>> There probably is a better way to process the CSV, but I'm afraid someone
>>> else will need to comment on that.
>>>
>>>> Seems like it will only unzip the
>>>> whole zip file and provide me index numbers for each file unpacked.
>>>
>>> A quick look at the UnpackContent source [1] suggests that there is no way
>>> to filter the filenames inside the zipfile prior to extraction. I agree that
>>> would be a useful feature. Maybe one of the NiFi devs will comment on the
>>> possibility of including it as a feature in the future.
>>>
>>> Cheers,
>>> Adam
>>>
>>>
>>> [1]
>>> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/UnpackContent.java#L304
>>>
>>>
>>>
>>> On 10/24/15 9:08 PM, Mark Petronic wrote:
>>>>
>>>> Just starting to use NiFi and built a flow that implements the following:
>>>>
>>>> unzip -p my.zip *LMTD* | tail -n +2 | gzip --fast | hdfs dfs -put -
>>>> /some/hdfs/file
>>>>
>>>> I used the following processor flow:
>>>>
>>>> ExecuteProcess(unzip -p) -> ExecuteStreamCommand(tail -n +2) ->
>>>> CompressContent(gzip) -> PutHDFS
>>>>
>>>> Couple questions/observations:
>>>>
>>>> 1. I got hung up for a while on the ExecuteStreamCommand(tail -n +2)
>>>> part. I need that to strip the header line off of CSV files. I did not
>>>> see a simple way using a specific processor to strip off the first
>>>> line of a flow file. Is there a better way? But, I did notice a very
>>>> odd behavior of this command. If I configured the command arguments as
>>>> "-n +2" (without the quotes and space between the two parts), the
>>>> command would result in a "tail -n2" behavior. So, instead of giving
>>>> me all EXCEPT the first line, I only got the last 2 lines. However,
>>>> using "-n+2" (without the quotes and REMOVING the space) it worked as
>>>> expected. I believe this is confusing to the user. Both forms work
>>>> perfectly from the bash command line but only one works in NiFi?
>>>> Anyone care to comment on this? Should there be an enhancement to
>>>> remove this sort of inconsistent behavior?
>>>>
>>>> 2. Regarding my need to unzip ONLY one specific file from the zip
>>>> files (the one that matches *LMTD*), I did not see a way to do that
>>>> using the UnpackContent processor. Seems like it will only unzip the
>>>> whole zip file and provide me index numbers for each file unpacked.
>>>> This would be quite inefficient in my case because there are a number
>>>> of large files inside the zip file and I only need one. So, seems like
>>>> I am doing this the preferred way but, being new to NiFi, just wanted
>>>> to see if there are any other ideas on how to do this?
>>>>
>>>> Thanks in advance for thoughts on this
>>>
>>>
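
A note on the "-n +2" versus "-n;+2" point raised in the thread: a shell
splits a command line on whitespace before the program starts, whereas a
launcher that splits only on a delimiter such as a semicolon hands "-n +2"
to the program as a single argument. The sketch below illustrates that
difference in Python. It is not NiFi's code; split_arguments and child_argv
are hypothetical helpers, and the semicolon-only rule is an assumption based
on the tooltip Adam describes.

```python
import json
import subprocess
import sys


def split_arguments(raw, delimiter=";"):
    """Split on the delimiter only; whitespace stays inside each argument."""
    return [part for part in raw.split(delimiter) if part]


def child_argv(arguments):
    """Launch a child process and return the argument list it received."""
    probe = "import json, sys; print(json.dumps(sys.argv[1:]))"
    result = subprocess.run(
        [sys.executable, "-c", probe, *arguments],
        stdout=subprocess.PIPE,
        check=True,
        text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # The child sees one argument containing a space.
    print(child_argv(split_arguments("-n +2")))
    # The child sees two arguments, as it would when started from a shell.
    print(child_argv(split_arguments("-n;+2")))
```

A tail that receives the single argument "-n +2" has to interpret " +2" as
the value of -n, which may explain the "tail -n2" behavior Mark reports,
although that depends on the tail implementation in use.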

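On the selective-unzip question, the pipeline from the original post
(unzip -p, tail -n +2, gzip --fast) can be approximated outside NiFi with
Python's standard library. This is only a sketch under stated assumptions:
extract_matching is a hypothetical helper, the sample entry names are
shortened stand-ins for the ones in Mark's listing, and it says nothing
about how UnpackContent works or might be extended.

```python
import fnmatch
import gzip
import io
import shutil
import zipfile


def extract_matching(archive, pattern, out):
    """Gzip the matching zip entries into `out`, dropping each entry's first line.

    Returns the names of the entries that were written.
    """
    written = []
    with zipfile.ZipFile(archive) as zf, \
            gzip.GzipFile(fileobj=out, mode="wb", compresslevel=1) as gz:
        for info in zf.infolist():
            if not fnmatch.fnmatchcase(info.filename, pattern):
                continue  # non-matching entries are never opened
            with zf.open(info) as member:
                member.readline()  # discard the CSV header line
                shutil.copyfileobj(member, gz)
            written.append(info.filename)
    return written


if __name__ == "__main__":
    # Build a small in-memory archive with made-up sample data.
    source = io.BytesIO()
    with zipfile.ZipFile(source, "w") as zf:
        zf.writestr("Consolidated_LMTD_001.csv", "id,value\n1,a\n2,b\n")
        zf.writestr("Consolidated_MODC_001.csv", "id,value\n9,z\n")
    source.seek(0)

    target = io.BytesIO()
    print(extract_matching(source, "*LMTD*", target))
    print(gzip.decompress(target.getvalue()))
```

One difference from the shell pipeline: this drops the first line of every
matching entry, while "unzip -p ... | tail -n +2" drops only the first line
of the combined stream. With a single matching entry the two are equivalent.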