nifi-users mailing list archives

From Mark Petronic <markpetro...@gmail.com>
Subject Re: ExecuteStreamCommand processor for "tail -n +2" not working as expected
Date Mon, 26 Oct 2015 05:23:24 GMT
Joe, yes, I wanted to be able to selectively unzip a specific file
from a zip archive. For example, I have this zip archive and want to
just pull all files that match *LMTD* from it to standard out as a
stream to feed into hdfs as a file put. Since there are a bunch of big
files there, it is a real waste of network I/O to stream
the whole file just to throw away most of the bits in a later
filter stage and keep only a small part of them. I like
efficiency where it makes sense and there is already a lot of I/O from
Hadoop - no need to add more unnecessary stuff that could be easily
avoided. :)

unzip -l /import/nms/prod/stats/Terminal/GW12/ConsolidatedTermStats_20151022021503.zip
Archive:  /import/nms/prod/stats/Terminal/GW12/ConsolidatedTermStats_20151022021503.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
 73166261  10-22-2015 02:17   Consolidated_LMTD_001_20151022021503.csv
 80864628  10-22-2015 02:17   Consolidated_MODC_001_20151022021503.csv
 14033836  10-22-2015 02:17   Consolidated_SYMC_001_20151022021503.csv
   120463  10-22-2015 02:17   Consolidated_XPRT_001_20151022021503.csv
---------                     -------
168185188                     4 files
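
For comparison, the selective extraction described above can be sketched
outside NiFi with Python's standard zipfile module. This is only an
illustration, not a NiFi feature: the function name stream_matching and the
small in-memory demo archive are made up for the example, and the pattern is
a shell-style glob like the *LMTD* used above. Only members whose names match
are decompressed. Note one difference from "unzip -p ... | tail -n +2": this
sketch drops the first line of each matching member, not just the first line
of the combined stream.

```python
import fnmatch
import io
import zipfile


def stream_matching(archive, pattern, out, skip_lines=1):
    """Copy zip members whose names match `pattern` to `out`, minus header lines."""
    matched = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if not fnmatch.fnmatchcase(info.filename, pattern):
                continue  # non-matching members are never decompressed
            matched.append(info.filename)
            with zf.open(info) as member:
                for _ in range(skip_lines):
                    member.readline()  # discard the CSV header line
                # Copy the rest in chunks so large members are not held in memory.
                for chunk in iter(lambda: member.read(64 * 1024), b""):
                    out.write(chunk)
    return matched


# Demo with a small in-memory archive (hypothetical, shortened file names).
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("Consolidated_LMTD_001.csv", "id,value\n1,a\n2,b\n")
    zf.writestr("Consolidated_MODC_001.csv", "id,value\n9,z\n")

result = io.BytesIO()
names = stream_matching(demo, "*LMTD*", result)
print(names)
print(result.getvalue().decode())
```

To feed a real pipeline, `out` could be sys.stdout.buffer and `archive` a
path to the zip file.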

On Sun, Oct 25, 2015 at 11:56 AM, Joe Witt <joe.witt@gmail.com> wrote:
> Hello
>
> For the unpacking portion are you saying you have a single archive
> (let's say in zip format) and it contains multiple objects within.
> You'd like to be able to use UnpackContent but tell it you'd like to
> skip or include specific items based on a regex or something against
> the names?
>
> That seems reasonable to do but just wanted to make sure I understood.
> For now you can put a RouteOnAttribute processor after Unpack and just
> route to throw away unbundled items you don't care about.  You can
> create a property on that processor called 'stuff-i-dont-want' and the
> value would be something like
> ${filename:matches('.*stuff-i-dont-want.*')}.
>
> Thanks
> Joe
>
> On Sun, Oct 25, 2015 at 1:12 AM, Adam Lamar <adamonduty@gmail.com> wrote:
>> Mark,
>>
>>> If I configured the command arguments as
>> "-n +2" (without the quotes and space between the two parts), the
>> command would result in a "tail -n2" behavior.
>>
>> If you look at the tooltip for the Command Arguments property in
>> ExecuteStreamCommand, you'll see that the arguments need to be delimited by
>> a semicolon. Maybe try "-n;+2" instead? I'm not sure of the exact rules in
>> NiFi, but I've seen similar behavior with regard to spaces in libraries that
>> execute processes with command line arguments.
>>
>> There probably is a better way to process the CSV, but I'm afraid someone
>> else will need to comment on that.
>>
>>> Seems like it will only unzip the
>> whole zip file and provide me index numbers for each file unpacked.
>>
>> A quick look at the UnpackContent source [1] suggests that there is no way
>> to filter the filenames inside the zipfile prior to extraction. I agree that
>> would be a useful feature. Maybe one of the NiFi devs will comment on the
>> possibility of including it as a feature in the future.
>>
>> Cheers,
>> Adam
>>
>>
>> [1]
>> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/UnpackContent.java#L304
>>
>>
>>
>> On 10/24/15 9:08 PM, Mark Petronic wrote:
>>>
>>> Just starting to use Nifi and built a flow that implements the following:
>>>
>>> unzip -p my.zip *LMTD* | tail -n +2 | gzip --fast | hdfs dfs -put -
>>> /some/hdfs/file
>>>
>>> I used the following processor flow:
>>>
>>> ExecuteProcess(unzip -p) -> ExecuteStreamCommand(tail -n +2) ->
>>> CompressContent(gzip) -> PutHDFS
>>>
>>> Couple questions/observations:
>>>
>>> 1. I got hung up for a while on the ExecuteStreamCommand(tail -n +2)
>>> part. I need that to strip the header line off of CSV files. I did not
>>> see a simple way using a specific processor to strip off the first
>>> line of a flow file. Is there a better way? But, I did notice a very
>>> odd behavior of this command. If I configured the command arguments as
>>> "-n +2" (without the quotes and space between the two parts), the
>>> command would result in a "tail -n2" behavior. So, instead of giving
>>> me all EXCEPT the first line, I only got the last 2 lines. However,
>>> using "-n+2" (without the quotes and REMOVING the space) it worked as
>>> expected. I believe this is confusing to the user. Both forms work
>>> perfectly from the bash command line but only one works in Nifi?
>>> Anyone care to comment on this? Should there be an enhancement to
>>> remove this sort of inconsistent behavior?
>>>
>>> 2. Regarding my need to unzip ONLY one specific file from the zip
>>> files (the one that matches *LMTD*), I did not see a way to do that
>>> using the UnpackContent processor. Seems like it will only unzip the
>>> whole zip file and provide me index numbers for each file unpacked.
>>> This would be quite inefficient in my case because there are a number
>>> of large files inside the zip file and I only need one. So, seems like
>>> I am doing this the preferred way but, being new to Nifi, just wanted
>>> to see if there are any other ideas on how to do this?
>>>
>>> Thanks in advance for thoughts on this
>>
>>
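
On the "-n +2" versus "-n;+2" point quoted above: a minimal sketch of what
the child process would receive, assuming (per the tooltip Adam mentions)
that the Command Arguments value is split only on semicolons. The helper
split_args is hypothetical and is not NiFi code. It just shows that a space
does not separate arguments under that rule, so "-n +2" would reach tail as
one argument rather than two.

```python
import json
import subprocess
import sys


def split_args(value, delimiter=";"):
    """Split a Command Arguments string on the delimiter only (spaces are kept)."""
    return value.split(delimiter) if value else []


def child_argv(args):
    """Run a tiny child process and return the argument list it actually saw."""
    probe = "import sys, json; print(json.dumps(sys.argv[1:]))"
    done = subprocess.run(
        [sys.executable, "-c", probe, *args],
        capture_output=True, text=True, check=True,
    )
    return json.loads(done.stdout)


print(child_argv(split_args("-n +2")))  # one argument containing a space
print(child_argv(split_args("-n;+2")))  # two separate arguments
```

A shell splits "-n +2" on the space before it starts tail, which is why both
forms behave the same at a bash prompt. How tail then interprets the single
argument "-n +2" depends on the tail implementation; reading it as a count of
2 would be consistent with the behavior Mark reports.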
