Thank you for your suggestions, Andy and Lee.
I am aware of the ListFile-FetchFile-HashContent flow. I didn’t go that route
because the ListFile processor does not accept input from an upstream processor. I have an upstream
processor that tells me which directory I want to work with. I ended up passing the directory name
to the ExecuteStreamCommand processor to list ALL the files under the directory. After that
I use SplitText and ExtractText to keep only the files with the desired file extension, and then
I use FetchFile and HashContent to finish what I want to do.
If ListFile allowed upstream input, it would have made my data flow much simpler. The same
goes for the ListSFTP processor.
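As a rough illustration of the last three steps (list everything under a directory, keep only the files with the desired extension, hash each one), here is a minimal Python sketch that runs outside NiFi. The function name `md5_of_matching_files` and the `.gz` default are assumptions made only for this example; it is not how the NiFi processors are implemented.

```python
import hashlib
from pathlib import Path


def md5_of_matching_files(directory, extension=".gz"):
    """Return {file name: MD5 hex digest} for files in `directory` ending in `extension`."""
    digests = {}
    for path in sorted(Path(directory).iterdir()):
        # Skip subdirectories and files with a different extension.
        if not path.is_file() or not path.name.endswith(extension):
            continue
        md5 = hashlib.md5()
        with path.open("rb") as handle:
            # Read in chunks so large archives are not loaded into memory at once.
            for chunk in iter(lambda: handle.read(65536), b""):
                md5.update(chunk)
        digests[path.name] = md5.hexdigest()
    return digests
```

The directory is an ordinary argument here, which is the piece I am missing in ListFile: the caller decides at run time which directory to scan.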
Huagen
> On May 31, 2016, at 2:56 PM, Lee Laim <lee.laim@gmail.com> wrote:
>
> Huagen,
>
> I had a similar workflow and eventually replaced ExecuteStreamCommand(md5sum) with HashContent.
>
> Using ListFile->FetchFile->HashContent, the resultant hash is placed into the
> flowfile under the attribute ${hash.value}.
> This processor offers ~40 algorithms to choose from, including md5. Compared to the
> ExecuteStreamCommand, the HashContent processor offers a bit more in error-handling and lineage
> traceability in this specific case.
>
> Thanks,
> -Lee
>
>
> On Tue, May 31, 2016 at 11:24 AM, Andy LoPresto <alopresto@apache.org> wrote:
> Huagen,
>
> The ExecuteStreamCommand is used to run a command against the contents of an incoming
> flowfile. For example, you could have a ListFile processor listing all .gz files in the directory
> and passing them to the ExecuteStreamCommand processor to generate the MD5 hash of each. In
> this case, you would not need a wildcard character in the command.
>
> The configuration for the processors is as follows:
>
> ListFile:
> -Input directory: <the directory where the files are located>
> -File Filter: [^\.].*\.gz
>
> ExecuteStreamCommand:
> -Command arguments: ${filename}
> -Command path: md5
> -Working Directory: <the directory where the files are located>
> -Output Destination Attribute: md5hash
>
> Notes:
> -I am using “md5” rather than “md5sum” as I am on Mac OS X.
> -You could use the “-q” flag for “md5” to suppress extraneous output
> -You could use “${absolute.path}/${filename}” as the command arguments, in which
> case you would not need to set the working directory
>
> Andy LoPresto
> alopresto@apache.org
> alopresto.apache@gmail.com
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69
>
>> On May 31, 2016, at 7:02 AM, Huagen peng <huagen.peng@gmail.com> wrote:
>>
>> Hi, I would like to run the md5sum command on all the *.gz files under a certain directory.
>> However, I keep getting this error:
>> md5sum: stat '/tmp/transfer/16-05-22_00/*.gz': No such file or directory
>>
>> I tried quoting the * wildcard character and adding a . dot or / in front, to no avail.
>> Can I do something like this with the ExecuteStreamCommand processor?
>>
>> Thanks.
>
>
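
For context on the error in the original question: a wildcard such as *.gz is normally expanded by a shell before the command runs, and the quoted md5sum message shows the literal pattern arriving unexpanded. The Python sketch below reproduces that distinction outside NiFi. The helper names `stat_literal` and `expand_pattern` are hypothetical and chosen only for this example.

```python
import glob
import os


def stat_literal(pattern):
    """Treat the pattern as a literal path, as a program does when no shell expands it."""
    try:
        os.stat(pattern)
        return True
    except OSError:
        # No file is literally named "*.gz", so the lookup fails.
        return False


def expand_pattern(pattern):
    """Expand the wildcard explicitly, which is the step a shell would normally perform."""
    return sorted(glob.glob(pattern))
```

Calling `stat_literal` with a path ending in `*.gz` fails unless a file with that literal name exists, while `expand_pattern` on the same string returns the matching files. Listing the files first and passing one file name per flowfile, as suggested in the thread, avoids the need for any expansion.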