nifi-users mailing list archives

From Joe Witt <>
Subject Re: Expression language
Date Fri, 13 Nov 2015 02:14:33 GMT
Under nifi-commons/nifi-utils/ there is a package called  That is where the good stuff is.  Wild
man Tony Kurc bringing the high speed search heat there.


On Thu, Nov 12, 2015 at 9:09 PM, Matt Burgess <> wrote:
> That is awesome to hear, I didn't realize ScanContent worked that way, very cool!
> Sent from my iPhone
>> On Nov 12, 2015, at 8:40 PM, Joe Witt <> wrote:
>> User Experience - everything we do needs to be about continually
>> improving the user experience.  So yes for sure if you've got ideas on
>> how to provide a more intuitive play - yes please.  You will find an
>> implementation of Aho-Corasick under the standard processors
>> (ScanContent) and the associated library under search tools.
>> Amazingly fast.
>> Thanks!
>> Joe
>>> On Thu, Nov 12, 2015 at 8:33 PM, Matt Burgess <> wrote:
>>> Not sure if it would prove useful but I've started messing around with the
>>> Aho-Corasick algorithm in the hopes of the user being able to paste in some
>>> sample data and getting a regex out. If the data is "regular", the user
>>> wouldn't need to know an expression language, they would just need a
>>> representative sample of their data.
>>> Depending on how crazy I want to get, I might do cross-fold validation
>>> (rated against the algorithm on the whole set) for the sample input to see
>>> if it's really "regular" or that guessing a regex is just too hard for the
>>> given data.
>>> Anyway, do you think a "regex guesser" or "NiFi expression guesser" would be
>>> a valuable feature? The missing link is the translator from Finite State
>>> Machine (from Aho-Corasick) to the target model (regex or otherwise). The
>>> research has been done and there is code available (under GPL) so on purpose
>>> I did not read the paper or look at the source.
>>> Sorry in advance if I've gone too far afield here, I've just felt the pains
>>> of users trying to get the right recognizers for their data fields.
>>> Cheers,
>>> Matt
>>> Sent from my iPhone
>>> On Nov 12, 2015, at 7:54 PM, Joe Witt <> wrote:
>>> We have to make this easier...
>>> Maybe we should give someone access to an inline expression editor and see
>>> the results.  Like in regexpal...
>>>> On Nov 12, 2015 7:26 PM, "Charlie Frasure" <> wrote:
>>>> Good call.  I added trim() to the matches command, and it seems to have
>>>> resolved the issue.  I was checking for sane lengths, but maybe there was
>>>> \n or something in there.  Problem for another day.  Thanks.
>>>> On Thu, Nov 12, 2015 at 7:13 PM, Matthew Clarke
>>>> <> wrote:
>>>>> Make sure your attribute name and value do not have whitespace on
>>>>> either side. A space is a valid character and is often overlooked:
>>>>> " encoding" does not equal "encoding" or "encoding ". The same applies
>>>>> to the attribute values.
>>>>> On Nov 12, 2015 7:07 PM, "Charlie Frasure" <>
>>>>> wrote:
>>>>>> Thanks.  I did use the matches syntax already and checked the attribute
>>>>>> values in each processor using Data Provenance, but I will try adding
>>>>>> additional bulletin to see if something else surfaces.
>>>>>> On Thu, Nov 12, 2015 at 7:00 PM, Matthew Clarke
>>>>>> <> wrote:
>>>>>>> Try adding a LogAttribute processor after your encoding test to see
>>>>>>> what values are actually getting assigned to the encoding attribute.
>>>>>>> Attributes are always stored as strings, so I don't think you need to
>>>>>>> use the literal function. I would suggest trying
>>>>>>> ${encoding:matches('utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1')}
>>>>>>> Matches is an exact match and values are case sensitive.
>>>>>>> If you set the bulletin level on the LogAttribute processor to
>>>>>>> all the attribute key/value pairs will be displayed on the processor
>>>>>>> when hovering over the bulletin (yellow post-it). They will also be
>>>>>>> dumped to the app log.
>>>>>>> On Nov 12, 2015 6:40 PM, "Charlie Frasure" <>
>>>>>>> wrote:
>>>>>>>> I am attempting to convert many files with various encodings to a
>>>>>>>> common character set.  I have an attribute called 'encoding' that
>>>>>>>> stores the result of an encoding test.  When passing that value as the
>>>>>>>> source to the ConvertCharacterSet processor, it didn't match the
>>>>>>>> processor's values.  I added an UpdateAttribute processor that is
>>>>>>>> attempting to compare 'encoding' to known valid Java character sets.
>>>>>>>> That comparison is where I am having trouble.  In SQL it would be
>>>>>>>> "where encoding in ('utf-16', 'utf-16be', 'utf-16le', 'us-ascii',
>>>>>>>> 'iso-8859-1')."
>>>>>>>> Based on this document, I thought that 'literal' would be a good
>>>>>>>> function combined with 'contains'.
>>>>>>>> Once the comparison is working, I will send the matching files to the
>>>>>>>> ConvertCharacterSet processor.
>>>>>>>> On Thu, Nov 12, 2015 at 6:24 PM, Matthew Clarke
>>>>>>>> <> wrote:
>>>>>>>>> Charlie,
>>>>>>>>>     I am not sure what your use case is here. 'Literal' is not a
>>>>>>>>> NiFi expression language function. If you can give me some detail on
>>>>>>>>> what you are trying to do, I can help you with the NiFi expression
>>>>>>>>> strategy to accomplish it. Did you create a FlowFile attribute named
>>>>>>>>> 'encoding'?
>>>>>>>>> Matt
>>>>>>>>> On Nov 12, 2015 6:15 PM, "Charlie Frasure" <>
>>>>>>>>> wrote:
>>>>>>>>>> Typos on my regex were just in the email, not the processor.  It
>>>>>>>>>> should have read ${encoding:match...
>>>>>>>>>> On Thu, Nov 12, 2015 at 6:03 PM, Charlie Frasure
>>>>>>>>>> <> wrote:
>>>>>>>>>>> This expression does not parse without error:
>>>>>>>>>>> ${literal('utf-8 utf-16 utf-16be utf-16le us-ascii iso-8859-1'):contains(encoding)}
>>>>>>>>>>> Is it not possible to use an attribute in a comparison?
>>>>>>>>>>> Unexpected token 'encoding' at line 1, column 73. Query:
>>>>>>>>>>> ${literal(utf-8 utf-16 utf-16be utf-16le us-ascii iso-8859-1):contains(encoding)}
>>>>>>>>>>> Alternatively, I think a regex should work, but didn't immediately
>>>>>>>>>>> get a match using:
>>>>>>>>>>> ${enconding.match('utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1')}
>>>>>>>>>>> Charlie
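
The whitespace problem resolved in this thread can be reproduced outside NiFi. The following is a minimal Python sketch, not NiFi code. It assumes that `matches` requires the whole attribute value to match the pattern and is case sensitive, as described in the thread, and it models that behaviour with `re.fullmatch`. The function name `matches` and the sample values are hypothetical and chosen only for illustration.

```python
import re

# Pattern taken from the thread: the character sets Charlie wanted to accept.
CHARSET_PATTERN = r"utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1"


def matches(value: str, pattern: str = CHARSET_PATTERN) -> bool:
    """Return True only if the entire value matches the pattern.

    Assumption: this mirrors the exact-match, case-sensitive behaviour
    described for ${encoding:matches(...)}; it is not NiFi's implementation.
    """
    return re.fullmatch(pattern, value) is not None


if __name__ == "__main__":
    # Compare each raw value with its trimmed form.
    for raw in ["utf-8", "utf-8\n", " utf-8", "UTF-8", "utf-16le"]:
        print(repr(raw), matches(raw), matches(raw.strip()))
```

Under these assumptions, a value with a trailing newline or a leading space fails the comparison until it is trimmed, which is consistent with trim() resolving the issue in the thread. A case difference such as "UTF-8" still fails after trimming.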
