nifi-users mailing list archives

From Joe Witt <joe.w...@gmail.com>
Subject Re: Expression language
Date Fri, 13 Nov 2015 02:14:33 GMT
Under nifi-commons/nifi-utils/ there is a package called
org.apache.nifi.util.search. That is where the good stuff is. Wild
man Tony Kurc is bringing the high-speed search heat there.

Thanks
Joe

On Thu, Nov 12, 2015 at 9:09 PM, Matt Burgess <mattyb149@gmail.com> wrote:
> That is awesome to hear, I didn't realize ScanContent worked that way, very cool!
>
> Sent from my iPhone
>
>> On Nov 12, 2015, at 8:40 PM, Joe Witt <joe.witt@gmail.com> wrote:
>>
>> User Experience - everything we do needs to be about continually
>> improving the user experience.  So yes for sure if you've got ideas on
>> how to provide a more intuitive play - yes please.  You will find an
>> implementation of Aho-Corasick under the standard processors
>> (ScanContent) and the associated library under search tools.
>> Amazingly fast.
>>
>> Thanks!
>> Joe
>>
>>> On Thu, Nov 12, 2015 at 8:33 PM, Matt Burgess <mattyb149@gmail.com> wrote:
>>> Not sure if it would prove useful but I've started messing around with the
>>> Aho-Corasick algorithm in the hopes of the user being able to paste in some
>>> sample data and getting a regex out. If the data is "regular", the user
>>> wouldn't need to know an expression language, they would just need a
>>> representative sample of their data.
>>>
>>> Depending on how crazy I want to get, I might do cross-fold validation
>>> (rated against the algorithm on the whole set) for the sample input to see
>>> if it's really "regular" or that guessing a regex is just too hard for the
>>> given data.
>>>
>>> Anyway, do you think a "regex guesser" or "NiFi expression guesser" would be
>>> a valuable feature? The missing link is the translator from Finite State
>>> Machine (from Aho-Corasick) to the target model (regex or otherwise). The
>>> research has been done and there is code available (under GPL) so on purpose
>>> I did not read the paper or look at the source.
>>>
>>> Sorry in advance if I've gone too far afield here, I've just felt the pains
>>> of users trying to get the right recognizers for their data fields.
>>>
>>> Cheers,
>>> Matt
>>>
>>> Sent from my iPhone
>>>
>>> On Nov 12, 2015, at 7:54 PM, Joe Witt <joe.witt@gmail.com> wrote:
>>>
>>> We have to make this easier...
>>>
>>> Maybe we should give someone access to an inline expression editor and see
>>> the results.  Like in regexpal...
>>>
>>>> On Nov 12, 2015 7:26 PM, "Charlie Frasure" <charliefrasure@gmail.com>
>>>> wrote:
>>>>
>>>> Good call.  I added trim() to the matches command, and it seems to have
>>>> resolved the issue.  I was checking for sane lengths, but maybe there was a
>>>> \n or something in there.  Problem for another day.  Thanks.
>>>>
>>>>
>>>> On Thu, Nov 12, 2015 at 7:13 PM, Matthew Clarke
>>>> <matt.clarke.138@gmail.com> wrote:
>>>>>
>>>>> Make sure your attribute name and value do not have white space on
>>>>> either side. A 'space' is a valid character and is often overlooked.
>>>>> " encoding" does not equal "encoding" or "encoding ". The same applies
>>>>> for the attribute values.
>>>>>
>>>>> On Nov 12, 2015 7:07 PM, "Charlie Frasure" <charliefrasure@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Thanks.  I did use the matches syntax already and checked the attribute
>>>>>> values in each processor using Data Provenance, but I will try adding the
>>>>>> additional bulletin to see if something else surfaces.
>>>>>>
>>>>>> On Thu, Nov 12, 2015 at 7:00 PM, Matthew Clarke
>>>>>> <matt.clarke.138@gmail.com> wrote:
>>>>>>>
>>>>>>> Try adding a LogAttribute processor after your encoding test to see
>>>>>>> what values are actually getting assigned to the encoding attribute.
>>>>>>> Attributes are always stored as strings, so I don't think you need to use
>>>>>>> the literal function. I would suggest trying
>>>>>>> ${encoding:matches('utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1')}
>>>>>>>
>>>>>>> Matches is an exact match and values are case sensitive.
>>>>>>>
>>>>>>> If you set the bulletin level on the LogAttribute processor to 'info',
>>>>>>> all the attribute key/value pairs will be displayed on the processor by
>>>>>>> hovering over the bulletin (yellow post-it). They will also be dumped to
>>>>>>> the app log.
>>>>>>>
>>>>>>> On Nov 12, 2015 6:40 PM, "Charlie Frasure" <charliefrasure@gmail.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I am attempting to convert many files with various encodings to a
>>>>>>>> common character set.  I have an attribute called 'encoding' that stores the
>>>>>>>> result of an encoding test.  When passing that value as the source to the
>>>>>>>> ConvertCharacterSet processor, it didn't match the processor's expected
>>>>>>>> values.  I added an UpdateAttribute processor that is attempting to compare
>>>>>>>> 'encoding' to known valid Java character sets.  That comparison is where I
>>>>>>>> am having trouble.  In SQL it would be "where encoding in ('utf-8',
>>>>>>>> 'utf-16', 'utf-16be', 'utf-16le', 'us-ascii', 'iso-8859-1')."
>>>>>>>>
>>>>>>>> Based on this document, I thought that 'literal' would be a good
>>>>>>>> function combined with 'contains'.
>>>>>>>>
>>>>>>>> https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#literal
>>>>>>>>
>>>>>>>> Once the comparison is working, I will send the matching files to the
>>>>>>>> ConvertCharacterSet processor.
>>>>>>>>
>>>>>>>> On Thu, Nov 12, 2015 at 6:24 PM, Matthew Clarke
>>>>>>>> <matt.clarke.138@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Charlie,
>>>>>>>>>     I am not sure what your use case is here. 'Literal' is not a
>>>>>>>>> NiFi expression language function. If you can give me some detail on what
>>>>>>>>> you are trying to do, I can help you with the NiFi expression language
>>>>>>>>> strategy to accomplish it. Did you create a FlowFile attribute named
>>>>>>>>> 'encoding'?
>>>>>>>>>
>>>>>>>>> Matt
>>>>>>>>>
>>>>>>>>> On Nov 12, 2015 6:15 PM, "Charlie Frasure" <charliefrasure@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Typos on my regex were just in the email, not the processor.  It
>>>>>>>>>> should have read ${encoding:match...
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 12, 2015 at 6:03 PM, Charlie Frasure
>>>>>>>>>> <charliefrasure@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> This expression does not parse without error:
>>>>>>>>>>> ${literal('utf-8 utf-16 utf-16be utf-16le us-ascii
>>>>>>>>>>> iso-8859-1'):contains(encoding)}
>>>>>>>>>>>
>>>>>>>>>>> Is it not possible to use an attribute in a comparison function?
>>>>>>>>>>> Unexpected token 'encoding' at line 1, column 73. Query:
>>>>>>>>>>> ${literal(utf-8 utf-16 utf-16be utf-16le us-ascii
>>>>>>>>>>> iso-8859-1):contains(encoding)}
>>>>>>>>>>>
>>>>>>>>>>> Alternatively, I think a regex should work, but didn't immediately
>>>>>>>>>>> get a match using:
>>>>>>>>>>>
>>>>>>>>>>> ${enconding.match('utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1')}
>>>>>>>>>>>
>>>>>>>>>>> Charlie
>>>
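
The whitespace problem that trim() resolved in this thread can be reproduced
outside NiFi. The following is a minimal Python sketch, not NiFi code. It
assumes that the Expression Language matches function requires the whole
attribute value to match the regex (as the "exact match" remark in the thread
suggests), which Python's re.fullmatch approximates. The function name
is_known_encoding and the sample values are hypothetical.

```python
import re

# Same alternation as in the expression suggested in the thread.
ENCODING_PATTERN = re.compile(
    r"utf-8|utf-16|utf-16be|utf-16le|us-ascii|iso-8859-1"
)


def is_known_encoding(value: str, trim: bool = False) -> bool:
    """Return True when the entire value is one of the listed encodings.

    With trim=True, surrounding whitespace (including a trailing newline)
    is stripped first, analogous to calling trim() before matches.
    """
    candidate = value.strip() if trim else value
    # fullmatch only succeeds if the pattern covers the whole string;
    # this is an assumed stand-in for NiFi's matches semantics.
    return ENCODING_PATTERN.fullmatch(candidate) is not None


if __name__ == "__main__":
    # Hypothetical attribute values, some with stray whitespace.
    for raw in ["utf-8", "utf-8\n", " utf-16le", "UTF-8"]:
        print(repr(raw), is_known_encoding(raw), is_known_encoding(raw, trim=True))
```

Under that assumption, a value with a trailing newline or leading space fails
the plain comparison and passes once trimmed, while a value in a different
case fails either way, since the comparison is case sensitive.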
