nifi-issues mailing list archives

From "Joe Witt (Jira)" <j...@apache.org>
Subject [jira] [Updated] (MINIFICPP-1199) Integrates MiNiFi C++ with H2O Driverless AI Python Scoring Pipeline To Do ML Inference on Edge
Date Mon, 04 May 2020 14:49:00 GMT

     [ https://issues.apache.org/jira/browse/MINIFICPP-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe Witt updated MINIFICPP-1199:
--------------------------------
    Priority: Blocker  (was: Major)

> Integrates MiNiFi C++ with H2O Driverless AI Python Scoring Pipeline To Do ML Inference
on Edge
> -----------------------------------------------------------------------------------------------
>
>                 Key: MINIFICPP-1199
>                 URL: https://issues.apache.org/jira/browse/MINIFICPP-1199
>             Project: Apache NiFi MiNiFi C++
>          Issue Type: New Feature
>    Affects Versions: master
>         Environment: Ubuntu 18.04 in AWS EC2
> MiNiFi C++ 0.7.0
>            Reporter: James Medel
>            Priority: Blocker
>             Fix For: master
>
>
> *MiNiFi C++ and H2O Driverless AI Integration* via Custom Python Processors:
> Integrates MiNiFi C++ with H2O's Driverless AI by using Driverless AI's Python Scoring
Pipeline and MiNiFi's custom Python processors. The Python processors execute the Python
Scoring Pipeline scorer to do batch scoring and real-time scoring for one or more predicted
labels on the test data in the incoming flow file content. I would like to contribute my processors
to MiNiFi C++ as a new feature.
>  
> *3 custom Python processors* created for MiNiFi:
> *H2oPspScoreRealTime* - Executes H2O Driverless AI's Python Scoring Pipeline to do interactive
(real-time) scoring on an individual row, or list, of test data within each incoming
flow file. Uses H2O's open-source Datatable library to load test data into a frame, then converts
it to a pandas dataframe. Pandas is used to convert the dataframe rows to a list of lists,
but since each flow file passing through this processor should have only 1 row, we extract
the 1st list. That list is then passed into Driverless AI's Python scorer.score() function
to predict one or more labels. The prediction is returned as a list. The number
of predicted labels is determined when the user builds the Python Scoring Pipeline in Driverless
AI. With that knowledge, there is a property for the user to pass in one or more predicted
label names that will be used as the predicted header. I create a comma-separated string from
the predicted header and predicted value. The predicted header(s) is on one line followed
by a newline, and the predicted value(s) is on the next line followed by a newline. The string
is written to the flow file content. Flow file attributes are added for the
number of lists scored and for each predicted label name and its associated score. Finally, the
flow file is transferred on a success relationship.
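A minimal sketch of the output formatting step described above, in plain Python. It is not the
processor's actual code: the function name format_prediction and the label names are hypothetical,
and the scores are made-up values standing in for what scorer.score() would return.

```python
def format_prediction(label_names, scores):
    """Build the two-line flow file content: header line, then value line."""
    if len(label_names) != len(scores):
        raise ValueError("expected one score per predicted label name")
    header = ",".join(label_names)
    values = ",".join(str(score) for score in scores)
    # Each line ends with a newline, matching the description above.
    return header + "\n" + values + "\n"


# Hypothetical label names and scores for a single scored row.
content = format_prediction(["label_a", "label_b"], [0.25, 0.75])
print(content, end="")
```

Checking that the number of label names matches the number of scores catches a mismatch between the
user-supplied property and the pipeline's outputs before anything is written to the flow file.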
>  
> *H2oPspScoreBatches* - Executes H2O Driverless AI's Python Scoring Pipeline to do batch
scoring on a frame of data within each incoming flow file. Uses H2O's open-source Datatable
library to load test data into a frame. Each frame from a flow file passing through this
processor should have multiple rows. That frame is passed into Driverless AI's Python
scorer.score_batch() function to predict one or more labels. The prediction is returned
as a pandas dataframe, which is then converted to a string so it can be written
to the flow file content. A flow file attribute is added for the number of
rows scored. Flow file attributes are also added for each predicted label name and its
associated score for the first row in the frame. Finally, the flow file is transferred on
a success relationship.
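The attribute bookkeeping described above could look roughly like the following sketch. It uses
plain lists in place of the pandas dataframe returned by scorer.score_batch(), and the function
name batch_attributes, the attribute key num_rows_scored, and the label names are all hypothetical.

```python
def batch_attributes(label_names, predictions):
    """Return flow file attributes for a scored batch.

    predictions is a list of rows; each row holds one score per label.
    """
    # Attribute values are stored as strings.
    attributes = {"num_rows_scored": str(len(predictions))}
    if predictions:
        # Only the first row's scores are exposed as attributes.
        for name, score in zip(label_names, predictions[0]):
            attributes[name] = str(score)
    return attributes


# Hypothetical predictions for a two-row batch with two labels.
attrs = batch_attributes(["label_a", "label_b"], [[0.9, 0.1], [0.3, 0.7]])
print(attrs)
```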
>  
> *ConvertDsToCsv* - Converts the data source of the incoming flow file to CSV. Uses H2O's open-source
Datatable library to load data into a frame, then converts it to a pandas dataframe. Pandas
is used to write the dataframe as CSV into an in-memory StringIO text stream,
omitting the dataframe index. The CSV string is then retrieved by calling read()
on the StringIO object, so it can be written to the flow file content. The flow file is transferred
on a success relationship.
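The StringIO step can be illustrated with the sketch below. To stay runnable without third-party
packages, it swaps in Python's standard csv module for Datatable and pandas, so it shows only the
write, rewind, read pattern, not the processor's actual code. The function name rows_to_csv and
the column names are hypothetical.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Write a header and rows to an in-memory text stream, then read it back."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)  # no index column is written
    writer.writerows(rows)
    # Rewind first: read() returns text from the current stream position.
    buffer.seek(0)
    return buffer.read()


# Hypothetical sensor columns and readings.
csv_text = rows_to_csv(["sensor_a", "sensor_b"], [[155.4, 35.6], [155.6, 35.5]])
print(csv_text, end="")
```

Without the seek(0) call, read() would return an empty string because the stream position sits at
the end of the data after writing.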
>  
> *Hydraulic System Condition Monitoring* Data used in MiNiFi Flow:
> The sensor test data I used in this integration comes from Kaggle: Condition Monitoring
of Hydraulic Systems (https://www.kaggle.com/jjacostupa/condition-monitoring-of-hydraulic-systems#description.txt).
I was able to predict hydraulic system cooling efficiency through the MiNiFi and H2O integration
described above. The use case here is hydraulic system predictive maintenance.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
