nifi-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tim Späth (JIRA) <j...@apache.org>
Subject [jira] [Updated] (NIFI-4385) Adjust the QueryDatabaseTable processor for handling big tables.
Date Fri, 15 Sep 2017 11:42:00 GMT

     [ https://issues.apache.org/jira/browse/NIFI-4385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Tim Späth updated NIFI-4385:
----------------------------
    Description: 
When querying large database tables, the *QueryDatabaseTable* processor does not perform very
well.
The processor will always perform the full query and then transfer all flowfiles as a list
instead of 
transferring them particularly after the *ResultSet* is fetching the next rows(If a fetch
size is given). 
If you want to query a billion rows from a table, 
the processor will add all flowfiles in an ArrayList<FlowFile> in memory 
before transferring the whole list after the last row is fetched by the ResultSet. 
I've checked the code in *org.apache.nifi.processors.standard.QueryDatabaseTable.java* 
and in my opinion, it would be no big deal to move the session.transfer to a proper position
in the code (into the while loop where the flowfile is added to the list) to 
achieve a real _stream support_. There was also a bug report for this problem 
which resulted in adding the new property *Maximum Number of Fragments*, 
but this property will just limit the results. 
Now you have to multiply *Max Rows Per Flow File* with *Maximum Number of Fragments* to get
your limit, 
which is not really a solution for the original problem imho. 
Also the workaround with GenerateTableFetch and/or ExecuteSQL processors is much slower than
using a database cursor or a ResultSet
and stream the rows in flowfiles directly in the queue.





  was:
When querying large database tables, the *QueryDatabaseTable * processor does not perform
very well.
The processor will always perform the full query and then transfer all flowfiles as a list
instead of 
transferring them particularly after the *ResultSet *is fetching the next rows(If a fetch
size is given). 
If you want to query a billion rows from a table, 
the processor will add all flowfiles in an ArrayList<FlowFile> in memory 
before transferring the whole list after the last row is fetched by the ResultSet. 
I've checked the code in *org.apache.nifi.processors.standard.QueryDatabaseTable.java* 
and in my opinion, it would be no big deal to move the session.transfer to a proper position
in the code (into the while loop where the flowfile is added to the list) to 
achieve a real _stream support_. There was also a bug report for this problem 
which resulted in adding the new property *Maximum Number of Fragments*, 
but this property will just limit the results. 
Now you have to multiply *Max Rows Per Flow File* with *Maximum Number of Fragments* to get
your limit, 
which is not really a solution for the original problem imho. 
Also the workaround with GenerateTableFetch and/or ExecuteSQL processors is much slower than
using a database cursor or a ResultSet
and stream the rows in flowfiles directly in the queue.






> Adjust the QueryDatabaseTable processor for handling big tables.
> ----------------------------------------------------------------
>
>                 Key: NIFI-4385
>                 URL: https://issues.apache.org/jira/browse/NIFI-4385
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework
>    Affects Versions: 1.3.0
>            Reporter: Tim Späth
>
> When querying large database tables, the *QueryDatabaseTable* processor does not perform
very well.
> The processor will always perform the full query and then transfer all flowfiles as a
list instead of 
> transferring them particularly after the *ResultSet* is fetching the next rows(If a fetch
size is given). 
> If you want to query a billion rows from a table, 
> the processor will add all flowfiles in an ArrayList<FlowFile> in memory 
> before transferring the whole list after the last row is fetched by the ResultSet. 
> I've checked the code in *org.apache.nifi.processors.standard.QueryDatabaseTable.java*

> and in my opinion, it would be no big deal to move the session.transfer to a proper position
in the code (into the while loop where the flowfile is added to the list) to 
> achieve a real _stream support_. There was also a bug report for this problem 
> which resulted in adding the new property *Maximum Number of Fragments*, 
> but this property will just limit the results. 
> Now you have to multiply *Max Rows Per Flow File* with *Maximum Number of Fragments*
to get your limit, 
> which is not really a solution for the original problem imho. 
> Also the workaround with GenerateTableFetch and/or ExecuteSQL processors is much slower
than using a database cursor or a ResultSet
> and stream the rows in flowfiles directly in the queue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message