nifi-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (NIFI-2712) Database Fetch processors' max-value columns don't work as expected
Date Tue, 06 Sep 2016 19:37:20 GMT


ASF GitHub Bot commented on NIFI-2712:

Github user jtstorck commented on the issue:
    +1 on this PR.
    Based on the scope of the code changes and the unit tests that have been introduced,
this addresses the use case of being able to query tables based on a hierarchy of columns that
takes a primary column and partition columns into consideration.
    I did not run this against a live database, so the committer may want to do that for a
sanity check.  The unit tests pass and look like they cover the hierarchical partitioned query
use cases, and they use Derby as the database.
    There is one known issue that could occur with partitioning: not all data would
be fetched from a partition if new data arrives with a new value for a partition
column before all the data in the previous partition has been retrieved. According to @mattyb149,
this is an edge case.  In any case, I think it can be avoided by designing the flow to account
for it.
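
One reading of this edge case can be sketched with a small simulation. This is only an illustration under assumptions, not the processor's code: it assumes the last-observed maxima are tracked as a single (partition, value) pair compared lexicographically, and `fetch_new` and `simulate` are hypothetical names.

```python
def fetch_new(rows, state):
    """Return rows lexicographically greater than the stored (partition, value)
    state, plus the updated state. A state of None means nothing was seen yet."""
    new_rows = sorted(r for r in rows if state is None or r > state)
    new_state = max(new_rows, default=state)
    return new_rows, new_state


def simulate():
    """Replay three fetches against a growing table of (partition, value) rows."""
    table = [(1, 10), (1, 20)]
    first, state = fetch_new(table, None)

    # A row with a new partition value arrives and is fetched.
    table.append((2, 5))
    second, state = fetch_new(table, state)

    # A late row for the previous partition only becomes visible now.
    table.append((1, 30))
    third, state = fetch_new(table, state)
    return first, second, third, state


if __name__ == "__main__":
    first, second, third, state = simulate()
    print(first, second, third, state)
```

In this sketch the third fetch returns nothing, because the stored state has already moved on to partition 2 and the late row `(1, 30)` compares as older. A flow that only advances to a new partition once the previous one is known to be complete would avoid this.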

> Database Fetch processors' max-value columns don't work as expected
> -------------------------------------------------------------------
>                 Key: NIFI-2712
>                 URL:
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Matt Burgess
>            Assignee: Matt Burgess
> Currently, for QueryDatabaseTable and GenerateTableFetch, the user can enter any number
of maximum-value columns, which are used to generate a SQL query that will fetch all records
whose values are greater than the last-observed maximum values for those columns.
> However, this makes multiple max-value columns not very useful, since they all have
to increase in lockstep or records will be lost/skipped. In such a case, using any one of
them would suffice, making multiple max-value columns useless.
> The more likely use case is that there are multiple columns whose values are strictly
increasing, but at different rates. This is common with very large tables, where one column could
be "date_created" and another a "bucket number" that strictly increases once a day. Queries
for a day's worth of data are more efficient if they can be filtered on "bucket" (in this
case), then on timestamp. However, the generated SQL queries would have to reflect that "bucket"
may remain the same while the timestamp is increasing, but once the bucket value has increased,
only the (new) timestamps for that bucket should be fetched.
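
The hierarchical filter described above can be sketched as a lexicographic predicate over the ordered columns. This is a minimal illustration under assumptions, not necessarily the SQL the processors generate: `build_hierarchical_predicate` is a hypothetical helper, the columns are assumed to be listed slowest-changing first ("bucket", then "date_created"), and column names are assumed to be trusted identifiers.

```python
def build_hierarchical_predicate(columns, last_max):
    """Build a WHERE fragment selecting rows "after" the last-observed maxima.

    columns  -- column names ordered from slowest- to fastest-changing
    last_max -- dict mapping each column to its last-observed maximum
    Returns (sql_fragment, params) using '?' placeholders. An empty fragment
    means no maxima are known yet, so the caller should fetch everything.
    """
    if not columns or not all(c in last_max for c in columns):
        return "", []

    clauses, params = [], []
    for i, col in enumerate(columns):
        # Equal on every earlier column, strictly greater on this one.
        parts = [f"{prev} = ?" for prev in columns[:i]] + [f"{col} > ?"]
        clauses.append("(" + " AND ".join(parts) + ")")
        params.extend(last_max[c] for c in columns[: i + 1])
    return " OR ".join(clauses), params


if __name__ == "__main__":
    sql, params = build_hierarchical_predicate(
        ["bucket", "date_created"],
        {"bucket": 7, "date_created": "2016-09-06 12:00:00"},
    )
    print(sql)
    print(params)
```

With these inputs the fragment is `(bucket > ?) OR (bucket = ? AND date_created > ?)`: rows in the same bucket are fetched only if their timestamp is newer, while any row in a later bucket is fetched regardless of timestamp.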
