From: Matt Burgess
Date: Wed, 3 May 2017 15:47:49 -0400
Subject: Re: Data Load
Message-Id: <54028BF3-2A81-4307-A234-3F0FB1C17D1C@gmail.com>
To: dev@nifi.apache.org

Anil,

When you say huge volumes, do you mean a large number of tables, or large tables, or both?

For a large number of tables, you will likely want to upgrade to the upcoming NiFi release so you can use ListDatabaseTables -> GenerateTableFetch -> ExecuteSQL for the source part, although in the meantime you could use ListDatabaseTables -> ReplaceText (to set the SELECT query) -> ExecuteSQL.

For large tables on a single NiFi instance, I recommend using QueryDatabaseTable. Whether you have a max-value column or not, QDT lets you fetch smaller batches of rows per flow file, whereas ExecuteSQL puts the whole result set in one flow file, which can lead to memory issues.

For scalability with large tables, I recommend a NiFi cluster of 3-10 nodes, using a flow of GenerateTableFetch -> RPG -> Input Port -> ExecuteSQL for the source part. In 1.2.0 you'll be able to have ListDatabaseTables at the front, to support a large number of tables. The RPG -> Input Port part distributes the flow files across the cluster, so the downstream flow is executed in parallel, with each node handling a subset of the incoming flow files (rather than each copy of the flow getting every flow file).
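The paging behavior described above can be sketched roughly as follows. This is an illustration of the idea, not NiFi's actual implementation: each generated statement would become one flow file, so downstream ExecuteSQL instances can fetch pages in parallel. The table name, column name, and page size are hypothetical.

```python
# Illustration (hypothetical names) of the paging idea behind
# GenerateTableFetch: split a large table into bounded SELECTs,
# one per flow file, instead of one giant result set.

def page_queries(table, order_col, row_count, page_size):
    """Return one paged SELECT per chunk of `page_size` rows."""
    queries = []
    for offset in range(0, row_count, page_size):
        queries.append(
            f"SELECT * FROM {table} ORDER BY {order_col} "
            f"LIMIT {page_size} OFFSET {offset}"
        )
    return queries

# A 25,000-row table with a page size of 10,000 yields 3 queries.
for q in page_queries("orders", "id", 25000, 10000):
    print(q)
```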
Regards,
Matt

> On May 3, 2017, at 1:58 PM, Anil Rai wrote:
>
> Hi Matt,
>
> I quickly developed this, and this is how I could do it:
>
> DataLake <- ExecuteSQL -> ConvertAvroToJSON -> SplitJson -> EvaluateJsonPath -> ReplaceText -> PutSQL -> Postgres (on cloud)
>
> The problem is, this will not scale for huge volumes. Any thoughts?
>
> Regards
> Anil
>
>
>> On Tue, May 2, 2017 at 12:07 PM, Matt Burgess wrote:
>>
>> Yes, that sounds like your best bet, assuming you have the "Maximum
>> Value Column" present in the table you want to migrate. Then a flow
>> might look like:
>>
>> QueryDatabaseTable -> ConvertAvroToJSON -> ConvertJSONToSQL -> PutSQL
>>
>> In this flow the target tables would need to be created beforehand.
>> You might be able to do that with pg_dump or with some fancy SQL that
>> you could send to PutSQL in a separate (do-ahead) flow [1]. For
>> multiple tables, you will need one QueryDatabaseTable for each table;
>> depending on the number of tables and the latency for getting/putting
>> rows, you may be able to share the downstream processing. If that
>> creates a bottleneck, you may want a copy of the above flow for each
>> table. This is drastically improved in NiFi 1.2.0, as you can use
>> ListDatabaseTables -> GenerateTableFetch -> RPG -> Input Port ->
>> ExecuteSQL to perform the migration in parallel across a NiFi cluster.
>>
>> Regards,
>> Matt
>>
>> [1] https://serverfault.com/questions/231952/is-there-a-mysql-equivalent-of-show-create-table-in-postgres
>>
>>
>>> On Tue, May 2, 2017 at 11:18 AM, Anil Rai wrote:
>>> Thanks Matt for the quick reply. We are using NiFi 1.0 release as of now.
>>> It's a Postgres DB on both sides (on prem and on cloud),
>>> and yes, incremental load is what I am looking for...
>>> so with that, you recommend option #2?
>>>
>>> On Tue, May 2, 2017 at 11:00 AM, Matt Burgess wrote:
>>>
>>>> Anil,
>>>>
>>>> Is this a "one-time" migration, meaning you would take the on-prem
>>>> tables and put them on the cloud DB just once? Or would this be an
>>>> incremental operation, where you do the initial move and then take any
>>>> "new" rows from the source and apply them to the target? For the
>>>> latter, there are a couple of options:
>>>>
>>>> 1) Rebuild the cloud DB periodically. You can use ExecuteSQL ->
>>>> [processors] -> PutSQL after perhaps deleting your target
>>>> DB/tables/etc. This could be time-consuming and expensive. The
>>>> processors in between would probably include ConvertAvroToJSON and
>>>> ConvertJSONToSQL.
>>>> 2) Use QueryDatabaseTable or (GenerateTableFetch -> ExecuteSQL) to get
>>>> the source data. For this, your table would need a column whose values
>>>> always increase; that column would be the value of the "Maximum
>>>> Value Column" property in the aforementioned processors' configuration
>>>> dialogs. You would need one QueryDatabaseTable or GenerateTableFetch
>>>> for each table in your DB.
>>>>
>>>> In addition to these current solutions, as of the upcoming NiFi 1.2.0
>>>> release, you will have the following options:
>>>> 3) If the source database is MySQL, you can use the CaptureChangeMySQL
>>>> processor to get binary log events flowing through various processors
>>>> into PutDatabaseRecord to apply them at the target. This pattern is
>>>> true Change Data Capture (CDC), versus the other two options above.
>>>> 4) Option #2 will be improved such that GenerateTableFetch will accept
>>>> incoming flow files, so you can use ListDatabaseTables ->
>>>> GenerateTableFetch -> ExecuteSQL to handle multiple tables with one
>>>> flow.
>>>>
>>>> If this is a one-time migration, a data flow tool might not be the
>>>> best choice; you could consider something like Flyway [1] instead.
>>>>
>>>> Regards,
>>>> Matt
>>>>
>>>> [1] https://flywaydb.org/documentation/command/migrate
>>>>
>>>> On Tue, May 2, 2017 at 10:41 AM, Anil Rai wrote:
>>>>> I have a simple use case.
>>>>>
>>>>> DB (On Premise) and DB (On Cloud).
>>>>>
>>>>> I want to use NiFi to extract data from the on-prem DB (huge volumes)
>>>>> and insert it into the same table structure that is hosted on the cloud.
>>>>>
>>>>> I could use ExecuteSQL on both sides of the fence (to extract from on
>>>>> prem and insert onto the cloud). What processors are needed in between
>>>>> (if at all)? As I am not doing any transformations at all, it is just
>>>>> an extract-and-load use case.
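[Editor's note] The incremental-load pattern discussed in this thread (option #2, the "Maximum Value Column" approach used by QueryDatabaseTable and GenerateTableFetch) can be sketched as follows. This is a simplified illustration of the idea, not NiFi code; the table and column names are hypothetical. The processor remembers the largest value seen for a monotonically increasing column and, on each run, fetches only rows beyond it.

```python
# Sketch (hypothetical names) of the "Maximum Value Column" pattern:
# first run fetches everything, subsequent runs fetch only new rows.

def build_incremental_query(table, max_value_col, last_max):
    """Build the SELECT for one run given the last observed max value."""
    if last_max is None:
        # No state yet: initial full load.
        return f"SELECT * FROM {table}"
    # Incremental load: only rows added since the previous run.
    return f"SELECT * FROM {table} WHERE {max_value_col} > {last_max}"

print(build_incremental_query("orders", "id", None))
print(build_incremental_query("orders", "id", 41523))
```

After each run, the largest `id` value in the fetched rows would be stored as the new `last_max`, which is roughly the state QueryDatabaseTable keeps between executions.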