From: Matt Burgess
Date: Wed, 3 May 2017 15:47:49 -0400
Subject: Re: Data Load
Message-Id: <54028BF3-2A81-4307-A234-3F0FB1C17D1C@gmail.com>
To: dev@nifi.apache.org

Anil,

When you say huge volumes, do you mean a large number of tables, or large tables, or both?

For a large number of tables, you will likely want to upgrade to the upcoming NiFi release so you can use ListDatabaseTables -> GenerateTableFetch -> ExecuteSQL for the source part, although in the meantime you could use ListDatabaseTables -> ReplaceText (to set the SELECT query) -> ExecuteSQL.

For large tables on a single NiFi instance, I recommend using QueryDatabaseTable. Whether you have a max-value column or not, QDT lets you fetch smaller batches of rows per flow file, whereas ExecuteSQL puts the whole result set in one flow file, which can lead to memory issues.

For scalability with large tables, I recommend a NiFi cluster of 3-10 nodes, using a flow of GenerateTableFetch -> RPG -> Input Port -> ExecuteSQL for the source part. In 1.2.0 you'll be able to have ListDatabaseTables at the front, to support a large number of tables. The RPG -> Input Port part distributes the flow files across the cluster, so the downstream flow is executed in parallel, with each node handling a subset of the incoming flow files (rather than each copy of the flow getting every flow file).
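The paging behavior described above can be sketched roughly as follows. This is an illustration of the idea, not NiFi's actual implementation: each generated statement would become one flow file, so downstream ExecuteSQL instances can fetch pages in parallel. The table name, column name, and page size are hypothetical.

```python
# Illustration (hypothetical names) of the paging idea behind
# GenerateTableFetch: split a large table into bounded SELECTs,
# one per flow file, instead of one giant result set.

def page_queries(table, order_col, row_count, page_size):
    """Return one paged SELECT per chunk of `page_size` rows."""
    queries = []
    for offset in range(0, row_count, page_size):
        queries.append(
            f"SELECT * FROM {table} ORDER BY {order_col} "
            f"LIMIT {page_size} OFFSET {offset}"
        )
    return queries

# A 25,000-row table with a page size of 10,000 yields 3 queries.
for q in page_queries("orders", "id", 25000, 10000):
    print(q)
```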
Regards,
Matt

> On May 3, 2017, at 1:58 PM, Anil Rai wrote:
>
> Hi Matt,
>
> I quickly developed this, and this is how I could do it:
>
> DataLake <- ExecuteSQL -> ConvertAvroToJSON -> SplitJson -> EvaluateJsonPath -> ReplaceText -> PutSQL -> Postgres (on cloud)
>
> The problem is, this will not scale for huge volumes. Any thoughts?
>
> Regards
> Anil
>
>
>> On Tue, May 2, 2017 at 12:07 PM, Matt Burgess wrote:
>>
>> Yes, that sounds like your best bet, assuming you have the "Maximum
>> Value Column" present in the table you want to migrate. Then a flow
>> might look like:
>>
>> QueryDatabaseTable -> ConvertAvroToJSON -> ConvertJSONToSQL -> PutSQL
>>
>> In this flow the target tables would need to be created beforehand.
>> You might be able to do that with pg_dump or with some fancy SQL that
>> you could send to PutSQL in a separate (do-ahead) flow [1]. For
>> multiple tables, you will need one QueryDatabaseTable for each table;
>> depending on the number of tables and the latency for getting/putting
>> rows, you may be able to share the downstream processing. If that
>> creates a bottleneck, you may want a copy of the above flow for each
>> table. This is drastically improved in NiFi 1.2.0, as you can use
>> ListDatabaseTables -> GenerateTableFetch -> RPG -> Input Port ->
>> ExecuteSQL to perform the migration in parallel across a NiFi cluster.
>>
>> Regards,
>> Matt
>>
>> [1] https://serverfault.com/questions/231952/is-there-a-mysql-equivalent-of-show-create-table-in-postgres
>>
>>
>>> On Tue, May 2, 2017 at 11:18 AM, Anil Rai wrote:
>>> Thanks Matt for the quick reply. We are using NiFi 1.0 release as of now.
>>> It's a Postgres DB on both sides (on prem and on cloud),
>>> and yes, incremental load is what I am looking for...
>>> so with that, you recommend option #2?
>>>
>>> On Tue, May 2, 2017 at 11:00 AM, Matt Burgess wrote:
>>>
>>>> Anil,
>>>>
>>>> Is this a "one-time" migration, meaning you would take the on-prem
>>>> tables and put them on the cloud DB just once? Or would this be an
>>>> incremental operation, where you do the initial move and then take any
>>>> "new" rows from the source and apply them to the target? For the
>>>> latter, there are a couple of options:
>>>>
>>>> 1) Rebuild the cloud DB periodically. You can use ExecuteSQL ->
>>>> [processors] -> PutSQL after perhaps deleting your target
>>>> DB/tables/etc. This could be time-consuming and expensive. The
>>>> processors in between would probably include ConvertAvroToJSON and
>>>> ConvertJSONToSQL.
>>>> 2) Use QueryDatabaseTable or (GenerateTableFetch -> ExecuteSQL) to get
>>>> the source data. For this, your table would need a column whose values
>>>> always increase; that column would be the value of the "Maximum
>>>> Value Column" property in the aforementioned processors' configuration
>>>> dialogs. You would need one QueryDatabaseTable or GenerateTableFetch
>>>> for each table in your DB.
>>>>
>>>> In addition to these current solutions, as of the upcoming NiFi 1.2.0
>>>> release, you will have the following options:
>>>> 3) If the source database is MySQL, you can use the CaptureChangeMySQL
>>>> processor to get binary log events flowing through various processors
>>>> into PutDatabaseRecord to apply them at the target. This pattern is
>>>> true Change Data Capture (CDC), versus the other two options above.
>>>> 4) Option #2 will be improved such that GenerateTableFetch will accept
>>>> incoming flow files, so you can use ListDatabaseTables ->
>>>> GenerateTableFetch -> ExecuteSQL to handle multiple tables with one
>>>> flow.
>>>>
>>>> If this is a one-time migration, a data flow tool might not be the
>>>> best choice; you could consider something like Flyway [1] instead.
>>>>
>>>> Regards,
>>>> Matt
>>>>
>>>> [1] https://flywaydb.org/documentation/command/migrate
>>>>
>>>> On Tue, May 2, 2017 at 10:41 AM, Anil Rai wrote:
>>>>> I have a simple use case.
>>>>>
>>>>> DB (On Premise) and DB (On Cloud).
>>>>>
>>>>> I want to use NiFi to extract data from the on-prem DB (huge volumes)
>>>>> and insert it into the same table structure that is hosted on the cloud.
>>>>>
>>>>> I could use ExecuteSQL on both sides of the fence (to extract from on
>>>>> prem and insert onto the cloud). What processors are needed in between
>>>>> (if at all)? As I am not doing any transformations at all, it is just
>>>>> an extract-and-load use case.
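[Editor's note] The incremental-load pattern discussed in this thread (option #2, the "Maximum Value Column" approach used by QueryDatabaseTable and GenerateTableFetch) can be sketched as follows. This is a simplified illustration of the idea, not NiFi code; the table and column names are hypothetical. The processor remembers the largest value seen for a monotonically increasing column and, on each run, fetches only rows beyond it.

```python
# Sketch (hypothetical names) of the "Maximum Value Column" pattern:
# first run fetches everything, subsequent runs fetch only new rows.

def build_incremental_query(table, max_value_col, last_max):
    """Build the SELECT for one run given the last observed max value."""
    if last_max is None:
        # No state yet: initial full load.
        return f"SELECT * FROM {table}"
    # Incremental load: only rows added since the previous run.
    return f"SELECT * FROM {table} WHERE {max_value_col} > {last_max}"

print(build_incremental_query("orders", "id", None))
print(build_incremental_query("orders", "id", 41523))
```

After each run, the largest `id` value in the fetched rows would be stored as the new `last_max`, which is roughly the state QueryDatabaseTable keeps between executions.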