From: "Fustes, Diego"
To: dev@phoenix.apache.org
Subject: Bulk load for binary file formats
Date: Thu, 7 Jan 2016 10:55:13 +0000

Hi all,

In our project we need to ingest large amounts of data (1 TB stored in custom binary files) into HBase using Phoenix. At the moment, we convert the binary files to CSV and use the bulk load tool included in Phoenix. Unfortunately, this process takes too long, because we first need to store large files in HDFS (10 TB in CSV) and then run the MapReduce job that converts these files to HFiles.
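
To illustrate why the CSV copy grows so much compared with the binary source, here is a minimal sketch. The record layout (one 64-bit id followed by four 32-bit floats) and the sample values are hypothetical and are not our actual format; the real ratio depends on the data.

```python
import struct

# Hypothetical record layout (an assumption, not our real format):
# one little-endian 64-bit integer id followed by four 32-bit floats.
RECORD = struct.Struct("<q4f")


def to_binary(record_id, values):
    """Pack one record into its fixed-width binary form."""
    return RECORD.pack(record_id, *values)


def to_csv_line(record_id, values):
    """Render the same record as one UTF-8 encoded CSV line."""
    fields = [str(record_id)] + [repr(v) for v in values]
    return (",".join(fields) + "\n").encode("utf-8")


# Made-up sample values, chosen only for illustration.
record_id = 1234567890123
values = [0.1, 1234.5678, -0.000123456, 98765.4321]

binary_size = len(to_binary(record_id, values))
csv_size = len(to_csv_line(record_id, values))

print("binary bytes per record:", binary_size)
print("CSV bytes per record:", csv_size)
print("CSV / binary ratio:", round(csv_size / binary_size, 2))
```

Every number in the CSV line is spelled out digit by digit and separated by commas, whereas the binary record has a fixed width, so the text form is larger for records like this one.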

 

I think it should be considerably faster and more compact to use another file format (for example, Avro) as intermediate storage for bulk loading. Could this be implemented in an upcoming release of Phoenix?

Another possibility is that we create the HFiles directly in our code. How complex would that be?

With kind regards,

Diego

NDT GDAC Spain S.L.

Diego Fustes, Big Data and Machine Learning Expert

Gran Vía de les Corts Catalanes 130, 11th floor

08038 Barcelona, Spain

Phone: +34 93 43 255 27

diego.fustes@ndt-global.com

www.ndt-global.com