From: Nick Dimiduk
To: "dev@phoenix.apache.org"
Date: Thu, 7 Jan 2016 09:06:21 -0800
Subject: Re: Bulk load for binary file formats
In-Reply-To: <94D1788FCD7BF74994109308ADEC87CA39F79CB1@NDT-DE-KA-MBX01.ndt.ndt-eng.de>

Hi Diego,

I recommend the latter: creating HFiles directly from your application.
That is, unless you have a specific need for the intermediate format.

I recently did some work in this area, abstracting the bulk load tooling
somewhat to add support for loading from JSON files. I support continuing
this abstraction/refactoring effort. Have a look at the code in and around
o.a.p.mapreduce.AbstractBulkLoadTool. You can probably implement your
custom format reader based on that harness. If not, I'm happy to
review/commit any changes necessary to support other extensions.

Unfortunately, right now the only interface we keep compatible across
versions is the SQL interface, which means we may change these classes
from release to release. Perhaps not terribly often, but keep this in mind
as you press forward with your efforts.

Let us know if you have further questions,
-n

On Thursday, January 7, 2016, Fustes, Diego wrote:

> Hi all,
>
> In our project we need to ingest large amounts of data (1TB stored in
> custom binary files) into HBase using Phoenix. To do so, at the moment,
> we are converting the binary files to CSV and using the bulk load tool
> included in Phoenix. Unfortunately, this process takes too long, given
> that we need to store big files in HDFS (10TB in CSV) and then run the
> MapReduce job to convert these files to HFiles.
>
> I think that it should be considerably faster and more compact to use
> another file format (for example Avro) as intermediate storage for bulk
> loading. Could this be implemented in the next releases of Phoenix?
>
> Another possibility is that we create the HFiles directly in our code.
> How complex would that be?
>
> With kind regards,
>
> Diego
>
> NDT GDAC Spain S.L.
> Diego Fustes, Big Data and Machine Learning Expert
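
As a rough illustration of the "custom format reader" idea discussed in this thread, here is a minimal sketch of decoding binary records straight into rows, skipping the CSV step. Everything in it is an assumption made for illustration: the record layout (8-byte id, 8-byte timestamp, 8-byte double, big-endian), the class name BinaryRecordReader, and its methods are hypothetical, and none of it is Phoenix or HBase API. In a real job, a decoder like this would probably sit in the map phase and hand rows to whatever bulk load harness is used.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical fixed-width record layout (an assumption, not from the thread):
// 8-byte id, 8-byte timestamp in milliseconds, 8-byte double value, big-endian.
final class BinaryRecordReader {
    static final int RECORD_BYTES = 2 * Long.BYTES + Double.BYTES;

    // One decoded record; in a bulk load job this would become one table row.
    static final class Row {
        final long id;
        final long timestampMillis;
        final double value;

        Row(long id, long timestampMillis, double value) {
            this.id = id;
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    // Encodes one record in the hypothetical layout (handy for building test input).
    static byte[] encode(long id, long timestampMillis, double value) {
        return ByteBuffer.allocate(RECORD_BYTES)
                .putLong(id)
                .putLong(timestampMillis)
                .putDouble(value)
                .array();
    }

    // Decodes every record remaining in the buffer; rejects truncated input.
    static List<Row> decodeAll(ByteBuffer buf) {
        if (buf.remaining() % RECORD_BYTES != 0) {
            throw new IllegalArgumentException(
                    "input length is not a multiple of " + RECORD_BYTES + " bytes");
        }
        List<Row> rows = new ArrayList<>(buf.remaining() / RECORD_BYTES);
        while (buf.hasRemaining()) {
            long id = buf.getLong();
            long timestampMillis = buf.getLong();
            double value = buf.getDouble();
            rows.add(new Row(id, timestampMillis, value));
        }
        return rows;
    }
}
```

Calling BinaryRecordReader.decodeAll(ByteBuffer.wrap(bytes)) on a chunk of input yields one Row per 24-byte record. For files in the terabyte range, the input would need to be read in record-aligned chunks rather than loaded whole.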