phoenix-dev mailing list archives

From Nick Dimiduk <ndimi...@apache.org>
Subject Re: Bulk load for binary file formats
Date Thu, 07 Jan 2016 17:06:21 GMT
Hi Diego,

I recommend the latter -- creating HFiles directly from your application.
That is, unless you have a specific need for the intermediate format.

I recently did some work in this area, abstracting the bulkload tooling
somewhat to add support for loading from JSON files. I'd support continuing
that abstraction/refactoring effort. Have a look at the code in and around
o.a.p.mapreduce.AbstractBulkLoadTool. You can probably implement your custom
format reader on top of that harness. If not, I'm happy to review/commit any
changes necessary to support other extensions.
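
As a rough illustration of the format-specific piece such a reader would need,
here is a minimal sketch in plain Java. Everything in it is an assumption made
for the example: the class name BinaryRecordParser, the fixed 16-byte record
layout (a long id followed by a double reading), and the big-endian byte
order. It does not use or mirror any Phoenix class; wiring the decoded rows
into the bulkload harness depends on the classes in the release you build
against.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical decoder for a custom binary format; not part of Phoenix.
final class BinaryRecordParser {
    // Assumed layout: 8-byte long id, then 8-byte double reading.
    static final int RECORD_SIZE = Long.BYTES + Double.BYTES;

    // Decodes every record in the buffer into an array of column values.
    static List<Object[]> parse(byte[] data) {
        if (data.length % RECORD_SIZE != 0) {
            throw new IllegalArgumentException(
                "input length is not a multiple of " + RECORD_SIZE);
        }
        // Byte order is an assumption of this sketch.
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.BIG_ENDIAN);
        List<Object[]> rows = new ArrayList<>();
        while (buf.remaining() >= RECORD_SIZE) {
            long id = buf.getLong();
            double reading = buf.getDouble();
            rows.add(new Object[] { id, reading });
        }
        return rows;
    }
}
```

Each Object[] holds one row's column values in table order, which is the
general shape a bulk loader needs before it can turn rows into key/values.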

Unfortunately, right now the only interface for which we support
compatibility across versions is the SQL interface, which means we may change
these classes from release to release. Perhaps not terribly often, but keep
this in mind as you press forward with your efforts.

Let us know if you have further questions,
-n

On Thursday, January 7, 2016, Fustes, Diego <Diego.Fustes@ndt-global.com>
wrote:

> Hi all,
>
> In our project we need to ingest large amounts of data (1TB stored in custom
> binary files) into HBase using Phoenix. At the moment, we convert the
> binary files to CSV and use the bulk load tool included in Phoenix.
> Unfortunately, this process takes too long, because we need to store large
> files in HDFS (10TB in CSV) and then run the MapReduce job that converts
> these files to HFiles.
>
> I think it would be considerably faster and more compact to use another
> file format (for example, Avro) as intermediate storage for bulk loading.
> Could this be implemented in an upcoming release of Phoenix?
>
> Another possibility is that we create the HFiles directly in our code. How
> complex would that be?
>
> With kind regards,
>
> Diego
