hive-user mailing list archives

From Marcin Tustin <mtus...@handybook.com>
Subject Re: Loading data containing newlines
Date Fri, 15 Jan 2016 21:39:24 GMT
I second this. I've generally found anything else to be disappointing when
working with data which is at all funky.

On Wed, Jan 13, 2016 at 8:13 PM, Alexander Pivovarov <apivovarov@gmail.com>
wrote:

> Time to use Spark and Spark SQL in addition to Hive?
> It's probably going to happen sooner or later anyway.
>
> I sent you a Spark solution yesterday (you just need to write an
> unbzip2AndCsvToListOfArrays(file: String): List[Array[String]] function
> using BZip2CompressorInputStream and the Super CSV API).
> You can download Spark, open spark-shell, and run/debug the program on a
> single computer,
>
> and then run it on a cluster if needed (e.g. Amazon EMR can spin up a Spark
> cluster in 7 min)
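
A minimal sketch of the helper described above, for anyone who wants to try the idea before setting up Spark. It is written in Python with the standard-library bz2 and csv modules rather than Scala with BZip2CompressorInputStream and Super CSV, so it illustrates the approach and is not the code that was sent. The function name unbzip2_and_csv_to_rows, the write_sample helper, and the sample data are all hypothetical.

```python
import bz2
import csv
import os
import tempfile


def unbzip2_and_csv_to_rows(path):
    """Decompress a .bz2 CSV file and parse it into a list of rows."""
    # newline="" leaves line endings untouched so the csv module can keep
    # a newline that sits inside a quoted field as part of that field.
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return list(csv.reader(handle))


def write_sample(path):
    """Write a tiny, made-up sample file for the demonstration."""
    # The first data record has a newline inside a quoted field.
    sample = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'
    with bz2.open(path, mode="wt", encoding="utf-8", newline="") as handle:
        handle.write(sample)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample_path = os.path.join(tmp, "sample.csv.bz2")
        write_sample(sample_path)
        for row in unbzip2_and_csv_to_rows(sample_path):
            print(row)
```

Collecting every row into a list mirrors the List[Array[String]] signature but holds the whole file in memory; at larger volumes, iterating over csv.reader directly is likely the better fit.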
>
> On Wed, Jan 13, 2016 at 4:13 PM, Gerber, Bryan W <Bryan.Gerber@pnnl.gov>
> wrote:
>
>> 1.       hdfs dfs -copyFromLocal /incoming/files/*.bz2
>> hdfs://host.name/data/stg/table/
>>
>> 2.       CREATE EXTERNAL TABLE stg_<table> (cols…) ROW FORMAT SERDE
>> 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION
>> '/data/stg/table/'
>>
>> 3.       CREATE TABLE <table> (cols…) STORED AS ORC tblproperties
>> ("orc.compress"="ZLIB");
>>
>> 4.       INSERT INTO TABLE <table> SELECT cols, udf1(cola),
>> udf2(colb), functions(), etc. FROM stg_<table>
>>
>> 5.       Delete files from hdfs://host.name/data/stg/table/
>>
>>
>>
>> This had been working quite well until our newest data arrived, which
>> contains fields with embedded newlines.
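
To illustrate the failure mode: as far as I understand it, a TEXTFILE table is split into records at line breaks before the SerDe sees each row, so a quoted field containing a newline ends up spread over two rows. The sketch below (Python, with made-up sample data and hypothetical function names) compares a line-based record count with what a quote-aware CSV parser sees.

```python
import csv
import io

# Made-up sample: a header plus two data records, the first of which
# has a newline inside a quoted field.
RAW = 'id,comment\n1,"line one\nline two"\n2,plain\n'


def count_physical_lines(text):
    """Count records the way a purely line-oriented reader would."""
    return len(text.splitlines())


def count_csv_records(text):
    """Count records with a parser that honours quoted fields."""
    # newline="" keeps the embedded newline intact for the csv module.
    return sum(1 for _ in csv.reader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    print("line-based records:", count_physical_lines(RAW))
    print("quote-aware records:", count_csv_records(RAW))
```

The two counts differ by one for every embedded newline, which is one quick way to check whether a given file is affected.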
>>
>>
>>
>> We are now looking into options further up the pipeline to see if we can
>> condition the data earlier in the process.
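
One possible shape for that upstream conditioning, as a sketch only: stream each .bz2 file through a quote-aware CSV parser and write it back out compressed, with newlines inside fields replaced by a space. Nothing is unpacked to disk, but the data is still decompressed and recompressed, and the replacement is lossy. The function name flatten_embedded_newlines and the replacement parameter are assumptions for illustration, not part of the pipeline described above.

```python
import bz2
import csv


def flatten_embedded_newlines(src_path, dst_path, replacement=" "):
    """Copy a bz2 CSV to a new bz2 CSV, replacing newlines inside fields.

    Returns the number of records written.
    """
    count = 0
    with bz2.open(src_path, "rt", encoding="utf-8", newline="") as src, \
            bz2.open(dst_path, "wt", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst, lineterminator="\n")
        for row in csv.reader(src):
            cleaned = [
                # Handle \r\n first so it becomes a single replacement.
                field.replace("\r\n", replacement)
                .replace("\n", replacement)
                .replace("\r", replacement)
                for field in row
            ]
            writer.writerow(cleaned)
            count += 1
    return count
```

After this pass every record occupies exactly one physical line, so a line-oriented reader and a CSV parser should agree on the record count.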
>>
>>
>>
>> *From:* Mich Talebzadeh [mailto:mich@peridale.co.uk]
>> *Sent:* Wednesday, January 13, 2016 10:34 AM
>>
>> *To:* user@hive.apache.org
>> *Subject:* RE: Loading data containing newlines
>>
>>
>>
>> Thanks Bryan.
>>
>>
>>
>> Just to clarify do you use something like below?
>>
>>
>>
>> 1.  hdfs dfs -copyFromLocal /var/tmp/t.bcp
>> hdfs://rhes564.hedat.net:9000/misc/t.bcp
>>
>> 2.  CREATE EXTERNAL TABLE <TABLE> name (col1 INT, col2 string, …)
>> COMMENT 'load from bcp file' ROW FORMAT DELIMITED FIELDS TERMINATED BY
>> ',' STORED AS ORC
>>
>>
>>
>> Cheers,
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> *Sybase ASE 15 Gold Medal Award 2008*
>>
>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>
>>
>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>
>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
>> 15", ISBN 978-0-9563693-0-7*.
>>
>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>> 978-0-9759693-0-4*
>>
>> *Publications due shortly:*
>>
>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>> 978-0-9563693-3-8
>>
>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
>> one out shortly
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> NOTE: The information in this email is proprietary and confidential. This
>> message is for the designated recipient only, if you are not the intended
>> recipient, you should destroy it immediately. Any information in this
>> message shall not be understood as given or endorsed by Peridale Technology
>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
>> the responsibility of the recipient to ensure that this email is virus
>> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
>> accept any responsibility.
>>
>>
>>
>> *From:* Gerber, Bryan W [mailto:Bryan.Gerber@pnnl.gov]
>> *Sent:* 13 January 2016 18:12
>> *To:* user@hive.apache.org
>> *Subject:* RE: Loading data containing newlines
>>
>>
>>
>> We are pushing the compressed text files into the HDFS directory for a
>> Hive EXTERNAL table, then running an INSERT into a table that uses ORC
>> storage. We are letting Hive handle the ORC file creation process.
>>
>>
>>
>> *From:* Mich Talebzadeh [mailto:mich@peridale.co.uk]
>>
>> *Sent:* Tuesday, January 12, 2016 4:41 PM
>> *To:* user@hive.apache.org
>> *Subject:* RE: Loading data containing newlines
>>
>>
>>
>> Hi Bryan,
>>
>>
>>
>> As a matter of interest, are you loading text files into local directories
>> in encrypted format at all and then pushing them into HDFS/Hive as ORC?
>>
>>
>>
>> Thanks
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>>
>> *From:* Gerber, Bryan W [mailto:Bryan.Gerber@pnnl.gov]
>> *Sent:* 12 January 2016 17:41
>> *To:* user@hive.apache.org
>> *Subject:* Loading data containing newlines
>>
>>
>>
>> We are attempting to load CSV text files (compressed to bz2) containing
>> newlines in fields using EXTERNAL tables and INSERT/SELECT into ORC format
>> tables.  Data volume is ~1 TB/day, so we are really trying to avoid
>> unpacking the files to condition the data.
>>
>>
>>
>> A few days of research has us ready to implement custom  input/output
>> formats to handle the ingest.  Any other suggestions that may be less
>> effort with low impact to load times?
>>
>>
>>
>> Thanks,
>>
>> Bryan G.
>>
>
>


