From: "Mich Talebzadeh" <mich@peridale.co.uk>
To: user@hive.apache.org
Subject: RE: Loading data containing newlines
Date: Fri, 15 Jan 2016 23:33:43 -0000
Thanks Ryan,

Very useful to know indeed.

Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free; therefore neither Peridale Technology Ltd, its subsidiaries nor their employees accept any responsibility.

From: Ryan Harris [mailto:Ryan.Harris@zionsbancorp.com]
Sent: 15 January 2016 23:31
To: user@hive.apache.org
Subject: RE: Loading data containing newlines

Mich, if you have a toolpath that you can use to pipeline the required edits to the source file, you can use a chain similar to this:

hadoop fs -text ${hdfs_path}/${orig_filename} | iconv -f EBCDIC-US -t ASCII | sed 's/\(.\{133\}\)/\1\n/g' | gzip -c | /usr/bin/hadoop fs -put - /etl/${table_name}/load/${orig_filename}.gz

to clean up your source input data as you drop it into the initial external table location that Hive will use in a Hive-based ELT chain.

It really depends on your upstream data path... if data were being collected by Flume, you might be able to clean it up there. It is also possible to handle this with custom Hive SerDes, but it depends on where you want to write the code and how much existing data you already have to deal with.

Spark is also a very flexible and useful tool for this sort of problem, with numerous advantages when used as an execution engine, but setting up Spark strictly to resolve this issue seems like overkill to me.

From: Mich Talebzadeh [mailto:mich@peridale.co.uk]
Sent: Friday, January 15, 2016 4:04 PM
To: user@hive.apache.org
Subject: RE: Loading data containing newlines

OK, but I believe there are other similar approaches.

I can take a raw CSV file and customise it using existing shell commands like sed, awk, cut, grep etc., among other things getting rid of blank lines or replacing silly characters.

Bottom line, I want to "eventually" store that CSV file in a Hive table in a format that I can run SQL queries on.

Is that a viable alternative?
Thanks.

From: Marcin Tustin [mailto:mtustin@handybook.com]
Sent: 15 January 2016 21:51
To: user@hive.apache.org
Subject: Re: Loading data containing newlines

You can open a file as an RDD of lines and map whatever custom tokenisation function you want over it; alternatively, you can partition down to a reasonable size and use mapPartitions to map the standard Python csv parser over the partitions.

In general, the advantage of Spark is that you can do anything you like rather than being limited to a specific set of primitives.

On Fri, Jan 15, 2016 at 4:42 PM, Mich Talebzadeh <mich@peridale.co.uk> wrote:

Hi Marcin,

Can you be specific in what way Spark is better suited for this operation compared to Hive?
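To make Marcin's suggestion above concrete, here is a minimal Scala sketch of both variants (a custom tokeniser mapped over an RDD of lines, and a real CSV parser run per partition). The HDFS path and object name are illustrative, Super CSV stands in for the Python csv module Marcin mentions, and records whose quoted newlines straddle a partition boundary are still broken, so treat this as a sketch rather than a drop-in fix:

import java.io.StringReader

import org.apache.spark.{SparkConf, SparkContext}
import org.supercsv.io.CsvListReader
import org.supercsv.prefs.CsvPreference

import scala.collection.JavaConverters._

object TokeniseCsv {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tokenise-csv"))

    // (1) an RDD of lines with a custom tokenisation function mapped over it;
    //     only safe when fields never embed commas or newlines
    val lines = sc.textFile("hdfs:///data/stg/table/")   // hypothetical location
    val naive = lines.map(_.split(",", -1))              // kept only to show approach (1)

    // (2) run a real CSV parser over each partition (coalesce first if you want
    //     fewer, larger partitions); quoted fields containing newlines are then
    //     reassembled, provided a record does not straddle a partition boundary.
    //     mkString buffers the whole partition in memory, which keeps the sketch simple.
    val rows = lines.mapPartitions { part =>
      val reader = new CsvListReader(new StringReader(part.mkString("\n")),
                                     CsvPreference.STANDARD_PREFERENCE)
      Iterator.continually(reader.read()).takeWhile(_ != null)
        .map(_.asScala.toArray)
    }

    rows.take(5).foreach(r => println(r.mkString(" | ")))
    sc.stop()
  }
}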
From: Marcin Tustin [mailto:mtustin@handybook.com]
Sent: 15 January 2016 21:39
To: user@hive.apache.org
Subject: Re: Loading data containing newlines

I second this. I've generally found anything else to be disappointing when working with data which is at all funky.

On Wed, Jan 13, 2016 at 8:13 PM, Alexander Pivovarov <apivovarov@gmail.com> wrote:

Time to use Spark and Spark SQL in addition to Hive? It's probably going to happen sooner or later anyway.

I sent you a Spark solution yesterday. (You just need to write an unbzip2AndCsvToListOfArrays(file: String): List[Array[String]] function using BZip2CompressorInputStream and the Super CSV API.)

You can download Spark, open spark-shell and run/debug the program on a single computer, and then run it on a cluster if needed (e.g. Amazon EMR can spin up a Spark cluster in 7 min).

On Wed, Jan 13, 2016 at 4:13 PM, Gerber, Bryan W <Bryan.Gerber@pnnl.gov> wrote:

1. hdfs dfs -copyFromLocal /incoming/files/*.bz2 hdfs://host.name/data/stg/table/
2. CREATE EXTERNAL TABLE stg_<table> (cols…) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION '/data/stg/table/'
3. CREATE TABLE <table> (cols…) STORED AS ORC tblproperties ("orc.compress"="ZLIB");
4. INSERT INTO TABLE <table> SELECT cols, udf1(cola), udf2(colb), functions(), etc. FROM ext_<table>
5. Delete files from hdfs://host.name/data/stg/table/

This has been working quite well, until our newest data contains fields with embedded newlines.

We are now looking into options further up the pipeline to see if we can condition the data earlier in the process.

From: Mich Talebzadeh [mailto:mich@peridale.co.uk]
Sent: Wednesday, January 13, 2016 10:34 AM
To: user@hive.apache.org
Subject: RE: Loading data containing newlines

Thanks Bryan.

Just to clarify, do you use something like the below?

1. hdfs dfs -copyFromLocal /var/tmp/t.bcp hdfs://rhes564.hedat.net:9000/misc/t.bcp
2. CREATE EXTERNAL TABLE <TABLE> name (col1 INT, col2 string, …) COMMENT 'load from bcp file' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC

Cheers,

From: Gerber, Bryan W [mailto:Bryan.Gerber@pnnl.gov]
Sent: 13 January 2016 18:12
To: user@hive.apache.org
Subject: RE: Loading data containing newlines

We are pushing the compressed text files into an HDFS directory for a Hive EXTERNAL table, then using an INSERT on that table into ORC storage. We are letting Hive handle the ORC file creation process.

From: Mich Talebzadeh [mailto:mich@peridale.co.uk]
Sent: Tuesday, January 12, 2016 4:41 PM
To: user@hive.apache.org
Subject: RE: Loading data containing newlines

Hi Bryan,

As a matter of interest, are you loading text files into local directories in encrypted format at all and then pushing them into HDFS/Hive as ORC?

Thanks
From: Gerber, Bryan W [mailto:Bryan.Gerber@pnnl.gov]
Sent: 12 January 2016 17:41
To: user@hive.apache.org
Subject: Loading data containing newlines

We are attempting to load CSV text files (compressed to bz2) containing newlines in fields, using EXTERNAL tables and INSERT/SELECT into ORC format tables. Data volume is ~1TB/day, so we are really trying to avoid unpacking the files to condition the data.

A few days of research has us ready to implement custom input/output formats to handle the ingest. Any other suggestions that may be less effort, with low impact on load times?

Thanks,
Bryan G.
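Alexander's helper earlier in the thread is only named by its signature. A possible implementation, assuming Apache Commons Compress and Super CSV are on the classpath, that the whole decompressed file fits in memory, UTF-8 input, and null columns mapped to empty strings (the object name is illustrative), might look like this:

import java.io.{FileInputStream, InputStreamReader}

import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream
import org.supercsv.io.CsvListReader
import org.supercsv.prefs.CsvPreference

import scala.collection.JavaConverters._

object Bz2Csv {
  // Decompress a local .bz2 file and parse it as CSV in one pass; quoted fields
  // that contain embedded newlines stay inside a single record.
  def unbzip2AndCsvToListOfArrays(file: String): List[Array[String]] = {
    val reader = new CsvListReader(
      new InputStreamReader(
        new BZip2CompressorInputStream(new FileInputStream(file)), "UTF-8"),
      CsvPreference.STANDARD_PREFERENCE)
    try {
      Iterator.continually(reader.read())              // one logical CSV record per call
        .takeWhile(_ != null)
        .map(row => row.asScala.map(c => Option(c).getOrElse("")).toArray)
        .toList                                        // forces the iterator before close()
    } finally {
      reader.close()
    }
  }

  def main(args: Array[String]): Unit =
    unbzip2AndCsvToListOfArrays(args(0)).take(5).foreach(r => println(r.mkString(" | ")))
}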
