hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Grover <mgro...@oanda.com>
Subject Re: Data are not displayed correctly on hive tables
Date Wed, 09 May 2012 00:20:50 GMT
Hi Roshan, 
You are right. '\n' in your XML content is going to give you problems. The table you created
in Hive assumes one record = 1 '\n' terminated row from your file. I would recommend sanitizing
you data before you load it in to get rid of '\n's.

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 

----- Original Message -----
From: "mperformer" <codevally.mail.list@gmail.com>
To: user@hive.apache.org
Sent: Tuesday, May 8, 2012 6:22:03 PM
Subject: Re: Data are not displayed correctly on hive tables

Hi Mark 

Thanks for the reply. 

In the HDFS, the row looks like: 

1~Order Conf Req~<?xml version="1.0" encoding="UTF-8" standalone="yes"?> 
<message scope="" type=""> 
<messageId>0</messageId> 
<creatorId>0</creatorId> 
<creatorType/> 
<workspaceId>0</workspaceId> 
<workspaceType/> 
<createdTime>2006-06-01T17:59:09.413+10:00</createdTime> 
<status>0</status> 
<from/> 
<to/> 
<cc/> 
<bcc/> 
<nickname/><subject>Order Conf </subject> 
<body>Prod 

Qty Amt 

Dlvry Date 

Price 
</body> 
<smspreview/> 
<email bool="true"> 
<type/> 
<body>Prod 

Qty Amt 

Dlvry Date 

Price 
</body> 
<footer/> 
</email> 
<voice bool="true"> 
<type/> 
<header/> 

<body/> 
<footer/> 
<other/> 
</voice><web bool="false"><type></type><header></header><body></body><footer></footer></web>

<notes/> 
<alert bool="false"> 
<value/> 
</alert> 
<dlr bool="true"> 
<type>ReportUnDelivered</type> 
<rule>Read</rule> 
<period>5 mins</period><publishToWeb>false</publishToWeb><expiryValue>0</expiryValue><expiryPeriod>0</expiryPeriod>

</dlr> 
</message> 

~PROJECT~General Message~No description~null~true~2~28~0 

I used "~" as my field separator. But when I do a select query to my hive table, it only shows
the " <?xml version="1.0" encoding="UTF-8" standalone="yes"?>" part from XML data. 

Seems there are new line character (\n) in my XML piece and it is the default row terminator
in HIVE. If this is the case, is there any way to use another character as the row terminator?


Many Thanks. 

/Roshan 



On Mon, May 7, 2012 at 11:38 PM, Mark Grover < mgrover@oanda.com > wrote: 


Hi Roshan, 
The following snippet summarizes the delimiters for your Hive table: 

colelction.delim \u0002 
field.delim \u0001 
mapkey.delim \u0003 
serialization.format \u0001 


Your fields are delimited by \u0001, collections are delimited by \u0002 and the delimiter
between the key and value in any maps is \u0003. Can you verify that your XML content doesn't
contain any of these characters? 

If this still doesn't help, could you pick an affected row and share what the XML appears
as in Hive and what it is expected to be? 

Good luck! 
Mark 


Mark Grover, Business Intelligence Analyst 
OANDA Corporation 

www: oanda.com www: fxtrade.com 

----- Original Message ----- 
From: "mperformer" < codevally.mail.list@gmail.com > 
To: user@hive.apache.org 


Sent: Sunday, May 6, 2012 11:34:55 PM 
Subject: Re: Data are not displayed correctly on hive tables 

Hi Mark 

Many thanks for your reply. Please find the below output. 

hive> describe formatted messagetemplate; 
OK 
# col_name data_type comment 

messagetemplateid bigint None 
messagetemplatename string None 
datacol string None 
messagetemplatetype string None 
messagetype string None 
messagetemplatedescription string None 
originatingtemplateid bigint None 
edited boolean None 
userid bigint None 
projectid bigint None 
responsetemplateid bigint None 

# Detailed Table Information 
Database: default 
Owner: root 
CreateTime: Mon May 07 12:06:59 EST 2012 
LastAccessTime: UNKNOWN 
Protect Mode: None 
Retention: 0 
Location: hdfs://app6:9100/mnt/hive-test/warehouse/messagetemplate 
Table Type: MANAGED_TABLE 
Table Parameters: 
comment This is the messagetemplate table 
transient_lastDdlTime 1336356473 

# Storage Information 
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe 
InputFormat: org.apache.hadoop.mapred.TextInputFormat 
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat 
Compressed: No 
Num Buckets: -1 
Bucket Columns: [] 
Sort Columns: [] 
Storage Desc Params: 
colelction.delim \u0002 
field.delim \u0001 
mapkey.delim \u0003 
serialization.format \u0001 
Time taken: 3.2 seconds 

Thanks again. 

./Roshan. 


On Mon, May 7, 2012 at 1:06 PM, Mark Grover < mgrover@oanda.com > wrote: 


Could you share the output of the following command in Hive: 
describe formatted messagetemplate 

My hunch is that your Hive table is using a delimiter (e.g. '\t') that appears in the content
of your XML. 

Mark Grover, Business Intelligence Analyst 
OANDA Corporation 

www: oanda.com www: fxtrade.com 



----- Original Message ----- 
From: "mperformer" < codevally.mail.list@gmail.com > 
To: user@hive.apache.org 
Sent: Sunday, May 6, 2012 8:34:27 PM 
Subject: Data are not displayed correctly on hive tables 

Hi 

I am using 

• Hadoop 0.20.2 
• Hive 0.8.1 
• Sqoop 1.4.1-incubating 



in my sample project. Currently I am importing data from PostgreSQL to Hive table using Sqoop.
My database table in PostgreSQL has 4 columns and one column stores a bit large XML file as
TEXT data type. The same column defined in HIVE as string, but after that column data is not
importing and shows as null; 

Table structure in PostgreSQL 

CREATE TABLE public.messagetemplate ( 
messagetemplateid BIGSERIAL, 
messagetemplatename TEXT, 
data TEXT, 
messagetemplatetype TEXT, 
CONSTRAINT pk_messagetemplate PRIMARY KEY(messagetemplateid) 
) WITHOUT OIDS; 

Table structure in Hive 

hive> desc messagetemplate; 
OK 
messagetemplateid bigint 
messagetemplatename string 
data string 
messagetemplatetype string 


The data column store the XML file as text, but during the import to hive, all data are imported
properly (checked the files in HDFS). But using HIVE select statement, it only shows small
part from the XML text and the rest column (last column) is null. 

Could someone please help me to sort this out. Thanks. 







Mime
View raw message