From: Josh Ferguson <josh@besquared.net>
To: hive-user@hadoop.apache.org
Subject: Re: Overwrite into table using only custom mapper
Date: Sun, 11 Jan 2009 17:56:58 -0800

I just want to note that conceptually, using a transform like this is a
series of three steps:

1) dump
2) some stuff in the middle that the semantic analyzer shouldn't care about
3) load

I'm not really sure what the difference is between those steps and what is
actually happening, but I know it feels wrong to have to specify my
delimiter information in multiple places, in every script I want to run
that inserts data into a table containing a column with a MAP or a LIST
type.

Maybe Hive isn't set up like this right now, but I think all the
information needed to do any of those steps (even repeatedly) is already
available somewhere long before this query is ever run, so I should be
able to use that information and not have to specify it again every time I
want to run a query like this.

Maybe there is something I'm missing? What was the use case for the
"multiple processing steps" that you mentioned in your last email?

Josh F.
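Spelled out as explicit statements against the users and distinct_users
tables discussed later in this thread, the three steps would look roughly
like this (a sketch only; the dump directory, script path, and partition
values are hypothetical, not taken from the thread):

    -- 1) dump: write the source rows out as delimited text
    INSERT OVERWRITE DIRECTORY '/tmp/users_dump'
    SELECT occurred_at, id, properties FROM users;

    -- 2) transform: run the user script over the dumped text, outside
    --    the semantic analyzer's view, e.g.
    --      hadoop fs -cat /tmp/users_dump/* | /my/script > transformed.txt

    -- 3) load: pull the script output into the destination table; since
    --    distinct_users is partitioned, a real LOAD also needs concrete
    --    partition values (the ones below are placeholders)
    LOAD DATA LOCAL INPATH 'transformed.txt'
    OVERWRITE INTO TABLE distinct_users
    PARTITION (account='a', application='app', dataset='d', hour=0);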
On Jan 10, 2009, at 11:38 PM, Zheng Shao wrote:

> I don't think it's a good idea to rely on the information from the
> table. The data might go through multiple processing steps before it
> reaches the final destination table, and the destination table may
> store the data in any way (it may not be delimited at all).
>
> What about allowing some syntax like this:
>
>   SELECT TRANSFORM(myint, mymap) ROW FORMAT DELIMITED KEY TERMINATED
>     BY '3' COLLECTION ITEM TERMINATED BY '2'
>   USING '/bin/cat'
>   AS (myint INT, mymap MAP<STRING,STRING>) ROW FORMAT DELIMITED KEY
>     TERMINATED BY '3' COLLECTION ITEM TERMINATED BY '2'
>
> The first ROW FORMAT describes the input format for the script, and
> the second describes the output format of the script.
>
> Zheng
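Applied to the tables from this thread, a complete statement under Zheng's
proposed syntax might read as follows (a sketch of syntax that did not
exist in Hive 0.19; the ',' and ':' delimiters match the colelction.delim=44
and mapkey.delim=58 settings shown in the DESCRIBE output below):

    -- Sketch only: Zheng's proposed syntax, not working 0.19 syntax.
    -- ',' = ASCII 44 (collection items), ':' = ASCII 58 (map keys).
    INSERT OVERWRITE TABLE distinct_users
    SELECT TRANSFORM(users.occurred_at, users.id, users.properties)
             ROW FORMAT DELIMITED KEY TERMINATED BY ':'
               COLLECTION ITEM TERMINATED BY ','
           USING '/bin/cat'
           AS (occurred_at INT, id STRING, properties MAP<STRING,STRING>)
             ROW FORMAT DELIMITED KEY TERMINATED BY ':'
               COLLECTION ITEM TERMINATED BY ','
    FROM users;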
> On Sat, Jan 10, 2009 at 11:22 PM, Josh Ferguson <josh@besquared.net> wrote:
> My initial assumption when I tried to write the query was that it
> would use the same delimiters I defined in the schema definition of
> the target table. That led to my confusion, because I thought Hive
> had enough information (in the schema) to do a proper string ->
> map<x,x> data conversion.
>
> Maybe something like that could work?
>
> Josh F.
>
> On Jan 10, 2009, at 10:46 PM, Zheng Shao wrote:
>
>> Hi Josh,
>>
>> Yes, the transform assumes every output column will be a string.
>> And if the input of the transform is not a string, it will be
>> converted to a string. But we don't have a mechanism in the language
>> to convert a string back to a map<string,string>. What do you
>> think we should do to support that?
>>
>> Zheng
>>
>> On Sat, Jan 10, 2009 at 10:25 PM, Josh Ferguson <josh@besquared.net> wrote:
>> One more small update: it seems that transform doesn't work at all
>> for inserting into columns of type MAP<X,Y>. I suspect this is
>> because the semantic analyzer treats all columns coming out of a
>> custom map phase as type STRING and then complains when it can't
>> convert that assumed type into the type necessary, which is
>> MAP<STRING,STRING> in this case. Is this correct? Is anyone else
>> using a MAP type with custom map or reduce scripts? What queries
>> have you gotten to work?
>>
>> Josh
>>
>> On Sat, Jan 10, 2009 at 12:16 PM, Josh Ferguson <josh@besquared.net> wrote:
>> I want to follow up on this a little. Here are the schemas for the
>> source and destination tables and the query I am trying to run.
>>
>> Source table:
>>
>> hive> DESCRIBE EXTENDED users;
>> OK
>> occurred_at  int
>> id           string
>> properties   map<string,string>
>> account      string
>> application  string
>> dataset      string
>> hour         int
>> Detailed Table Information:
>> Table(tableName:users,dbName:default,owner:Josh,createTime:1231485489,
>> lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:
>> [FieldSchema(name:occurred_at,type:int,comment:null),
>> FieldSchema(name:id,type:string,comment:null),
>> FieldSchema(name:properties,type:map<string,string>,comment:null)],
>> location:/user/hive/warehouse/users,
>> inputFormat:org.apache.hadoop.mapred.TextInputFormat,
>> outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,
>> compressed:false,numBuckets:32,serdeInfo:SerDeInfo(name:null,
>> serializationLib:org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe,
>> parameters:{colelction.delim=44,mapkey.delim=58,serialization.format=
>> org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol}),
>> bucketCols:[id],sortCols:[],parameters:{}),partitionKeys:
>> [FieldSchema(name:account,type:string,comment:null),
>> FieldSchema(name:application,type:string,comment:null),
>> FieldSchema(name:dataset,type:string,comment:null),
>> FieldSchema(name:hour,type:int,comment:null)],parameters:{})
>>
>> Destination table:
>>
>> hive> DESCRIBE EXTENDED distinct_users;
>> OK
>> occurred_at  int
>> id           string
>> properties   map<string,string>
>> account      string
>> application  string
>> dataset      string
>> hour         int
>> Detailed Table Information:
>> Table(tableName:distinct_users,dbName:default,owner:Josh,createTime:
>> 1231488500,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:
>> [FieldSchema(name:occurred_at,type:int,comment:null),
>> FieldSchema(name:id,type:string,comment:null),
>> FieldSchema(name:properties,type:map<string,string>,comment:null)],
>> location:/user/hive/warehouse/distinct_users,
>> inputFormat:org.apache.hadoop.mapred.TextInputFormat,
>> outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,
>> compressed:false,numBuckets:32,serdeInfo:SerDeInfo(name:null,
>> serializationLib:org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe,
>> parameters:{colelction.delim=44,mapkey.delim=58,serialization.format=
>> org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol}),
>> bucketCols:[id],sortCols:[],parameters:{}),partitionKeys:
>> [FieldSchema(name:account,type:string,comment:null),
>> FieldSchema(name:application,type:string,comment:null),
>> FieldSchema(name:dataset,type:string,comment:null),
>> FieldSchema(name:hour,type:int,comment:null)],parameters:{})
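Both descriptors describe the same table shape; reconstructed from the
DESCRIBE EXTENDED output above, the DDL would look roughly like this (a
sketch; the original CREATE TABLE statements do not appear in the thread,
and the exact clauses accepted by Hive 0.19 may have differed):

    -- Reconstructed from the descriptors above, not taken from the thread.
    -- ASCII 44 is ',' (colelction.delim); ASCII 58 is ':' (mapkey.delim).
    CREATE TABLE distinct_users (
      occurred_at INT,
      id STRING,
      properties MAP<STRING,STRING>
    )
    PARTITIONED BY (account STRING, application STRING,
                    dataset STRING, hour INT)
    CLUSTERED BY (id) INTO 32 BUCKETS
    ROW FORMAT DELIMITED
      COLLECTION ITEMS TERMINATED BY ','
      MAP KEYS TERMINATED BY ':'
    STORED AS TEXTFILE;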
>>
>> The query:
>>
>> hive> INSERT OVERWRITE TABLE distinct_users SELECT
>> TRANSFORM(users.occurred_at, users.id, users.properties) USING
>> '/bin/cat' AS (occurred_at, id, properties) FROM users;
>> FAILED: Error in semantic analysis: line 1:23 Cannot insert into
>> target table because column number/types are different
>> distinct_users: Cannot convert column 2 from string to
>> map<string,string>.
>>
>> I'm really confused, because the two tables are exactly the same
>> except for their names, and I'm just trying to do an insert from one
>> of them into the other using a script.
>>
>> For reference, this appears to work:
>>
>> hive> INSERT OVERWRITE TABLE distinct_users SELECT occurred_at, id,
>> properties FROM users;
>>
>> What is it about transforming that is messing up the semantic
>> analysis?
>>
>> Josh Ferguson
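Given Zheng's explanation that transform output columns are always strings,
one workaround consistent with the error above would be to keep the map
serialized as text on the way out of the script and land it in a staging
table whose properties column is STRING (a hypothetical sketch; nothing in
the thread confirms this path, and the staging table name is invented):

    -- Hypothetical workaround: avoid the string -> map conversion the
    -- analyzer cannot do by keeping the map as delimited text
    -- ('k1:v1,k2:v2'). The error above names only column 2, which
    -- suggests the string -> int conversion for column 0 was accepted.
    CREATE TABLE distinct_users_staging (
      occurred_at INT,
      id STRING,
      properties STRING   -- map kept as text
    );

    INSERT OVERWRITE TABLE distinct_users_staging
    SELECT TRANSFORM(users.occurred_at, users.id, users.properties)
    USING '/bin/cat' AS (occurred_at, id, properties)
    FROM users;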
>>
>> On Fri, Jan 9, 2009 at 11:52 AM, Josh Ferguson <josh@besquared.net> wrote:
>> Is it possible to do a query like the following:
>>
>> INSERT OVERWRITE TABLE table1 PARTITION(...)
>> FROM table2
>> SELECT TRANSFORM(table2.col1, table2.col2, ...) USING '/my/script'
>> AS (col1, col2, ...)
>> WHERE (...)
>>
>> I can run the SELECT TRANSFORM segment of the query by itself fine,
>> and I get the results I expect.
>>
>> When I try to do the insert as well, I get errors about column type
>> mismatches, even though my script outputs three columns with the
>> exact same types, in the exact order, that they appear in table1. I
>> tried doing this with both a mapper and a reducer, similar to what
>> was shown in the ApacheCon slides, and it still didn't work. Am I
>> doing something wrong query-wise?
>>
>> I'm using the 0.19 release.
>>
>> Josh Ferguson
>>
>> --
>> Yours,
>> Zheng
>
> --
> Yours,
> Zheng