hive-dev mailing list archives

From "Chaoyu Tang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-4223) LazySimpleSerDe will throw IndexOutOfBoundsException in nested structs of hive table
Date Tue, 30 Jul 2013 02:11:49 GMT

    [ https://issues.apache.org/jira/browse/HIVE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723276#comment-13723276 ]

Chaoyu Tang commented on HIVE-4223:
-----------------------------------

[~java8964] I was not able to reproduce the reported problem in hive-0.9.0, and I wonder
whether it might be related to the data. Here is my test case:
1. create table bcd (col1 array <struct<col1:string, col2:string, col3:string,col4:string,col5:string,col6:string,col7:string,col8:array<struct<col1:string,col2:string,col3:string,col4:string,col5:string,col6:string,col7:string,col8:string,col9:string>>>>)
row format delimited fields terminated by '\001' collection items terminated by '\002' lines
terminated by '\n' stored as textfile;
** should be the same as you described
2. load data local inpath '/root/nest_struct.data' overwrite into table bcd;
** see attached nest_struct.data
3. select col1 from bcd;
** got:
[{"col1":"c1v","col2":"c2v","col3":"c3v","col4":"c4v","col5":"c5v","col6":"c6v","col7":"c7v","col8":[{"col1":"c11v","col2":"c22v","col3":"c33v","col4":"c44v","col5":"c55v","col6":"c66v","col7":"c77v","col8":"c88v","col9":"c99v"}]}]
....

Does anything here differ from your case?
Could you please update the issue with your exact case so that I can try it?
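
For anyone without the attachment, here is a rough sketch of how a one-row data file
matching the output in step 3 could be generated. It is not the attached nest_struct.data.
It assumes LazySimpleSerDe's default nesting separators: \002 between outer array elements
(set in the DDL above), then \003 between outer struct fields, \004 between inner array
elements, and \005 between inner struct fields. The class and method names are made up
for illustration.

```java
public class NestStructRow {
    // Joins the parts with a single separator character.
    static String join(char sep, String... parts) {
        return String.join(String.valueOf(sep), parts);
    }

    // Builds one row: an outer array with one struct, whose col8 is an
    // inner array with one 9-field struct.
    static String buildRow() {
        // Inner struct fields: assumed separator (char) 5.
        // More inner array elements would be joined with (char) 4.
        String innerStruct = join((char) 5,
                "c11v", "c22v", "c33v", "c44v", "c55v", "c66v", "c77v", "c88v", "c99v");
        // Outer struct fields: assumed separator (char) 3.
        // More outer array elements would be joined with (char) 2.
        return join((char) 3,
                "c1v", "c2v", "c3v", "c4v", "c5v", "c6v", "c7v", innerStruct);
    }

    public static void main(String[] args) {
        // Print with control characters made visible, e.g. (char) 3 as ^C.
        StringBuilder visible = new StringBuilder();
        for (char c : buildRow().toCharArray()) {
            if (c < 32) {
                visible.append('^').append((char) ('A' + c - 1));
            } else {
                visible.append(c);
            }
        }
        System.out.println(visible);
    }
}
```

Writing buildRow() plus a trailing '\n' to a file would give something loadable with the
statement in step 2, provided the separator assumption holds for your Hive version.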

 
                
> LazySimpleSerDe will throw IndexOutOfBoundsException in nested structs of hive table
> ------------------------------------------------------------------------------------
>
>                 Key: HIVE-4223
>                 URL: https://issues.apache.org/jira/browse/HIVE-4223
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.9.0
>         Environment: Hive 0.9.0
>            Reporter: Yong Zhang
>         Attachments: nest_struct.data
>
>
> The LazySimpleSerDe will throw IndexOutOfBoundsException if the column is a struct containing an array of structs.
> I have a table with one column defined like this:
> columnA
> array <
>     struct<
>        col1:primiType,
>        col2:primiType,
>        col3:primiType,
>        col4:primiType,
>        col5:primiType,
>        col6:primiType,
>        col7:primiType,
>        col8:array<
>             struct<
>               col1:primiType,
>               col2:primiType,
>               col3:primiType,
>               col4:primiType,
>               col5:primiType,
>               col6:primiType,
>               col7:primiType,
>               col8:primiType,
>               col9:primiType
>             >
>        >
>     >
> >
> In this example, the outer struct has 8 columns (including the array), and the inner struct has 9 columns. As long as the outer struct has FEWER columns than the inner struct, I think we will get the following exception and stack trace in LazySimpleSerDe when it tries to serialize a row:
> Caused by: java.lang.IndexOutOfBoundsException: Index: 8, Size: 8
>         at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>         at java.util.ArrayList.get(ArrayList.java:322)
>         at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
>         at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:443)
>         at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:381)
>         at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:365)
>         at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:568)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>         at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>         at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:132)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>         at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>         at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:531)
>         ... 9 more
> I am not sure of the exact cause of this problem. I believe that the method public static void serialize(ByteStream.Output out, Object obj, ObjectInspector objInspector, byte[] separators, int level, Text nullSequence, boolean escaped, byte escapeChar, boolean[] needsEscape) invokes itself recursively when it encounters a nested structure. But for nested structs, the list reference gets messed up, and size() returns the wrong value.
> In the above example case I faced, 
> for these 2 lines:
>       List<? extends StructField> fields = soi.getAllStructFieldRefs();
>       list = soi.getStructFieldsDataAsList(obj);
> my StructObjectInspector (soi) returns the CORRECT data for the getAllStructFieldRefs() and getStructFieldsDataAsList() methods. For example, in one row, the outer 8-column struct has 2 elements in its inner array of structs, and each element has 9 columns (as there are 9 columns in the inner struct). At runtime, after I added more logging to LazySimpleSerDe, I see the following behavior in the log:
> for 8 outside column, loop
>     for 9 inside columns, loop for serialize
>     for 9 inside columns, loop for serialize
> The code breaks here: the outer loop tries to access a 9th element, which does not exist in the outer struct. As the stack trace shows, it tried to access index 8 of a list of size 8.
> I changed the following line of code, and that appears to fix the problem. I don't know whether it is the right way to fix it, but it did work for me, on the hive 0.9.0 version of the code:
> 481c481,482
> <         for (int i = 0; i < list.size(); i++) {
> ---
> >         int listSize = list.size();
> >         for (int i = 0; i < listSize; i++) {
> I believe the cause of this bug is that when the loop is written the current way, like
>         for (int i = 0; i < list.size(); i++)
> the method list.size() is invoked on every iteration. But with a nested structure, list.size() returns a different result after the recursive call, and that causes the problem I am currently facing.
> Thanks
> Yong Zhang
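
To make the hypothesis quoted above concrete, here is a minimal standalone sketch. It is
not Hive code. It assumes, purely for illustration, that the list returned for the outer
struct is a buffer that the recursive call for the inner struct refills. SharedListDemo,
SHARED, fieldsDataAsList and fieldRefs are hypothetical names, and the inner array is
simplified to a single nested struct in the last field.

```java
import java.util.ArrayList;
import java.util.List;

public class SharedListDemo {
    // Hypothetical buffer reused for every struct (an assumption, not Hive's code).
    private static final List<Object> SHARED = new ArrayList<>();

    // Loosely plays the role of getStructFieldsDataAsList(): refills the shared buffer.
    static List<Object> fieldsDataAsList(Object[] struct) {
        SHARED.clear();
        for (Object field : struct) {
            SHARED.add(field);
        }
        return SHARED;
    }

    // Loosely plays the role of getAllStructFieldRefs(): a fresh list per struct.
    static List<String> fieldRefs(Object[] struct) {
        List<String> refs = new ArrayList<>();
        for (int i = 0; i < struct.length; i++) {
            refs.add("col" + (i + 1));
        }
        return refs;
    }

    static void serialize(Object[] struct, boolean cacheSize, StringBuilder out) {
        List<String> fields = fieldRefs(struct);
        List<Object> list = fieldsDataAsList(struct);
        int listSize = list.size();
        // Without caching, the bound is re-read after the recursive call changed it.
        for (int i = 0; i < (cacheSize ? listSize : list.size()); i++) {
            String name = fields.get(i); // throws once i reaches fields.size()
            Object value = list.get(i);
            if (value instanceof Object[]) {
                out.append(name).append('{');
                serialize((Object[]) value, cacheSize, out); // refills SHARED
                out.append('}');
            } else {
                out.append(name).append('=').append(value).append(';');
            }
        }
    }

    // Outer struct with 8 fields; the last one is an inner struct with 9 fields.
    static Object[] sampleRow() {
        Object[] inner = new Object[9];
        for (int i = 0; i < inner.length; i++) {
            inner[i] = "in" + (i + 1);
        }
        Object[] outer = new Object[8];
        for (int i = 0; i < 7; i++) {
            outer[i] = "out" + (i + 1);
        }
        outer[7] = inner;
        return outer;
    }

    static boolean throwsIndexOutOfBounds(boolean cacheSize) {
        try {
            serialize(sampleRow(), cacheSize, new StringBuilder());
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("re-reading size() throws: " + throwsIndexOutOfBounds(false));
        System.out.println("cached size throws: " + throwsIndexOutOfBounds(true));
    }
}
```

Under that assumption, the uncached loop reaches index 8 on the outer struct's 8 field
refs after the inner struct has grown the shared list to 9 entries, while the cached bound
stops at 8. Caching the size only hides the symptom in this sketch: if the list really
were shared, any outer field read after the recursive call would see the inner struct's data.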

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
