Date: Tue, 30 Jul 2013 02:11:51 +0000 (UTC)
From: "Chaoyu Tang (JIRA)"
To: hive-dev@hadoop.apache.org
Subject: [jira] [Updated] (HIVE-4223) LazySimpleSerDe will throw IndexOutOfBoundsException in nested structs of hive table

     [ https://issues.apache.org/jira/browse/HIVE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chaoyu Tang updated HIVE-4223:
------------------------------

    Attachment: nest_struct.data

data file to my test case -- chaoyu

> LazySimpleSerDe will throw IndexOutOfBoundsException in nested structs of hive table
> ------------------------------------------------------------------------------------
>
>                 Key: HIVE-4223
>                 URL: https://issues.apache.org/jira/browse/HIVE-4223
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.9.0
>         Environment: Hive 0.9.0
>            Reporter: Yong Zhang
>         Attachments: nest_struct.data
>
>
> LazySimpleSerDe throws an IndexOutOfBoundsException when a column is a struct that contains an array of structs.
> I have a table with one column defined like this:
> columnA
> array <
>     struct<
>        col1:primiType,
>        col2:primiType,
>        col3:primiType,
>        col4:primiType,
>        col5:primiType,
>        col6:primiType,
>        col7:primiType,
>        col8:array<
>             struct<
>               col1:primiType,
>               col2:primiType,
>               col3:primiType,
>               col4:primiType,
>               col5:primiType,
>               col6:primiType,
>               col7:primiType,
>               col8:primiType,
>               col9:primiType
>             >
>        >
>     >
> >
> In this example, the outer struct has 8 columns (including the array), and the inner struct has 9 columns. As long as the outer struct has fewer columns than the inner struct, I think LazySimpleSerDe will fail with the following stack trace when it tries to serialize a row:
> Caused by: java.lang.IndexOutOfBoundsException: Index: 8, Size: 8
>     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>     at java.util.ArrayList.get(ArrayList.java:322)
>     at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
>     at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:443)
>     at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:381)
>     at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:365)
>     at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:568)
>     at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>     at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
>     at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>     at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:132)
>     at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>     at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83)
>     at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>     at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:531)
>     ... 9 more
> I am not sure of the exact cause of this problem. I believe that the method
>     public static void serialize(ByteStream.Output out, Object obj, ObjectInspector objInspector, byte[] separators, int level, Text nullSequence, boolean escaped, byte escapeChar, boolean[] needsEscape)
> invokes itself recursively when it encounters a nested structure. For nested structs, however, the list reference gets mixed up, and size() returns the wrong value.
> In the case I hit, for these 2 lines:
>       List fields = soi.getAllStructFieldRefs();
>       list = soi.getStructFieldsDataAsList(obj);
> my StructObjectInspector (soi) returns the CORRECT data from both getAllStructFieldRefs() and getStructFieldsDataAsList(). For example, for one row, the outer 8-column struct has 2 elements in its inner array of structs, and each element has 9 columns (as there are 9 columns in the inner struct). At runtime, after I added more logging to LazySimpleSerDe, I saw the following behavior in the log:
> for 8 outside column, loop
> for 9 inside columns, loop for serialize
> for 9 inside columns, loop for serialize
> code broken here, for the outside loop, it will try to access the 9th element, which does not exist in the outside loop, as you can see in the stack trace: it tried to access index 8 of a list of size 8.
> What I did was change the following line of code, which appears to fix the problem.
> I don't know whether it is the right fix, but it did resolve the problem. I made the change against the Hive 0.9.0 code:
> 481c481,482
> <         for (int i = 0; i < list.size(); i++) {
> ---
> >         int listSize = list.size();
> >         for (int i = 0; i < listSize; i++) {
> I believe the cause of this bug is that, with the current code,
> for (int i = 0; i < list.size(); i++)
> list.size() is invoked on every iteration. In a nested structure, list.size() returns a different result after the recursive call, and that causes the problem I am facing.
> Thanks
> Yong Zhang
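
The loop-bound hypothesis in the report can be illustrated with a small standalone sketch. This is not Hive code: the class SharedListLoopDemo, the SHARED list, and the helpers fieldsDataAsList and run are all hypothetical, and the sketch assumes (the report does not confirm this) that the same list instance is handed back for every struct, so a recursive call overwrites it. Structs are modeled as plain Object[] values, with a 2-field outer struct holding a 3-field inner struct.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SharedListLoopDemo {
    // Assumption for illustration only: one list instance is reused for
    // every struct, so a nested call overwrites the caller's view of it.
    private static final List<Object> SHARED = new ArrayList<>();

    // Hypothetical stand-in for a "fields data as list" accessor.
    static List<Object> fieldsDataAsList(Object[] struct) {
        SHARED.clear();
        SHARED.addAll(Arrays.asList(struct));
        return SHARED;
    }

    // Serializes a struct modeled as Object[]; nested structs are Object[] elements.
    static void serialize(Object[] struct, boolean cacheSize, StringBuilder out) {
        int fieldCount = struct.length;           // plays the role of fields.size()
        List<Object> list = fieldsDataAsList(struct);
        int listSize = list.size();               // read once, before any recursion
        for (int i = 0; i < (cacheSize ? listSize : list.size()); i++) {
            if (i >= fieldCount) {
                // Mirrors the failing fields.get(i) lookup in the report.
                throw new IndexOutOfBoundsException("Index: " + i + ", Size: " + fieldCount);
            }
            Object value = list.get(i);
            if (value instanceof Object[]) {
                out.append('[');
                serialize((Object[]) value, cacheSize, out); // overwrites SHARED
                out.append(']');
            } else {
                out.append(value);
            }
            out.append(',');
        }
    }

    // Runs the demo with or without the cached loop bound.
    static String run(boolean cacheSize) {
        Object[] inner = {1, 2, 3};       // 3 fields
        Object[] outer = {"a", inner};    // 2 fields, fewer than the inner struct
        StringBuilder out = new StringBuilder();
        try {
            serialize(outer, cacheSize, out);
            return out.toString();
        } catch (IndexOutOfBoundsException e) {
            return "IOOBE: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // IOOBE: Index: 2, Size: 2
        System.out.println(run(true));  // a,[1,2,3,],
    }
}
```

With the bound re-read on every iteration, the outer loop sees the inner struct's larger size after the recursive call returns and runs one step too far. Reading the size once avoids the exception in this sketch, but only because the nested struct is the last field; if the list really is shared, later fields would still be read from the overwritten list, so caching the bound may mask the problem rather than remove it.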