spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-24133) Reading Parquet files containing large strings can fail with java.lang.ArrayIndexOutOfBoundsException
Date Thu, 03 May 2018 11:18:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462287#comment-16462287 ]

Apache Spark commented on SPARK-24133:
--------------------------------------

User 'ala' has created a pull request for this issue:
https://github.com/apache/spark/pull/21227

> Reading Parquet files containing large strings can fail with java.lang.ArrayIndexOutOfBoundsException
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24133
>                 URL: https://issues.apache.org/jira/browse/SPARK-24133
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Ala Luszczak
>            Assignee: Ala Luszczak
>            Priority: Major
>             Fix For: 2.4.0
>
>
> ColumnVectors store string data in one big byte array. Since the array size is capped
at just under Integer.MAX_VALUE, a single ColumnVector cannot store more than 2GB of string
data.
> However, since Parquet files commonly contain large blobs stored as strings, and
ColumnVectors carry 4096 values by default, it's entirely possible to exceed that limit.
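
To make the limit concrete, here is a rough back-of-the-envelope sketch in Java. The class and method names are hypothetical and only illustrate the arithmetic; they are not part of Spark. It assumes the only constraint is that all string bytes of one batch must fit in a single int-indexed byte array.

```java
// Hypothetical helper, not Spark code: estimates when a batch of strings
// no longer fits in one int-indexed byte array.
class BatchSizeEstimate {
    // Largest average value size (in bytes) at which `rows` values still fit.
    static long maxAverageValueBytes(int rows) {
        return Integer.MAX_VALUE / rows;
    }

    // Total bytes needed for the batch, computed as a long so it cannot wrap.
    static long totalBytes(int rows, long averageValueBytes) {
        return (long) rows * averageValueBytes;
    }

    public static void main(String[] args) {
        int rows = 4096; // default number of values per ColumnVector, per the description
        System.out.println("max average bytes per value: " + maxAverageValueBytes(rows));
        System.out.println("512 KiB values exceed the limit: "
            + (totalBytes(rows, 512L * 1024) > Integer.MAX_VALUE));
    }
}
```

With 4096 values per batch, strings averaging about 512 KiB are already enough to pass the limit, which is plausible for blob-like columns.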
> In such cases the required capacity overflows and a negative capacity is requested from WritableColumnVector.reserve(). The
call succeeds (the requested capacity is smaller than what is already allocated), and consequently java.lang.ArrayIndexOutOfBoundsException
is thrown when the reader actually attempts to put the data into the array.
> This behavior is hard for users to troubleshoot. Spark should instead check for a negative
requested capacity in WritableColumnVector.reserve() and throw a more informative error, instructing
the user to tweak the ColumnarBatch size.
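
The failure mode and the proposed check can be sketched in a few lines of Java. This is a simplified stand-in, not the actual WritableColumnVector code or the change in the pull request: the class ReserveSketch, the exception type, and the message wording are all assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical, simplified stand-in for a growable byte buffer with a reserve() method.
class ReserveSketch {
    private byte[] data = new byte[16];

    // Grows the buffer if needed; rejects negative requests, which indicate int overflow.
    void reserve(int requiredCapacity) {
        if (requiredCapacity < 0) {
            // Assumed exception type and wording, chosen only for this sketch.
            throw new IllegalStateException(
                "Cannot reserve a negative capacity (" + requiredCapacity + "). "
                + "The requested size probably overflowed int; consider reducing the batch size.");
        }
        if (requiredCapacity > data.length) {
            data = Arrays.copyOf(data, requiredCapacity);
        }
    }

    int capacity() {
        return data.length;
    }

    public static void main(String[] args) {
        int used = Integer.MAX_VALUE - 10; // bytes already stored
        int incoming = 100;                // size of the next string
        int requested = used + incoming;   // int addition wraps to a negative value
        System.out.println("requested capacity: " + requested);

        ReserveSketch vector = new ReserveSketch();
        try {
            vector.reserve(requested);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Without the first check, the negative request would pass the "already large enough" comparison silently, and the error would only surface later as an ArrayIndexOutOfBoundsException at the write site.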



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

