arrow-dev mailing list archives

From "Uwe L. Korn" <uw...@xhochy.com>
Subject Re: OversizedAllocationException for pandas_udf in pyspark
Date Fri, 01 Mar 2019 16:55:01 GMT
Hello Abdeali,

A problem here could be that a single column of your DataFrame is using more
than 2 GB of RAM (possibly even just 1 GB). Try splitting your DataFrame into
more partitions before applying the UDAF.
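To see why more partitions help, here is a back-of-envelope sketch (editor's sketch: the 2 GB cap comes from Arrow buffers being indexed with a 32-bit signed integer; the row count and row width are the figures quoted below in this thread):

```python
# Back-of-envelope partition sizing: a single Arrow value buffer is indexed
# with a 32-bit signed int, so it caps out near 2 GB. Splitting the data into
# more partitions keeps each per-partition buffer under that cap.
# (Sketch only; row counts and byte widths are taken from this thread.)
BUFFER_LIMIT = 2**31 - 1  # max bytes in one Arrow buffer (~2 GB)

def min_partitions(num_rows: int, bytes_per_row: int) -> int:
    """Smallest partition count so each partition's data fits in one buffer."""
    total = num_rows * bytes_per_row
    return -(-total // BUFFER_LIMIT)  # ceiling division

rows = 400 * 50_000   # 20 million rows after the join described below
row_width = 400       # ~250 + ~150 bytes of string data per row

parts = min_partitions(rows, row_width)
print(parts)  # 4 -- split into at least 4 partitions before the UDAF
```

In PySpark that would look roughly like `df.repartition(parts)` before applying the UDAF (illustrative, not tested here). One caveat, hedged: as far as I understand, a *grouped* pandas UDAF still materializes each group as a single Arrow batch, so repartitioning only helps if no individual group exceeds the limit on its own.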

Cheers
Uwe

On Fri, Mar 1, 2019, at 9:09 AM, Abdeali Kothari wrote:
> I was using Arrow with Spark + Python, and when trying some pandas UDAF
> functions I got this error:
> 
> org.apache.arrow.vector.util.OversizedAllocationException: Unable to expand the buffer
> at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:457)
> at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1188)
> at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1026)
> at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:256)
> at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:122)
> at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:87)
> at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply$mcV$sp(ArrowPythonRunner.scala:84)
> at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply(ArrowPythonRunner.scala:75)
> at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply(ArrowPythonRunner.scala:75)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1380)
> at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2.writeIteratorToStream(ArrowPythonRunner.scala:95)
> at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991)
> at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
> 
> I was initially getting an insufficient-RAM error. Estimating theoretically
> (with no compression), I realized the pandas DataFrame it would try to
> create would be ~8 GB (20 million records, each ~400 bytes). I have
> increased my executor memory to 20 GB per executor, but am now getting this
> error from Arrow instead.
> Looking for some pointers so I can understand this issue better.
> 
> Here's what I am trying. I have 2 tables with string columns where the
> strings always have a fixed length:
> *Table 1*:
>    id: integer
>    char_column1: string (length = 30)
>    char_column2: string (length = 40)
>    char_column3: string (length = 10)
>    ...
> In total, the char columns in table1 hold ~250 characters per row.
> 
> *Table 2*:
>    id: integer
>    char_column1: string (length = 50)
>    char_column2: string (length = 3)
>    char_column3: string (length = 4)
>    ...
> In total, the char columns in table2 hold ~150 characters per row.
> 
> I am joining these tables by ID. In my current dataset, I have filtered my
> data so only id=1 exists.
> Table1 has ~400 records for id=1 and table2 has 50k records for id=1.
> Hence, the total number of records after joining (table1_join2) = 400 * 50k
> = 20*10^6 records.
> Each row has ~400 bytes (250 + 150) => overall memory = 8*10^9 bytes => ~8 GB.
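The arithmetic above can be reproduced with a quick script (editor's sketch, using only the figures from this message):

```python
# Reproduce the back-of-envelope memory estimate from the message above.
table1_rows = 400          # records with id=1 in table1
table2_rows = 50_000       # records with id=1 in table2
bytes_per_row = 250 + 150  # char columns of table1 + table2, in bytes

joined_rows = table1_rows * table2_rows   # cartesian blow-up within the key
total_bytes = joined_rows * bytes_per_row # uncompressed string data

print(joined_rows)          # 20000000  (20 * 10^6 records)
print(total_bytes)          # 8000000000  (~8 GB)
print(total_bytes / 10**9)  # 8.0
```

Note that 8×10⁹ bytes comfortably exceeds Arrow's ~2 GB (2³¹−1 byte) per-buffer cap, so if all of this lands in one Arrow record batch, an OversizedAllocationException like the one above is plausible.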
> 
> Now, when I try an executor with 20 GB RAM, it does not work.
> Is there some data duplication happening internally? How much RAM should I
> estimate I need for this to work?
> 
> Thanks for reading,
>
