arrow-dev mailing list archives

From Abdeali Kothari <abdealikoth...@gmail.com>
Subject Re: OversizedAllocationException for pandas_udf in pyspark
Date Fri, 01 Mar 2019 17:40:26 GMT
Is there a limitation that a single column cannot be more than 1-2 GB?
One of my columns would definitely be around 1.5 GB of memory.

I cannot split my DF into more partitions, as I have only 1 ID and I'm
grouping by that ID - so the UDAF would only ever run on a single pandas
DataFrame. I do have a requirement to pass a very large DF to this UDAF
(8 GB, as I mentioned above) and am trying to figure out what I need to do
here to make this work.
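
The only restructuring I can see would be salting the ID so each group is
split into smaller pandas DataFrames - but that assumes the aggregation can
be computed in two stages, which doesn't fit my case since the UDAF needs
the whole group at once. A rough sketch of that idea (partial_agg and
combine_agg stand in for hypothetical two-stage versions of my UDAF):

    import pyspark.sql.functions as F

    N = 16  # number of sub-groups to split each id into
    salted = df.withColumn("salt", (F.rand() * N).cast("int"))

    # Stage 1: partial aggregate per (id, salt) - each group is ~1/N the size
    partials = salted.groupby("id", "salt").apply(partial_agg)

    # Stage 2: combine the partial results per id
    result = partials.groupby("id").apply(combine_agg)
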
Increasing RAM, etc. is no issue (I understand I'd need huge executors, as I
have a huge data requirement). But I'm trying to figure out how much to
actually get, because 20 GB of RAM for the executor is also erroring out,
where I thought ~10 GB would have been enough.



On Fri, Mar 1, 2019 at 10:25 PM Uwe L. Korn <uwelk@xhochy.com> wrote:

> Hello Abdeali,
>
> a problem here could be that a single column of your dataframe is using
> more than 2 GB of RAM (possibly even just 1 GB). Try splitting your
> DataFrame into more partitions before applying the UDAF.
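>
> For example, something along these lines (just a sketch - my_udaf and an
> extra sub_key grouping column are placeholders for whatever fits your
> data):
>
>     df = df.repartition(200)  # more, smaller partitions
>     result = df.groupby("id", "sub_key").apply(my_udaf)  # finer key -> smaller groups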
>
> Cheers
> Uwe
>
> On Fri, Mar 1, 2019, at 9:09 AM, Abdeali Kothari wrote:
> > I was using Arrow with Spark + Python, and when trying some pandas-UDAF
> > functions I am getting this error:
> >
> > org.apache.arrow.vector.util.OversizedAllocationException: Unable to expand the buffer
> >   at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:457)
> >   at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1188)
> >   at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1026)
> >   at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:256)
> >   at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:122)
> >   at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:87)
> >   at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply$mcV$sp(ArrowPythonRunner.scala:84)
> >   at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply(ArrowPythonRunner.scala:75)
> >   at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply(ArrowPythonRunner.scala:75)
> >   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1380)
> >   at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2.writeIteratorToStream(ArrowPythonRunner.scala:95)
> >   at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
> >   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991)
> >   at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
> >
> > I was initially getting an insufficient-RAM error, and worked out
> > theoretically (with no compression) that the pandas DataFrame it would
> > try to create would be ~8GB (21 million records, with each record
> > having ~400 bytes). I have increased my executor memory to 20GB per
> > executor, but am now getting this error from Arrow.
> > Looking for some pointers so I can understand this issue better.
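> >
> > (For reference, a sketch of how the executor memory is being set,
> > assuming the session is built from Python:
> >
> >     from pyspark.sql import SparkSession
> >     spark = SparkSession.builder \
> >         .config("spark.executor.memory", "20g") \
> >         .getOrCreate()
> > )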
> >
> > Here's what I am trying. I have 2 tables with string columns where the
> > strings always have a fixed length:
> > *Table 1*:
> >     id: integer
> >     char_column1: string (length = 30)
> >     char_column2: string (length = 40)
> >     char_column3: string (length = 10)
> >     ...
> > In total, in table1, the char-columns have ~250 characters
> >
> > *Table 2*:
> >     id: integer
> >     char_column1: string (length = 50)
> >     char_column2: string (length = 3)
> >     char_column3: string (length = 4)
> >     ...
> > In total, in table2, the char-columns have ~150 characters
> >
> > I am joining these tables by ID. In my current dataset, I have filtered
> > my data so only id=1 exists.
> > Table1 has ~400 records for id=1 and table2 has 50k records for id=1.
> > Hence, the total number of records (after joining) for table1_join2 =
> > 400 * 50k = 20*10^6 records.
> > Each row has ~400 bytes (150+250) => overall memory = 8*10^9 bytes => ~8GB
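> >
> > As a quick sanity check, the shape of the job and the size arithmetic
> > (my_udaf is a placeholder for the actual UDAF):
> >
> >     joined = table1.join(table2, on="id")
> >     result = joined.groupby("id").apply(my_udaf)
> >
> >     rows = 400 * 50000          # join fan-out for id=1 -> 20,000,000 rows
> >     bytes_per_row = 250 + 150   # ~400 bytes of string data per row
> >     print(rows * bytes_per_row / 1e9)  # ~8.0 GB, before any overhead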
> >
> > Now, when I try an executor with 20GB RAM, it does not work.
> > Is there some data duplication happening internally? What should I
> > estimate as the RAM I need to give for this to work?
> >
> > Thanks for reading,
> >
>
