spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Anthony Andras <programminggee...@gmail.com>
Subject Spark SQL / Parquet - Dynamic Schema detection
Date Mon, 14 Mar 2016 16:35:15 GMT
Hello there,

I am trying to write a program in Spark that is attempting to load multiple
json files (with undefined schemas) into a dataframe and then write it out
to a parquet file. When doing so, I am running into a number of garbage
collection issues as a result of my JVM running out of heap space. The heap
space that I have on both my driver and executors seems reasonable
considering the size of the files that I am reading in. I was wondering if
anyone could provide me some insight if this could be due to the schema
inference that the parquet conversion is doing (am I trying to do something
that it is not designed to do) of if I should be looking elsewhere for
performance issues? If it is due to schema inference, is there anything I
should be looking at that might alleviate the pressure on the garbage
collector and heap?

<code snippet>

val fileRdd = sc.textFile(new Path("hdfsDirectory/*").toString)

val df = sqlContext.read.json(fileRdd)

df.write.parquet("hdfsDirectory/file.parquet")

</code snippet>

Each json file is of a single object and has the potential to have variance
in the schema. They are around the size of about 3-5KB each. My JVM heap
memory for each executor and driver is around 1GB. The number of files I am
trying to take in pushes quite a few thousand. One approach I had
considered was to merge these files together into a single json document
then attempt to load the merged document into a dataframe, then write out a
parquet file, but I am not sure this will be of any benefit.

Any advice or guidance that anyone could provide me would be greatly
appreciated,

Mime
View raw message