spark-dev mailing list archives

From Asher Krim <ak...@hubspot.com>
Subject Spark 1.6.3 Driver OOM on createDataFrame
Date Sun, 22 Jan 2017 17:35:11 GMT
Hi All,

There seems to be a bug in Spark 1.6.3 which causes the driver to OOM when
creating a dataframe using a lot of data in memory on the driver. Examining
a heap dump, it looks like the driver is filled with multiple copies of the
data. The following Java code reproduces the bug:

  public void run() {
    try (JavaSparkContext sc = getSparkContext()) {
      SQLContext sqlContext = new SQLContext(sc);

      DataFrame df = sqlContext.createDataFrame(generateData().stream()
              .map(floats -> RowFactory.create(floats))
              .collect(Collectors.toList()),
          DataTypes.createStructType(new StructField[] { VECTOR_FIELD }));

      LOG.info("successfully parallelized {} rows", df.count());
    }

  }

  private List<List<Float>> generateData() {
    // 3,000,000 rows of 300 random floats each, all held on the driver.
    List<List<Float>> data = new ArrayList<>(3_000_000);
    for (int i = 0; i < 3_000_000; i++) {
      List<Float> row = new ArrayList<>(300);
      for (int j = 0; j < 300; j++) {
        row.add(random.nextFloat());
      }
      data.add(row);
    }
    return data;
  }
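
For a sense of scale, here is a rough back-of-the-envelope estimate of how much heap one copy of the generated data takes. This is only a sketch: the class name HeapEstimate is hypothetical, and the per-object sizes (16 bytes per boxed Float, 4 bytes per reference) are assumptions typical of a 64-bit HotSpot JVM with compressed oops, not measurements from the heap dump. Per-row ArrayList overhead is ignored.

```java
// Rough heap estimate for the List<List<Float>> built by generateData().
// Assumed sizes (typical 64-bit HotSpot with compressed oops, not measured):
//   boxed Float = 16 bytes, object reference = 4 bytes.
final class HeapEstimate {
    static final long BOXED_FLOAT_BYTES = 16;    // assumption
    static final long REFERENCE_BYTES = 4;       // assumption
    static final long PRIMITIVE_FLOAT_BYTES = 4; // size of a Java float

    // Bytes for rows x cols boxed Floats plus one list reference per element.
    static long boxedBytes(long rows, long cols) {
        return rows * cols * (BOXED_FLOAT_BYTES + REFERENCE_BYTES);
    }

    // Bytes for the same values stored as primitive floats, for comparison.
    static long primitiveBytes(long rows, long cols) {
        return rows * cols * PRIMITIVE_FLOAT_BYTES;
    }

    public static void main(String[] args) {
        long rows = 3_000_000;
        long cols = 300;
        System.out.printf("boxed:     ~%.1f GB%n", boxedBytes(rows, cols) / 1e9);
        System.out.printf("primitive: ~%.1f GB%n", primitiveBytes(rows, cols) / 1e9);
    }
}
```

Under these assumptions a single copy of the boxed data is already about 18 GB (versus about 3.6 GB as primitive floats), so even one additional copy made during createDataFrame would not fit in a 28g heap. That would be consistent with the multiple copies seen in the heap dump, though it does not identify where the copies come from.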


Increasing the driver memory to insane values (28g) doesn't help. I tested
in Spark 2 and the problem seems to have been solved; however, I'm not sure
which issue fixed it. I assume it's one of these:
https://issues.apache.org/jira/browse/SPARK-12511?jql=project%20%3D%20SPARK%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%202.0.0%20AND%20text%20~%20%22OOM%22

This is an issue because some machine learning models are represented as
large-ish local data structures on the driver, so the bug is hit while
attempting to save them. Unfortunately, using mllib instead of ml is not an
option, since some mllib algorithms (such as word2vec and LDA) also rely on
the dataframe API for persisting the model, even though mllib is supposed
to be based on RDDs. This makes these algorithms unusable for anything
larger than toy examples in Spark versions before 2.

If anyone is familiar with this bug, I would really appreciate it if they
could point me in the direction of the PR that fixed it.

Is a 1.6.4 release planned?
Would it be possible to backport the dataframe bugfix?

Thanks,
Asher Krim
Senior Software Engineer
