spark-issues mailing list archives

From "Nick Pentreath (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-21958) Attempting to save large Word2Vec model hangs driver in constant GC.
Date Mon, 11 Sep 2017 07:52:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-21958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160862#comment-16160862 ]

Nick Pentreath commented on SPARK-21958:
----------------------------------------

Seems like your proposal could improve things - but yeah, let's see what your testing results
show.

> Attempting to save large Word2Vec model hangs driver in constant GC.
> --------------------------------------------------------------------
>
>                 Key: SPARK-21958
>                 URL: https://issues.apache.org/jira/browse/SPARK-21958
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.2.0
>         Environment: Running spark on yarn, hadoop 2.7.2 provided by the cluster
>            Reporter: Travis Hegner
>              Labels: easyfix, patch, performance
>
> In the new version of Word2Vec, the model saving was modified to estimate an appropriate
> number of partitions based on the kryo buffer size. This is a great improvement, but there
> is a caveat for very large models.
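> For reference, the estimate amounts to roughly "approximate serialized size / kryo buffer
> size". A minimal sketch of that calculation (the constants and the helper name here are
> illustrative assumptions, not the actual Spark internals):
> {code:scala}
> // Rough partition estimate: choose enough partitions that each
> // partition's serialized payload fits within a single kryo buffer.
> def estimateNumPartitions(bufferSizeInBytes: Long,
>                           numWords: Long,
>                           vectorSize: Int): Int = {
>   val bytesPerFloat = 4L
>   val avgWordBytes = 15L // assumed average word length in bytes
>   val approxTotalBytes = (bytesPerFloat * vectorSize + avgWordBytes) * numWords
>   ((approxTotalBytes / bufferSizeInBytes) + 1).toInt
> }
> {code}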
> The {{(word, vector)}} tuple goes through a transformation to a local case class of
> {{Data(word, vector)}}... I can only assume this is for the kryo serialization process. The
> new version of the code iterates over the entire vocabulary to do this transformation in
> the driver's heap (the old version wrapped the entire datum), only to have the result then
> distributed to the cluster to be written into its parquet files.
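> For illustration, a minimal sketch of the pattern described above ({{Data}},
> {{wordVectors}}, and the surrounding names are simplified stand-ins, not the exact
> Word2Vec internals):
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> case class Data(word: String, vector: Array[Float])
>
> def save(spark: SparkSession,
>          wordVectors: Map[String, Array[Float]],
>          numPartitions: Int,
>          dataPath: String): Unit = {
>   // Driver-side: one Data instance is allocated per vocabulary entry
>   // *before* anything leaves the driver, so a multi-million word
>   // vocabulary churns the driver heap.
>   val dataSeq = wordVectors.toSeq.map { case (w, v) => Data(w, v) }
>
>   // Only after that transformation is the result distributed and
>   // written out as parquet.
>   spark.createDataFrame(dataSeq)
>     .repartition(numPartitions)
>     .write.parquet(dataPath)
> }
> {code}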
> With extremely large vocabularies (~2 million docs, with uni-grams, bi-grams, and
> tri-grams), that local driver transformation is causing the driver to hang indefinitely in
> GC, as I can only assume it's generating millions of short-lived objects which can't be
> collected fast enough.
> Perhaps I'm overlooking something, but it seems to me that since the result is distributed
> over the cluster to be saved _after_ the transformation anyway, we may as well distribute
> it _first_, allowing the cluster resources to do the transformation more efficiently, and
> then write the parquet files from there.
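> A sketch of the distribute-first alternative, under the same simplified, hypothetical
> names as above:
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> case class Data(word: String, vector: Array[Float])
>
> def saveDistributed(spark: SparkSession,
>                     wordVectors: Map[String, Array[Float]],
>                     numPartitions: Int,
>                     dataPath: String): Unit = {
>   import spark.implicits._
>
>   // Parallelize the raw pairs first; the per-record Data allocation
>   // then happens on the executors, spreading the allocation and GC
>   // load across the cluster instead of concentrating it on the driver.
>   spark.sparkContext
>     .parallelize(wordVectors.toSeq, numPartitions)
>     .map { case (w, v) => Data(w, v) }
>     .toDS()
>     .write.parquet(dataPath)
> }
> {code}
> The raw tuples are still shipped from the driver by {{parallelize}}, but the per-record
> object churn moves to the executors, which is the crux of the proposal.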
> I have a patch implemented, and am in the process of testing it at scale. I will open a
> pull request when I feel that the patch successfully resolves the issue, and after making
> sure that it passes unit tests.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

