spark-issues mailing list archives

From "Simeon Simeonov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-10574) HashingTF should use MurmurHash3
Date Tue, 19 Apr 2016 21:00:27 GMT

    [ https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248627#comment-15248627 ]

Simeon Simeonov commented on SPARK-10574:
-----------------------------------------

[~josephkb] I agree that it would be an improvement. The issue I see with the current patch
is that it would force an incompatible API change in the future (specifying hashing functions
as objects rather than by name). If we make just this one change, everything else can be handled
with no API changes: seeds become constructor parameters or closure variables available to the
hashing function, and collision detection becomes mere decoration.
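
For example, here is a minimal sketch (the names are my own invention, not code from the PR)
of how seeding and collision detection stay out of the public API once the hashing function
is just a value:

{code:scala}
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// A seed is just a closure variable; no extra API surface is needed.
def seededMurmur(seed: Int): Any => Int = {
  case s: String => MurmurHash3.stringHash(s, seed)
  case other     => MurmurHash3.mix(seed, other.##) // illustrative fallback for non-strings
}

// Collision detection is just decoration around any Any => Int.
def withBucketCounts(h: Any => Int, counts: mutable.Map[Int, Int]): Any => Int =
  (term: Any) => {
    val bucket = h(term)
    counts(bucket) = counts.getOrElse(bucket, 0) + 1
    bucket
  }
{code}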

That's my practical argument related to MLlib. 

Beyond that, there are multiple arguments related to the usability, testability and maintainability
of the Spark codebase, which sees high code-change velocity from a large number of committers
and, as a result, a high issue rate. The simplest way to battle this is one design decision
at a time. The PR hard-codes what is essentially a strategy pattern into the implementation
of an object. It conflates responsibilities, and it introduces branching that makes testing
and documentation more complicated. If hashing functions were externalized, they could be tested
trivially. If {{HashingTF}} took a {{Function1[Any, Int]}} as input, it could likewise be tested
much more simply, with any function. The APIs also become simpler to document because they do
one thing. Etc.
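
As a sketch of what I mean (a hypothetical stripped-down class, not the actual MLlib
{{HashingTF}}):

{code:scala}
import scala.util.hashing.MurmurHash3

// Hypothetical transformer: the hashing strategy is an ordinary parameter,
// so tests can inject any deterministic Any => Int.
class SimpleHashingTF(
    numFeatures: Int,
    hash: Any => Int = term => MurmurHash3.stringHash(term.toString)) {

  // Map a term to a non-negative bucket index.
  private def indexOf(term: Any): Int = {
    val raw = hash(term) % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Term frequencies as a sparse map from bucket index to count.
  def transform(terms: Iterable[Any]): Map[Int, Double] =
    terms.foldLeft(Map.empty[Int, Double]) { (freq, term) =>
      val i = indexOf(term)
      freq.updated(i, freq.getOrElse(i, 0.0) + 1.0)
    }
}
{code}

A test can then pass a trivial function, e.g. a constant, and assert on bucket placement
without reasoning about hash internals.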

Perhaps I'm only seeing the benefits of externalizing the hashing strategy and missing the
complexity in what I propose? We have ample examples of Spark APIs that take functions as inputs,
so there are standard ways to handle this across languages. We don't need a custom trait if
we stick to {{Any}} as the hashing function's input type. What else could be a problem?

> HashingTF should use MurmurHash3
> --------------------------------
>
>                 Key: SPARK-10574
>                 URL: https://issues.apache.org/jira/browse/SPARK-10574
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>    Affects Versions: 1.5.0
>            Reporter: Simeon Simeonov
>            Assignee: Yanbo Liang
>              Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses Scala's native {{##}} hashing implementation. There are two significant
> problems with this.
> First, per the [Scala documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any]
> for {{hashCode}}, the implementation is platform-specific. This means that feature vectors
> created on one platform may differ from vectors created on another. This can create significant
> problems when a model trained offline is used in another environment for online prediction.
> The problem is made harder by the fact that, after a hashing transform, features lose their
> human-tractable meaning, so a problem such as this may be extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, exhibiting
> [200-500% higher collision rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than,
> for example, [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$],
> which is also included in the standard Scala library and is the hashing choice of fast learners
> such as Vowpal Wabbit, scikit-learn and others. If Spark users apply {{HashingTF}} only to very
> short, dictionary-like strings, the choice of hashing function will not be a big problem, but
> why ship an implementation in MLlib with this limitation when a better one is readily available
> in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that this is a good
> change, I can prepare a PR.
> Note that changing the hash function would mean that models saved with a previous version
> would have to be re-trained. This introduces a problem that is orthogonal to breaking API
> changes: breaking changes related to artifacts, e.g., a saved model, produced by a previous
> version. Is there a policy or best practice currently in effect about this? If not, perhaps
> we should come up with a few simple rules about how we communicate such changes in release
> notes, etc.
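
To illustrate the contrast described in the quoted report above using nothing but the standard
library (a sketch; the input string is arbitrary):

{code:scala}
import scala.util.hashing.MurmurHash3

val term = "the quick brown fox"

// Scala's native hash: per the Scala docs, hashCode/## is platform-specific
// in general, which is what makes saved models fragile across environments.
val native: Int = term.##

// MurmurHash3 from the same standard library: a fixed, well-mixed algorithm,
// so the same input hashes identically everywhere.
val murmur: Int = MurmurHash3.stringHash(term)
{code}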


