spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-2465) Use long as user / item ID for ALS
Date Mon, 14 Jul 2014 19:42:04 GMT

    [ https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061107#comment-14061107 ]

Sean Owen commented on SPARK-2465:
----------------------------------

Forgot to add: when I've implemented this before and used longs for IDs, we used a simple
zig-zag variable-length encoding for the integers. Often the IDs really were numbers and so
tended to be small, so an 8-byte long might take only a few bytes on disk. Some serialization
frameworks, like protobuf, do this kind of thing automatically; we wrote it by hand in Writables.
I know Java serialization doesn't do this, but I don't know about Kryo. Anyway, if the serialized
size is the issue (and it's not the only issue), there may be ways of getting around it. It doesn't
help if the values really are hashes, since hashes are spread over the whole range of integers.
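
To make the encoding concrete, here is a minimal sketch of a zig-zag variable-length codec for longs. It is not the Writable code referred to above; the class and method names (ZigZagVarint, encode, decode) are hypothetical and chosen only for illustration. The idea is to map signed values to unsigned ones so that small magnitudes get small codes, then write 7 bits per byte with the high bit marking "more bytes follow".

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch only: names are hypothetical, not from Spark or Hadoop.
public final class ZigZagVarint {
    private ZigZagVarint() {}

    // Zig-zag mapping: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    public static long zigZag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    // Inverse of zigZag.
    public static long unZigZag(long z) {
        return (z >>> 1) ^ -(z & 1);
    }

    // Writes the zig-zagged value 7 bits at a time, low bits first.
    // The high bit of each byte is set when more bytes follow.
    public static byte[] encode(long value) {
        long z = zigZag(value);
        ByteArrayOutputStream out = new ByteArrayOutputStream(10);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }

    // Reads bytes until one without the continuation bit is seen.
    public static long decode(byte[] bytes) {
        long z = 0;
        int shift = 0;
        for (byte b : bytes) {
            z |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return unZigZag(z);
    }

    public static void main(String[] args) {
        long[] ids = {0L, 1L, -1L, 300L, 1_000_000L, Long.MAX_VALUE, Long.MIN_VALUE};
        for (long id : ids) {
            byte[] enc = encode(id);
            System.out.println(id + " -> " + enc.length + " byte(s), round-trips: "
                    + (decode(enc) == id));
        }
    }
}
```

Small IDs such as 300 fit in two bytes here, while values near the extremes of the long range take ten bytes, which is why this does not help when the IDs are hashes.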

> Use long as user / item ID for ALS
> ----------------------------------
>
>                 Key: SPARK-2465
>                 URL: https://issues.apache.org/jira/browse/SPARK-2465
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.0.1
>            Reporter: Sean Owen
>            Priority: Minor
>         Attachments: Screen Shot 2014-07-13 at 8.49.40 PM.png
>
>
> I'd like to float this for consideration: use longs instead of ints for user and product
IDs in the ALS implementation.
> The main reason is that identifiers are not generally numeric at all, and will be
hashed to an integer. (This is a separate issue.) Hashing to 32 bits means collisions are
likely after hundreds of thousands of users and items, which is not unrealistic. Hashing to
64 bits pushes this back to billions.
> It would also mean numeric IDs that happen to be larger than the largest int can be used
directly as identifiers.
> On the downside of course: 8 bytes instead of 4 bytes of memory used per Rating.
> Thoughts? I will post a PR so as to show what the change would be.
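
The collision argument in the description can be checked with the standard birthday approximation, P ≈ 1 - exp(-n(n-1) / (2 * 2^bits)). The sketch below assumes a uniformly distributed hash, which real hash functions only approximate; the class and method names are hypothetical.

```java
// Illustrative sketch only: assumes a uniform hash; names are hypothetical.
public final class HashCollisionEstimate {
    private HashCollisionEstimate() {}

    // Approximate probability that at least two of n distinct identifiers
    // hash to the same value in a space of 2^bits values.
    public static double collisionProbability(double n, int bits) {
        double space = Math.pow(2.0, bits);
        // expm1 keeps precision when the probability is very small.
        return -Math.expm1(-n * (n - 1.0) / (2.0 * space));
    }

    public static void main(String[] args) {
        double[] counts = {1e5, 5e5, 1e9, 5e9};
        for (double n : counts) {
            System.out.printf("n=%.0f  32-bit: %.4f  64-bit: %.4g%n",
                    n, collisionProbability(n, 32), collisionProbability(n, 64));
        }
    }
}
```

Under this approximation, a 32-bit hash is more likely than not to collide at around a hundred thousand identifiers, while a 64-bit hash reaches comparable odds only at several billion.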



--
This message was sent by Atlassian JIRA
(v6.2#6252)
