spark-issues mailing list archives

From "Peng Meng (JIRA)" <>
Subject [jira] [Commented] (SPARK-2465) Use long as user / item ID for ALS
Date Thu, 20 Jul 2017 02:48:01 GMT


Peng Meng commented on SPARK-2465:

I think it is time to revisit this now.  Some of our customers have asked us to
support Long IDs for ALS. They have more than Int.MaxValue products, so Long IDs
in ALS are necessary for them.
What do you think about reopening your PR? [~srowen]

> Use long as user / item ID for ALS
> ----------------------------------
>                 Key: SPARK-2465
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.0.1
>            Reporter: Sean Owen
>            Priority: Minor
>         Attachments: ALS using MEMORY_AND_DISK.png, ALS using MEMORY_AND_DISK_SER.png,
Screen Shot 2014-07-13 at 8.49.40 PM.png
> I'd like to float this for consideration: use longs instead of ints for user and product
IDs in the ALS implementation.
> The main reason is that identifiers are not generally numeric at all, and will be
hashed to an integer. (This is a separate issue.) Hashing to 32 bits means collisions are
likely after hundreds of thousands of users and items, which is not unrealistic. Hashing to
64 bits pushes this back to billions.
> It would also mean numeric IDs that happen to be larger than the largest int can be used
directly as identifiers.
> On the downside of course: 8 bytes instead of 4 bytes of memory used per Rating.
> Thoughts? I will post a PR so as to show what the change would be.
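
The collision argument in the description can be checked with the standard birthday approximation. The following is a minimal sketch, not part of the original issue: it assumes an ideal, uniformly distributed hash, and the function names are hypothetical, chosen only for illustration.

```python
import math


def collision_probability(n_ids: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_ids distinct keys collide.

    Assumes a uniform hash over 2**hash_bits values and uses the birthday
    approximation 1 - exp(-n * (n - 1) / (2 * m)).
    """
    space = 2.0 ** hash_bits
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-n_ids * (n_ids - 1) / (2.0 * space))


def ids_for_probability(p: float, hash_bits: int) -> float:
    """Approximate number of IDs at which the collision chance reaches p."""
    space = 2.0 ** hash_bits
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p)))


if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        p32 = collision_probability(n, 32)
        p64 = collision_probability(n, 64)
        print(f"{n:>9} ids: 32-bit {p32:.4f}, 64-bit {p64:.2e}")
    print("50% point, 32-bit:", round(ids_for_probability(0.5, 32)))
    print("50% point, 64-bit:", round(ids_for_probability(0.5, 64)))
```

Under these assumptions, a 32-bit hash reaches an even chance of at least one collision at roughly 77 thousand IDs, while a 64-bit hash reaches it at roughly 5 billion, which is consistent with the "billions" figure quoted above.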

This message was sent by Atlassian JIRA
