mahout-user mailing list archives

From Ted Dunning
Subject Re: IDs to longs?
Date Tue, 04 Aug 2009 15:19:18 GMT
The standard in large systems I have worked on is to use a side-store to
retain string-to-id mappings.  MapFiles would be fine for this.  It is
actually unusual to even use longs rather than ints.  Reducing the data to
primitives is a sine qua non for large scale.  In fact, the binary case
is typically reduced to a bunch of arrays of ints.  None of the
recommendation algorithms need to know the string form of the ids.
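
As a rough illustration of the side-store idea, here is a minimal
in-memory sketch.  The class and method names (IdDictionary, idFor,
nameFor) are made up for this example and are not Mahout APIs; a real
deployment would persist the two tables (for instance in MapFiles)
rather than keep them on the heap.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: assigns dense int ids to strings in arrival order.
final class IdDictionary {
    // Forward table: string id -> dense int id.
    private final Map<String, Integer> toId = new HashMap<>();
    // Reverse table: position i holds the string that was assigned id i.
    private final List<String> toName = new ArrayList<>();

    // Returns the existing id for name, or assigns the next unused int.
    int idFor(String name) {
        Integer existing = toId.get(name);
        if (existing != null) {
            return existing;
        }
        int id = toName.size();
        toId.put(name, id);
        toName.add(name);
        return id;
    }

    // Maps an int id back to its string; only needed when reporting results.
    String nameFor(int id) {
        return toName.get(id);
    }

    int size() {
        return toName.size();
    }
}
```

The algorithms then work only on the ints, and nameFor is called once
at the end to translate results back to strings.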

That said, using a hash-based system is just fine with 64 bits or more.  The
issue is that you still need to map back to strings at the end, so it doesn't
save you all that much.  See the random indexing literature if you are
worried about collisions.
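
For comparison, a hash-based mapping might look like the sketch below.
It assumes MD5 truncated to its first 8 bytes as the 64-bit hash, which
is just one possible choice, and the names (HashedIds, hash64) are
hypothetical.  Note that the reverse table is still there, which is the
point made above: the hash removes the need to coordinate id
assignment, not the need to store a mapping.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: derives a 64-bit id from the string itself.
final class HashedIds {
    // Reverse table, still required to translate results back to strings.
    private final Map<Long, String> reverse = new HashMap<>();

    // Deterministic 64-bit hash: the first 8 bytes of the MD5 digest.
    static long hash64(String name) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(name.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Hashes the name and records it for reverse lookup.
    // Fails loudly if two different strings map to the same long.
    long idFor(String name) {
        long id = hash64(name);
        String previous = reverse.putIfAbsent(id, name);
        if (previous != null && !previous.equals(name)) {
            throw new IllegalStateException(
                "Hash collision between '" + previous + "' and '" + name + "'");
        }
        return id;
    }

    // Returns the string recorded for this id, or null if it was never seen.
    String nameFor(long id) {
        return reverse.get(id);
    }
}
```

Any component can recompute hash64 independently and get the same
long, so nothing has to be synced up front.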

On Tue, Aug 4, 2009 at 8:09 AM, Sean Owen wrote:

> Maybe I am not thinking this through entirely but I was thinking a
> deterministic mapping from String to long would be preferable, since
> the entire mapping could be recreated from the Strings if needed. If I
> start assigning IDs in order, that mapping has to be saved and synced
> to any component that needs to do the translation. Somehow I am
> guessing that could get tricky. For example a new ID shows up in the
> system in some kind of clustered or distributed system context. Now
> you need to make sure the entire system agrees on which long gets
> assigned to that String.
> But then again you avoid the collision issue -- it has a cost though.
> My gut was that the hash (implicit mapping) was preferable but hadn't
> thought it through entirely. More thoughts?
> But yes I agree the idea is to provide such a component, with an
> in-memory representation and a JDBC-backed representation I imagine.
