mahout-dev mailing list archives

From "Ankur (JIRA)" <>
Subject [jira] Commented: (MAHOUT-565) Features incorrectly hashed in Minhash
Date Thu, 16 Dec 2010 13:19:01 GMT


Ankur commented on MAHOUT-565:

> ...The shifted-in bits don't matter right?
You are right. This change is NOT needed. Masking is only needed when reassembling an
integer from its bytes. Somewhere else (not in Mahout's code) I was mangling the bytes
when converting them back to an integer, so I added the mask here out of caution. This
particular change can be discarded.
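
To illustrate the point, here is a minimal sketch of the round trip. It is not Mahout's code; the class and method names are hypothetical. It assumes a big-endian 4-byte layout and shows that narrowing to byte needs no mask, while widening a byte back to int does.

```java
final class ByteRoundTrip {

  // int -> bytes: the (byte) cast keeps only the low 8 bits of each shifted
  // value, so whatever bits the shift brings in above them are discarded.
  static byte[] toBytes(int value) {
    return new byte[] {
      (byte) (value >> 24),
      (byte) (value >> 16),
      (byte) (value >> 8),
      (byte) value
    };
  }

  // bytes -> int: each byte is sign-extended when promoted to int, so it
  // must be masked with 0xFF before being shifted and OR-ed together.
  static int fromBytes(byte[] b) {
    return ((b[0] & 0xFF) << 24)
         | ((b[1] & 0xFF) << 16)
         | ((b[2] & 0xFF) << 8)
         |  (b[3] & 0xFF);
  }

  // The kind of mistake described above: without the mask, a negative byte
  // sets all the higher bits of the result.
  static int fromBytesUnmasked(byte[] b) {
    return (b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3];
  }
}
```

For example, 255 survives toBytes followed by fromBytes, whereas fromBytesUnmasked returns -1 for the same bytes because the low byte 0xFF is sign-extended.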

> The formatting changes are fine IMHO
Thanks. I set up the code template mentioned in "How to Contribute".

> There are several other changes in this patch, is that intended?
There are 2 noteworthy changes:
1. Concatenating hash signatures in a sliding-window fashion. This ensures that an item
falls into as many buckets as the number of hash signatures selected, giving it more room
to collide with similar items.
2. Fixing the test case in TestMinHashClustering, which was missing evaluation of the last cluster.
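
The sliding-window idea in item 1 can be sketched roughly as follows. This is an illustration only, not the code in the patch: the class name, the method name, the "-" separator, and the wrap-around at the end of the signature array are all assumptions, chosen so that the number of keys equals the number of hash signatures.

```java
import java.util.ArrayList;
import java.util.List;

final class SlidingWindowKeys {

  // Builds one bucket key per min-hash value. The key starting at position i
  // concatenates keyGroups consecutive values, wrapping around the end of the
  // array (an assumption) so that exactly minHashes.length keys are produced.
  static List<String> clusterKeys(int[] minHashes, int keyGroups) {
    int n = minHashes.length;
    List<String> keys = new ArrayList<>(n);
    for (int start = 0; start < n; start++) {
      StringBuilder key = new StringBuilder();
      for (int j = 0; j < keyGroups; j++) {
        if (j > 0) {
          key.append('-');
        }
        key.append(minHashes[(start + j) % n]);
      }
      keys.add(key.toString());
    }
    return keys;
  }
}
```

With the signatures {11, 22, 33, 44} and a window of 2, this yields the four keys 11-22, 22-33, 33-44 and 44-11, so the item lands in four buckets instead of the two that non-overlapping groups would give.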

I haven't had time to write up the Mahout documentation for this. I also need to think
about using the results in a recommendations context. Any suggestions?

> Features incorrectly hashed in Minhash
> --------------------------------------
>                 Key: MAHOUT-565
>                 URL:
>             Project: Mahout
>          Issue Type: Bug
>    Affects Versions: 0.4
>            Reporter: Ankur
>            Assignee: Ankur
>         Attachments: jira-565.v1.patch
> Given a feature vector for which a minhash signature is desired, each feature id (an integer)
is converted to a byte array through a series of bit shift operations. The current implementation
of these operations doesn't mask the bits being shifted, resulting in the sign bit being shifted.

