mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?
Date Thu, 21 Apr 2011 22:04:46 GMT
It is definitely a reasonable idea to convert data to hashed feature vectors
using map-reduce.

And yes, you can pick a vector length that is long enough that you don't
have to worry about collisions.  You need to examine your data to decide how
large that needs to be, but it isn't hard to do.  The encoding framework
handles the placement of features in the vector for you, so you don't have
to worry about that.
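
As a rough illustration of "examine your data to decide how large that needs
to be", here is a minimal Python sketch. It is not Mahout's encoder API: the
function names, the MD5-based hash, and the default of two probes are all
assumptions made for the example. It hashes each feature name to a few slots
and counts the features that have no slot to themselves.

```python
import hashlib


def probe_slots(feature, size, probes=2):
    """Return the slots a feature name is hashed to, one per probe.

    Hypothetical helper: each probe salts the name with its probe index.
    """
    slots = []
    for p in range(probes):
        digest = hashlib.md5(f"{p}:{feature}".encode("utf-8")).digest()
        slots.append(int.from_bytes(digest[:8], "big") % size)
    return slots


def fully_collided(features, size, probes=2):
    """Count features whose every probe lands on a slot shared with another feature."""
    owners = {}
    for f in features:
        for s in set(probe_slots(f, size, probes)):
            owners.setdefault(s, set()).add(f)
    return sum(
        1
        for f in features
        if all(len(owners[s]) > 1 for s in probe_slots(f, size, probes))
    )


def expected_colliding_pairs(n_features, size):
    """Birthday-style estimate of colliding feature pairs for a single probe."""
    return n_features * (n_features - 1) / (2 * size)
```

For example, expected_colliding_pairs(1000, 1_000_000) is about 0.5, so a
thousand distinct feature values in a million-slot vector should rarely
collide even with one probe. Running fully_collided over the distinct feature
values gathered in a pre-processing pass, for a few candidate sizes, gives an
empirical check of the same thing.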

On Wed, Apr 20, 2011 at 8:03 PM, Stanley Xu <wenhao.xu@gmail.com> wrote:

> Thanks Ted. Since SGD is a sequential method, the vector created for each
> line could be very large without consuming too much memory. Could I assume
> that if we have a limited number of features, or can use map-reduce to
> pre-process the data to find out how many distinct values each category
> has, we could just create a long vector and put different feature values
> into different slots to avoid possible feature collisions?
>
> Thanks,
> Stanley
>
>
>
> On Thu, Apr 21, 2011 at 12:24 AM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > Stanley,
> >
> > Yes.  What you say is correct.  Feature hashing can cause degradation.
> >
> > With multiple hashing, however, you do have a fairly strong guarantee that
> > the feature hashing is very close to information preserving.  This is
> > related to the fact that the feature hashing operation is a random linear
> > transformation.  Since we are hashing to something that is still quite a
> > high dimensional space, the information loss is likely to be minimal.
> >
> > On Wed, Apr 20, 2011 at 6:06 AM, Stanley Xu <wenhao.xu@gmail.com> wrote:
> >
> > > Dear all,
> > >
> > > As I understand it, feature hashing in SGD compresses the feature
> > > dimensions into a fixed-length vector. Won't that make the training
> > > result incorrect if a hash collision happens? Won't two features hashed
> > > to the same slot be treated as the same feature? Even if we use multiple
> > > probes to reduce the total number of collisions, like a Bloom filter,
> > > won't a slot that has a collision still look like a combination feature?
> > >
> > > Thanks.
> > >
> > > Best wishes,
> > > Stanley Xu
> > >
> >
>
