mahout-user mailing list archives

From Grant Ingersoll <>
Subject Re: Will "mahout arff.vector" correctly convert string attributes?
Date Wed, 28 Dec 2011 21:55:50 GMT

On Dec 23, 2011, at 6:21 PM, Donald A. Smith wrote:

> More on conversion from ARFF files:
> Looking at the code in (below), each string in the document is assigned a separate
> double (converted from an integer value). Nominals are treated similarly: each possible
> nominal/symbolic value is assigned an integer-valued double.
> When strings (or nominals) are converted to doubles, it seems to me that the conversion
> adds additional irrelevant structure that I don't want. Depending on the order in which
> the strings are added, the assigned doubles will vary. Adjacent strings in the ordering
> will be close together in the metric space/distance measure. For example, if "john" is 1,
> "bob" is 2, and "nancy" is 3, then john is closer to bob than to nancy. For nominals, that
> seems wrong. Most users will probably really want three binary attributes: one for john,
> one for bob, and one for nancy.

We could perhaps use the SGD vector encoding stuff here?  
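The distance concern raised above can be made concrete with a small sketch. This is plain Python, not Mahout's SGD encoders; the names `ordinal`, `one_hot`, and `euclidean` are made up for illustration. It compares the integer coding from the john/bob/nancy example with one binary attribute per value.

```python
import math

names = ["john", "bob", "nancy"]

# Integer coding: each value gets the next integer, as in the example above.
ordinal = {name: float(i + 1) for i, name in enumerate(names)}

# One-hot coding: one binary attribute per possible value.
one_hot = {
    name: [1.0 if j == i else 0.0 for j in range(len(names))]
    for i, name in enumerate(names)
}

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# With integer coding the distances depend on insertion order.
d_ord_john_bob = abs(ordinal["john"] - ordinal["bob"])
d_ord_john_nancy = abs(ordinal["john"] - ordinal["nancy"])

# With one-hot coding every pair of distinct values is equally far apart.
d_hot_john_bob = euclidean(one_hot["john"], one_hot["bob"])
d_hot_john_nancy = euclidean(one_hot["john"], one_hot["nancy"])
```

Under integer coding john is twice as far from nancy as from bob, purely because of the order in which the names were seen. Under one-hot coding both distances are equal.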

> Am I correct that representing nominals and strings as doubles (in a single attribute)
> introduces distracting structure (distance relations)? Maybe I'm missing something.
> What I may want is to create a different attribute for each possible value of each component
> of the URL (counting from the left). Attributes component1_1 through component1_k would
> be binary attributes representing the k possible values in the first component of the URL.
> Similarly for component2_1, ... Weka has its own utility class for converting string attributes
> to nominal attributes. That might give me what I want for path-based
> data. I'd need to preprocess the data.

Or implement your own ARFFModel.
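As a preprocessing step, the per-position URL attributes described above might look roughly like the following. This is only a sketch under assumptions: it treats the URL path segments as the "components", uses hypothetical helper names (`url_components`, `build_vocab`, `encode`), invented example URLs, and has nothing to do with the ARFFModel interface itself.

```python
from urllib.parse import urlparse

def url_components(url):
    """Split the URL path into its non-empty segments, left to right."""
    path = urlparse(url).path
    return [part for part in path.split("/") if part]

def build_vocab(urls):
    """Assign one column index to each distinct (position, value) pair."""
    vocab = {}
    for url in urls:
        for pos, value in enumerate(url_components(url), start=1):
            vocab.setdefault((pos, value), len(vocab))
    return vocab

def encode(url, vocab):
    """Return a dense binary vector; unseen (position, value) pairs are ignored."""
    vec = [0.0] * len(vocab)
    for pos, value in enumerate(url_components(url), start=1):
        idx = vocab.get((pos, value))
        if idx is not None:
            vec[idx] = 1.0
    return vec

# Made-up URLs, for illustration only.
urls = [
    "http://example.com/docs/api/index.html",
    "http://example.com/docs/guide/intro.html",
    "http://example.com/blog/api/post.html",
]
vocab = build_vocab(urls)
vectors = [encode(u, vocab) for u in urls]
```

Because the key is the (position, value) pair, "api" as the second component gets a different column than "api" would as the first component, which preserves the left-to-right ordering mentioned above.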

> For URLs I have additional structure: ordering on the URL components. But if I just
> wanted to represent a document as an unordered bag-of-words, then each possible string or
> nominal should become a separate binary attribute, doesn't seem
> to do the right thing.

We can patch this if you have an alternate implementation.
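For reference, the unordered bag-of-words representation being asked for could be sketched as below. This is an illustration only, not a proposed patch: the function names are hypothetical and tokens are split on whitespace for simplicity. A set of column indices serves as a compact binary form, and a `Counter` is the variant to use when counts matter.

```python
from collections import Counter

def build_index(documents):
    """Assign one column index to each distinct token."""
    index = {}
    for doc in documents:
        for token in doc.split():
            index.setdefault(token, len(index))
    return index

def binary_bag(doc, index):
    """Sparse binary form: the set of columns whose attribute is 1."""
    return {index[t] for t in doc.split() if t in index}

def count_bag(doc, index):
    """Sparse count form: column index -> number of occurrences."""
    return Counter(index[t] for t in doc.split() if t in index)

docs = ["bob called nancy", "nancy called nancy back"]
index = build_index(docs)
binary = binary_bag(docs[1], index)
counts = count_bag(docs[1], index)
```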

> Seems like a compressed binary format would be useful for representing such attributes,
> unless you also needed a count.
>  Thanks, Don
> --- On Wed, 12/21/11, Grant Ingersoll <> wrote:
>     From: Grant Ingersoll <>
>     Subject: Re: Will "mahout arff.vector" correctly convert string attributes?
>     To:
>     Date: Wednesday, December 21, 2011, 10:09 AM
>     The javadocs on ARFFVectorIterable say:
>     * Attribute type handling:
>     * <ul>
>     * <li>Numeric -> As is</li>
>     * <li>Nominal -> ordinal(value) i.e. @attribute lumber {'\'(-inf-0.5]\'','\'(0.5-inf)\''}
>     * will convert -inf-0.5 -> 0, and 0.5-inf -> 1</li>
>     * <li>Dates -> Convert to time as a long</li>
>     * <li>Strings -> Create a map of String -> long</li>
>     * </ul>
>     The code for this is in MapBackedARFFModel, which implements ARFFModel, so I suspect
>     that if it doesn't do exactly as you wish, it can be overridden.
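The type handling listed in that javadoc can be paraphrased as a small sketch. This is not Mahout's MapBackedARFFModel; the class `SimpleArffModel` and its methods are hypothetical. It also assumes that "time as a long" means epoch milliseconds in UTC, that dates use the ISO-like format shown, and that string ids start at 0.

```python
from datetime import datetime, timezone

class SimpleArffModel:
    """Illustrative stand-in for the conversions listed in the javadoc."""

    def __init__(self, nominal_values):
        # attribute name -> declared nominal values, in declaration order
        self.nominal_values = nominal_values
        # string value -> id, assigned in order of first appearance (assumption)
        self.string_ids = {}

    def numeric(self, raw):
        """Numeric -> as is."""
        return float(raw)

    def nominal(self, attribute, raw):
        """Nominal -> position of the value in the attribute declaration."""
        return float(self.nominal_values[attribute].index(raw))

    def date(self, raw, fmt="%Y-%m-%dT%H:%M:%S"):
        """Date -> epoch milliseconds, interpreting the input as UTC (assumption)."""
        dt = datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
        return float(int(dt.timestamp() * 1000))

    def string(self, raw):
        """String -> id from a map that grows as new strings are seen."""
        return float(self.string_ids.setdefault(raw, len(self.string_ids)))

model = SimpleArffModel({"lumber": ["(-inf-0.5]", "(0.5-inf)"]})
```

Note that the string ids depend on the order in which values are first encountered, which is exactly the property questioned later in the thread.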
>     On Dec 21, 2011, at 12:37 PM, Donald A. Smith wrote:
>     > Weka's ARFF format allows string attributes.
>     >
>     >   @ATTRIBUTE userName string
>     >
>     > Will "mahout arff.vector" correctly handle conversion from such strings to vectors
>     > in such a way that the attribute will, effectively, be treated the same as a nominal attribute?
>     > That is, will the set of strings be converted into a set of nominal attributes (one for each
>     > possible string value)?
>     >
>     >   @ATTRIBUTE userName {bob, fred, harry, jill, betsy, george, bill}
>     >
>     > In general, will I lose any information by using arff.vector?
>     >
>     > For date attributes, will mahout insert derived attributes (hour of day, day
>     > of week)? I presume not and I presume I have to add them myself.
>     >
>     >  Thanks, Don
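If the derived date attributes do have to be added by hand before conversion, a preprocessing helper could look something like this. It is a sketch with a hypothetical function name, assuming ISO-like timestamps and Python's convention of Monday = 0 for the day of week.

```python
from datetime import datetime

def derived_date_features(raw, fmt="%Y-%m-%dT%H:%M:%S"):
    """Return hour of day (0-23) and day of week (Monday = 0) for a timestamp."""
    dt = datetime.strptime(raw, fmt)
    return {"hour_of_day": dt.hour, "day_of_week": dt.weekday()}

features = derived_date_features("2011-12-28T21:55:50")
```

These values could then be written out as extra numeric or nominal columns in the ARFF file.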
>     --------------------------------------------
>     Grant Ingersoll

Grant Ingersoll
