mahout-user mailing list archives

From sam wu <swu5...@gmail.com>
Subject Re: Random Forest possible error
Date Sun, 15 Dec 2013 04:41:42 GMT
When running random forest and loading a feature descriptor from a JSON file
with ignored features, the algorithm fails.
The root cause is in Dataset.java, in the fromJSON(String json) function:
---------------------------

  public static Dataset fromJSON(String json) {
    List<Map<String, Object>> fromJSON;
    try {
      fromJSON = OBJECT_MAPPER.readValue(json, new TypeReference<List<Map<String, Object>>>() {});
    } catch (Exception ex) {
      throw new RuntimeException(ex);
    }

    List<Attribute> attributes = Lists.newLinkedList();
    List<Integer> ignored = Lists.newLinkedList();
    String[][] nominalValues = new String[fromJSON.size()][];
    Dataset dataset = new Dataset();

    for (int i = 0; i < fromJSON.size(); i++) {
      Map<String, Object> attribute = fromJSON.get(i);

      if (Attribute.fromString((String) attribute.get(TYPE)) == Attribute.IGNORED) {
        ignored.add(i);
      } else {
        Attribute asAttribute = Attribute.fromString((String) attribute.get(TYPE));
        attributes.add(asAttribute);

        if ((Boolean) attribute.get(LABEL)) {
          dataset.labelId = i - ignored.size();
        }

        if (attribute.get(VALUES) != null) {
          List<String> get = (List<String>) attribute.get(VALUES);
          String[] array = get.toArray(new String[get.size()]);
          // nominalValues[i] = array;               <-- line 400, original: wrong
          nominalValues[i - ignored.size()] = array; // <-- line 400, new: fixes the problem
        }
      }
    }

    dataset.attributes = attributes.toArray(new Attribute[attributes.size()]);
    dataset.ignored = new int[ignored.size()];
    dataset.values = nominalValues;

    for (int i = 0; i < dataset.ignored.length; i++) {
      dataset.ignored[i] = ignored.get(i);
    }

    return dataset;
  }

----------------------------------------------------------

****** nominalValues[i] = array;                  <-- line 400, original: wrong
****** nominalValues[i - ignored.size()] = array; <-- line 400, new: fixes the problem
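
To make the effect of the wrong index concrete, here is a minimal, self-contained
sketch (not Mahout code; the attribute types and values below are invented for
illustration) of why nominal values must be stored at the compacted index
i - ignored.size() rather than at the raw descriptor index i:

  import java.util.ArrayList;
  import java.util.List;

  public class IgnoredIndexSketch {
    public static void main(String[] args) {
      // Hypothetical descriptor: attribute 0 is ignored, attribute 1 is
      // categorical with values {"yes", "no"}, attribute 2 is the label.
      String[] types = {"ignored", "categorical", "label"};
      String[][] rawValues = {null, {"yes", "no"}, null};

      List<Integer> ignored = new ArrayList<Integer>();
      String[][] nominalValues = new String[types.length][];

      for (int i = 0; i < types.length; i++) {
        if ("ignored".equals(types[i])) {
          ignored.add(i);
          continue;
        }
        if (rawValues[i] != null) {
          // The original code did nominalValues[i] = rawValues[i], leaving the
          // values at raw index 1 while the other dataset arrays are compacted.
          nominalValues[i - ignored.size()] = rawValues[i]; // fixed indexing
        }
      }

      System.out.println(nominalValues[0][0]); // prints "yes" with the fix
    }
  }

With the original indexing, a lookup through the compacted attribute id lands on
an empty slot instead of the stored values, which matches the -1 returns
discussed below.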

I ran several tests on my own data, and it works as expected.

I'll file a JIRA, and if no one takes ownership, I'll submit a patch.


Sam



On Sat, Dec 14, 2013 at 4:05 PM, sam wu <swu5530@gmail.com> wrote:

> Hi Ted,
>
> After some more debugging, my previous statement is not correct, so please
> disregard it.
> There is a problem, I am sure. I am using InMemoryMapper, one of the ways to
> load data, and I found the problem there.
> I am going to compare it with the other approaches (partial, Breiman) to see
> what the difference is.
>
> My bad, well, it's Saturday!
>
> Sam
>
>
> On Sat, Dec 14, 2013 at 1:38 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
>> Can you file a JIRA at https://issues.apache.org/jira/browse/MAHOUT ?
>>
>> It sounds like you have a test case in mind along with your fix.  If you
>> could package that work up as a patch file, then it would be much
>> appreciated.
>>
>>
>> On Sat, Dec 14, 2013 at 9:24 AM, sam wu <swu5530@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > I am using Mahout's random forest. It works well when I don't use a
>> > feature descriptor with an Ignored feature (no I flag).
>> >
>> > If I use the Ignored flag, the returned feature value is -1
>> > (in the code, dataset.valueOf(aId, token) returns -1).
>> >
>> > I did some investigation and found that there are some problems in
>> > DataConverter.java.
>> >
>> > source code
>> > ------
>> >
>> >     for (int attr = 0; attr < nball; attr++) { // line 51
>> >       if (ArrayUtils.contains(dataset.getIgnored(), attr)) {
>> >         continue; // IGNORED
>> >       }
>> >
>> >       String token = tokens[attr].trim();
>> >
>> >       if ("?".equals(token)) {
>> >         // missing value
>> >         return null;
>> >       }
>> >
>> >       if (dataset.isNumerical(aId)) { // line 63
>> >         vector.set(aId++, Double.parseDouble(token));
>> >       } else { // CATEGORICAL
>> >         vector.set(aId, dataset.valueOf(aId, token)); // line 66
>> >         aId++;
>> >       }
>> > -------
>> > Let the feature descriptor be 9 I N L (the Breiman example):
>> > 11 features, where 1-9 are Ignored, the 10th is Numerical, and the 11th is
>> > the label variable.
>> > (Does the Breiman example really work when following the web instructions?)
>> >
>> > At line 51, attr is the raw feature index, 0-10.
>> > aId is the filtered feature index, 0-1 (two non-Ignored features).
>> > The problem is at line 66:
>> > with attr = 10 (the label feature), aId = 1 and token = "true", and
>> > dataset.valueOf(aId, token) returns -1; in the current code, the CATEGORICAL
>> > branch of valueOf() mixes up the aId and attr concepts.
>> >
>> > Just changing line 66 from
>> > vector.set(aId, dataset.valueOf(aId, token));
>> > to
>> > vector.set(aId, dataset.valueOf(attr, token));
>> > does not work, because some validation then fails (also an attr/aId mixture).
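>> >
>> > To illustrate the mismatch, here is a small stand-alone sketch (not the
>> > Mahout code; it simply mirrors the 9 I N L descriptor) of how attr and aId
>> > diverge:
>> >
>> >   public class AttrVsAidSketch {
>> >     public static void main(String[] args) {
>> >       // Descriptor 9 I N L: raw attributes 0-8 ignored, 9 numerical, 10 the label.
>> >       int nball = 11;
>> >       int aId = 0;
>> >       for (int attr = 0; attr < nball; attr++) {
>> >         if (attr <= 8) {
>> >           continue; // ignored attribute: attr advances, aId does not
>> >         }
>> >         System.out.println("raw attr " + attr + " -> compacted aId " + aId);
>> >         aId++;
>> >       }
>> >       // prints: raw attr 9 -> compacted aId 0
>> >       //         raw attr 10 -> compacted aId 1
>> >     }
>> >   }
>> >
>> > So for the label, attr is 10 but aId is only 1, and any code that keys its
>> > value arrays by one of the two while being called with the other will miss
>> > and return -1.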
>> >
>> >
>> >
>> > There might be things that I have overlooked; these are just some thoughts.
>> >
>> >
>> > Sam
>> >
>>
>
>
