mahout-commits mailing list archives

Subject [CONF] Apache Lucene Mahout: Class Discovery (page edited)
Date Wed, 17 Dec 2008 08:30:00 GMT
Class Discovery (MAHOUT) edited by abdelhakim deneche


h1. Intro


CDGA uses a Genetic Algorithm to discover a classification rule for a given dataset. 
A dataset can be seen as a table:

|| ||attribute 1||attribute 2||...||attribute N||
|row 1|value1|value2|...|valueN|
|row 2|value1|value2|...|valueN|
|row M|value1|value2|...|valueN|

An attribute can be numerical, for example a "temperature" attribute, or categorical, for
example a "color" attribute. For classification purposes, one of the categorical attributes
is designated as a *label*, which means that its value defines the *class* of the rows.
A classification rule can be represented as follows:
|| ||attribute 1||attribute 2||...||attribute N||
|weight|w1|w2|...|wN|
|operator|op1|op2|...|opN|
|value|value1|value2|...|valueN|

For a given *target* class and a weight *threshold*, the classification rule reads:

for each row of the dataset
  if (rule.w1 < threshold || (rule.w1 >= threshold && row.value1 rule.op1 rule.value1)) &&
     (rule.w2 < threshold || (rule.w2 >= threshold && row.value2 rule.op2 rule.value2)) &&
     ...
     (rule.wN < threshold || (rule.wN >= threshold && row.valueN rule.opN rule.valueN)) then
    row is part of the target class

*Important:* The label attribute is not evaluated by the rule.

The threshold parameter allows some conditions of the rule to be skipped if their weight is
too small. The operators available depend on the attribute types:
* for numerical attributes, the available operators are '<' and '>='
* for categorical attributes, the available operators are '!=' and '=='
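
The check performed for a single condition can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not CDGA's actual code: the names `OPERATORS`, `ALLOWED` and `condition_holds` are hypothetical, and attribute types are passed as plain strings.

```python
import operator

# Operators named in the text, mapped to Python functions.
OPERATORS = {
    "<": operator.lt,
    ">=": operator.ge,
    "!=": operator.ne,
    "==": operator.eq,
}

# Which operators each attribute type accepts.
ALLOWED = {
    "numerical": {"<", ">="},
    "categorical": {"!=", "=="},
}


def condition_holds(weight, op, rule_value, row_value, threshold, attr_type):
    """Return True if the condition is skipped or satisfied."""
    if op not in ALLOWED[attr_type]:
        raise ValueError(f"operator {op!r} is not valid for a {attr_type} attribute")
    if weight < threshold:
        return True  # weight too small: the condition is ignored
    return OPERATORS[op](row_value, rule_value)
```

A skipped condition returns True so that it never prevents a row from matching; only conditions whose weight reaches the threshold can reject a row.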

The "threshold" and "target" are user defined parameters, and because the label is always
a categorical attribute, the target is the (zero-based) index of the class label value among
all the possible values of the label. For example, if the label attribute can have the following
values (blue, brown, green), then a target of 1 means the "brown" class.

For example, we have the following dataset (the label attribute is "Eyes Color"):
|| ||Age||Eyes Color||Hair Color||
|row 1|16|brown|dark|
|row 2|25|green|light|
|row 3|12|blue|light|
and a classification rule:
|| ||Age||Hair Color||
|weight|0|1|
|operator|<|!=|
|value|20|light|
and the following parameters: threshold = 1 and target = 0 (brown).

This rule can be read as follows:
for each row of the dataset
  if (0 < 1 || (0 >= 1 && row.Age < 20)) &&
     (1 < 1 || (1 >= 1 && row.HairColor != light)) then
    row is part of the "brown Eyes Color" class

Note how the rule skips the label attribute (Eyes Color), and how the first condition
is ignored because its weight is below the threshold.
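
The example can be checked with a short Python sketch. The data layout (a dict of attributes per row, a list of tuples for the rule) and the names `OPS` and `matches` are assumptions made for illustration; they are not taken from CDGA.

```python
import operator

OPS = {"<": operator.lt, ">=": operator.ge, "!=": operator.ne, "==": operator.eq}

# Rows of the example dataset, with the label kept apart from the attributes.
rows = [
    {"attrs": {"Age": 16, "Hair Color": "dark"}, "label": "brown"},
    {"attrs": {"Age": 25, "Hair Color": "light"}, "label": "green"},
    {"attrs": {"Age": 12, "Hair Color": "light"}, "label": "blue"},
]

# One (attribute, weight, operator, value) entry per non-label attribute.
rule = [
    ("Age", 0, "<", 20),
    ("Hair Color", 1, "!=", "light"),
]
threshold = 1


def matches(rule, attrs, threshold):
    """True if every condition is either skipped (low weight) or satisfied."""
    return all(
        weight < threshold or OPS[op](attrs[name], value)
        for name, weight, op, value in rule
    )


predicted = [matches(rule, row["attrs"], threshold) for row in rows]
print(predicted)  # [True, False, False]
```

Only row 1 satisfies the rule, and row 1 is also the only row whose label is "brown", so on this small dataset the rule classifies every row correctly.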

h1. Running the example:
NOTE: Substitute in the appropriate version for the Mahout JOB jar

# cd <MAHOUT_HOME>/examples
# ant job
# {code}<HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/src/test/resources/wdbc{code}
# {code}<HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/src/test/resources/wdbc.infos{code}
# {code}<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/build/apache-mahout-examples-0.1-dev.job <MAHOUT_HOME>/examples/src/test/resources/wdbc 1 0.9 1 0.033 0.1 0 100 10{code}

*TODO*: Fill in what these parameters mean. See the CDGA class javadocs. Also fill in where to find the output and what it means.

h1. The info file:
To run properly, CDGA needs some information about the dataset. Each dataset should be accompanied
by an .infos file that contains the needed information. For each attribute, a corresponding
line in the .infos file describes it. The line can be one of the following:
  if the attribute is ignored
* LABEL, val1, val2,...
  if the attribute is the label (class), and its possible values
* CATEGORICAL, val1, val2,...
  if the attribute is categorical (nominal), and its possible values
* NUMERICAL, min, max
  if the attribute is numerical, and its min and max values
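
Reading such descriptor lines could look like the following Python sketch. It is only an illustration of the format described above, not CDGA's parser. The function name `parse_info_line` is hypothetical, and the keyword for an ignored attribute is assumed to be `IGNORED`, which the text above does not confirm.

```python
def parse_info_line(line):
    """Parse one descriptor line into (kind, details)."""
    fields = [field.strip() for field in line.split(",")]
    kind, args = fields[0], fields[1:]
    if kind in ("LABEL", "CATEGORICAL"):
        # The remaining fields are the possible values of the attribute.
        return kind, {"values": args}
    if kind == "NUMERICAL":
        # Exactly two fields are expected: the minimum and the maximum.
        low, high = (float(x) for x in args)
        return kind, {"min": low, "max": high}
    if kind == "IGNORED":  # assumed keyword, not confirmed by the text above
        return kind, {}
    raise ValueError(f"unrecognized descriptor: {line!r}")
```

For example, `parse_info_line("NUMERICAL, 12, 25")` returns `("NUMERICAL", {"min": 12.0, "max": 25.0})`.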

This file can be generated automatically using a special tool available with CDGA.

{code}$ <HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/build/apache-mahout-examples-0.1-dev.job dataset_path{code}

* the tool searches for an existing .infos file (*must be filled by the user*), in the same
directory as the dataset, with the same name and the ".infos" extension, that contains
the type of each attribute:
  ** 'N' numerical attribute
  ** 'C' categorical attribute
  ** 'L' label (this is also a categorical attribute)
  ** 'I' to ignore the attribute
  each attribute is on a separate line
* a Hadoop job is used to parse the dataset and collect the information. This means that
*the dataset can be distributed over HDFS*.
* the results are written back to the same .infos file, in the format needed by CDGA.
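
What the tool collects can be illustrated with a small in-memory Python sketch. It is not the Hadoop job itself. The function name `describe` is hypothetical, the `IGNORED` keyword is an assumption, and sorting the categorical values is a choice made here for reproducibility (the real tool's ordering is not stated above).

```python
def describe(rows, types):
    """Build one descriptor line per attribute.

    rows  -- list of rows, each a list of string values
    types -- one marker per attribute: 'N', 'C', 'L' or 'I'
    """
    lines = []
    for index, marker in enumerate(types):
        column = [row[index] for row in rows]
        if marker == "N":
            numbers = [float(value) for value in column]
            lines.append(f"NUMERICAL, {min(numbers)}, {max(numbers)}")
        elif marker in ("C", "L"):
            keyword = "LABEL" if marker == "L" else "CATEGORICAL"
            values = sorted(set(column))  # sorted only to get a stable order
            lines.append(", ".join([keyword] + values))
        elif marker == "I":
            lines.append("IGNORED")  # assumed keyword for ignored attributes
        else:
            raise ValueError(f"unknown type marker: {marker!r}")
    return lines


rows = [
    ["16", "brown", "dark"],
    ["25", "green", "light"],
    ["12", "blue", "light"],
]
print(describe(rows, ["N", "L", "C"]))
```

Numerical attributes only need a running minimum and maximum, and categorical attributes only need the set of distinct values, which is why this kind of collection is easy to distribute over a dataset split across many machines.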
