mahout-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Apache Mahout > Import Export Sequence File Formats
Date Mon, 12 Sep 2011 02:45:00 GMT
Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Import Export Sequence File Formats (https://cwiki.apache.org/confluence/display/MAHOUT/Import+Export+Sequence+File+Formats)

Added by Lance Norskog:
---------------------------------------------------------------------
h5. Status
This is a talk page.
h1. Scope of Project
There are different kinds of import/export problems. One class of problem is defining the set
of SequenceFile formats that a "Mahout Job" will import and export. This page is limited to
the SequenceFile problem.
h1. Use Cases
h3. Lucene "Bag-of-words" vector
This is a NamedVector file containing a String key and a sparse-encoded vector. There may
be an external dictionary defining documents and/or terms.
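
A minimal sketch of this record shape, in plain Python rather than Mahout's Java classes. It only
illustrates the idea of a String key paired with a sparse vector and an external term dictionary.
The function names, the whitespace tokenizer, and the sample documents are assumptions made for
illustration, not part of Mahout or Lucene.

```python
from collections import Counter


def build_dictionary(docs):
    """Assign each distinct term an integer index, in order of first appearance."""
    dictionary = {}
    for text in docs.values():
        for term in text.lower().split():
            dictionary.setdefault(term, len(dictionary))
    return dictionary


def to_named_vector(name, text, dictionary):
    """Return (name, sparse_vector); sparse_vector maps term index -> term count."""
    counts = Counter(t for t in text.lower().split() if t in dictionary)
    return name, {dictionary[term]: float(n) for term, n in counts.items()}


# Hypothetical sample documents, keyed by document name.
docs = {"doc1": "the cat sat on the mat", "doc2": "the dog sat"}
dictionary = build_dictionary(docs)
records = [to_named_vector(name, text, dictionary) for name, text in docs.items()]
```

Only the non-zero entries are stored, which is what makes the encoding sparse. The dictionary is
kept separate from the records, matching the "external dictionary" mentioned above.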
h5. Import
The various Bayes text classification jobs, such as the Wikipedia example, import Lucene
bag-of-words Vector files.
h5. Export
Feature vectors derived from text vectors are useful in text-oriented machine learning research.
An example:
* Compare a feature vector to all of the original text vectors. This searches for "exemplar"
documents that most comprehensively match the given feature. A number of papers discuss
this approach for creating document abstracts from sentence vectors.
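
The exemplar search above can be sketched as follows. The page does not say how vectors are
compared; cosine similarity is assumed here as one common choice. The function names and the
sample vectors are hypothetical, and sparse vectors are represented as plain index-to-value
dicts rather than Mahout Vector objects.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {index: value} dicts."""
    dot = sum(value * b.get(index, 0.0) for index, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)


def rank_exemplars(feature, text_vectors, top_n=3):
    """Return up to top_n (name, score) pairs, best-matching document first."""
    scored = [(cosine(feature, vec), name) for name, vec in text_vectors.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_n]]


# Hypothetical text vectors and a feature vector that loads on term 0 only.
text_vectors = {
    "doc1": {0: 2.0, 1: 1.0},
    "doc2": {1: 1.0, 2: 3.0},
    "doc3": {0: 1.0},
}
feature = {0: 1.0}
ranking = rank_exemplars(feature, text_vectors)
```

The document at the head of `ranking` is the candidate exemplar for the feature.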
h3. Confusion Matrix 
A classification job creates, among other things, a Confusion Matrix. The current example jobs
log a text version of the confusion matrix.
h5. Import
Comparing confusion matrices from different classification runs lets you evaluate tuning knobs
for a classifier.
h5. Export
Exporting the confusion matrix from each classification run makes it available for the
comparison described under Import.
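
A small sketch of what comparing two runs might look like once their confusion matrices are
available as data rather than log text. It assumes rows are actual labels and columns are
predicted labels. The matrices, labels, and function names are made up for illustration and do
not come from a real Mahout run.

```python
def accuracy(matrix):
    """Fraction of items classified correctly: diagonal sum over grand total."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def per_class_recall(matrix, labels):
    """Recall for each actual label: diagonal cell over its row total."""
    recall = {}
    for i, (label, row) in enumerate(zip(labels, matrix)):
        row_total = sum(row)
        recall[label] = row[i] / row_total if row_total else 0.0
    return recall


labels = ["spam", "ham"]
# Hypothetical results from two runs with different tuning settings.
run_a = [[40, 10], [5, 45]]
run_b = [[45, 5], [10, 40]]

summary = {
    "run_a": (accuracy(run_a), per_class_recall(run_a, labels)),
    "run_b": (accuracy(run_b), per_class_recall(run_b, labels)),
}
```

In this made-up example both runs have the same overall accuracy but trade recall between the
two classes, which is the kind of difference a matrix-level comparison exposes.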




