mahout-dev mailing list archives

From "Andrew Palumbo (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAHOUT-1564) Naive Bayes Classifier for New Text Documents
Date Tue, 27 May 2014 21:52:04 GMT

     [ https://issues.apache.org/jira/browse/MAHOUT-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Palumbo updated MAHOUT-1564:
-----------------------------------

    Description: 
The MapReduce Naive Bayes implementation currently lacks the ability to classify a new document
(one outside of the training/holdout corpus).  I've begun some work on a "ClassifyNew" job which
will do the following:

1. Vectorize a new text document using the dictionary and document frequencies from the
training/holdout corpus.
    - Assume the original corpus was vectorized using `seq2sparse`; step (1) will reuse all
of the same parameters.

2. Score and label a new document using a previously trained model.
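
The two steps above can be sketched roughly as follows. This is only an illustration in plain Python, not the proposed "ClassifyNew" job and not Mahout's API: the names `vectorize` and `classify`, the toy dictionary, document frequencies, and model are all hypothetical, and the textbook TF-IDF and Laplace-smoothed multinomial scoring used here are assumptions that may differ from what `seq2sparse` and the trained NB model actually compute.

```python
import math
from collections import Counter


def vectorize(text, dictionary, doc_freq, num_docs):
    """Map a new document to a sparse {term_index: tf-idf weight} vector.

    Terms missing from the training dictionary are dropped, because the
    trained model holds no weights for them.
    """
    counts = Counter(tok for tok in text.lower().split() if tok in dictionary)
    vector = {}
    for term, tf in counts.items():
        idx = dictionary[term]
        # Textbook tf * idf; an assumption, not necessarily Mahout's formula.
        vector[idx] = tf * math.log(num_docs / doc_freq[idx])
    return vector


def classify(vector, model, alpha=1.0):
    """Return (best_label, scores) under a multinomial Naive Bayes model.

    model maps label -> {term_index: summed training weight}.
    alpha is the Laplace smoothing constant.
    """
    num_features = len({i for weights in model.values() for i in weights})
    scores = {}
    for label, weights in model.items():
        denom = sum(weights.values()) + alpha * num_features
        scores[label] = sum(
            w * math.log((weights.get(i, 0.0) + alpha) / denom)
            for i, w in vector.items()
        )
    return max(scores, key=scores.get), scores


# Toy data standing in for the training-time dictionary, document
# frequencies, and trained model (all hypothetical).
dictionary = {"goal": 0, "match": 1, "vote": 2, "election": 3}
doc_freq = {0: 2, 1: 3, 2: 2, 3: 1}
num_docs = 4
model = {
    "sports": {0: 5.0, 1: 4.0},
    "politics": {2: 5.0, 3: 4.0, 1: 1.0},
}

vec = vectorize("Late goal decides the match", dictionary, doc_freq, num_docs)
label, scores = classify(vec, model)
print(label)
```

Keeping `vectorize` and `classify` as plain functions with no job plumbing mirrors the intent stated below of keeping new logic separate from the MR code.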

I think it will be a useful addition to the NB package.  Unfortunately, this is going
to be mostly MR workhorse code and doesn't really introduce much new logic. I will try to
keep any new logic separate from the MR code so that it can be called from Scala for MAHOUT-1493.

  was:
MapReduce Naive Bayes implementation currently lacks the ability to classify a new document
(outside of the training/holdout corpus).  I've begun some work on a "ClassifyNew" job which
will do the following:

1. Vectorize a new text document using the dictionary and document frequencies from the training/holdout
corpus 
    - assuming the original corpus was vectorized using `seq2sparse`, step (1) will use all
of the same parameters. 

2. Score and Label a new document using a previously trained model.

I think that it will be a useful addition to the NB package.  Unfortunately, this is going
to be mostly MR workhorse code and doesn't really introduce much new logic. I will try to
keep any new logic separate from MR code so that it can be called from scala for MAHOUT-1493.


> Naive Bayes Classifier for New Text Documents
> ---------------------------------------------
>
>                 Key: MAHOUT-1564
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1564
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.9
>            Reporter: Andrew Palumbo
>             Fix For: 1.0
>
>
> The MapReduce Naive Bayes implementation currently lacks the ability to classify a new document
> (one outside of the training/holdout corpus).  I've begun some work on a "ClassifyNew" job which
> will do the following:
> 1. Vectorize a new text document using the dictionary and document frequencies from the
> training/holdout corpus.
>     - Assume the original corpus was vectorized using `seq2sparse`; step (1) will reuse
> all of the same parameters.
> 2. Score and label a new document using a previously trained model.
> I think it will be a useful addition to the NB package.  Unfortunately, this is
> going to be mostly MR workhorse code and doesn't really introduce much new logic. I will try
> to keep any new logic separate from the MR code so that it can be called from Scala for MAHOUT-1493.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
