mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Isabel Drost (JIRA)" <>
Subject [jira] Commented: (MAHOUT-124) Online Classification using HBase
Date Tue, 11 Aug 2009 19:40:14 GMT


Isabel Drost commented on MAHOUT-124:

Alltogether really nice changes. The patch now applies to trunk without problems and builds
(except for the missing hbase dependency). As this will be one of the last reviews, I tried
to be a little more picky also with minor changes like added System.out.println and missing

The ant config file (build.xml) contains changes that I see nowhere explained. Are they supposed
to remain for the final patch?

In the examples concerning the TestClassifier
  - it has imports for* and java.util.* - for the final patch could you please revert
those to the specific imports?
  - could you please try to avoid reformatting the code as much as possible? It makes reading
patches a whole lot easier.
  - in line 129 there is quite a bit of code commented out - better through it out entirely?
If needed later the snippet is still in jira.
  - line 224 - have the timing statistics been left in intentionally?

  - The class is missing documentation. I guess your intention was to generate nGrams from
a line of text, not the whole document? Otherwise holding document and nGrams both in memory
seems a little bit much. There also seems to be no unit test for it?

The classes implementing the caching algorithms are missing documentation. At least some /**
{@inheritDoc} */ and a short comment on top that explains the purpose of the implemention
would be nice. (Same applies for Pair and Parameters).

CBayesNormalizerReducer still has HBase Dependencies - is it possible to factor them out?

BayesThetaNormalizerDriver - setting the number of map tasks was commented away compared with
trunk. Intentional?

BayesClassifierMapper - lines 106, 110 and following: Shouldn't the log message be something
like "Using ..." instead of "Testing ..."?

classifier/bayes/interfaces/algorithm/Algorithm - you still give a pointer to the datastore
with every method call to the Algorithm. Wouldn't the interface look cleaner if the Algorithm
would hold a reference to an initialized datastore and use that for further requests? I don't
think it is very likely that users will go to HBase for the first document to classify and
to an InMemoryStore for the next document.

bayes/algorithm/CBayesAlgorithm, BayesAlgorithm, bayes/common/ClassifierPriorityQueue - is
missing some basic javaDoc.

BayesTfIdfDriver, BayesTfIdfReducer, BayesWeightSummerReducer - I assume the dependency to
HBase cannot be factored out?

BayesFeatureMapper - there is a System.out.println in there...

One last question: You reference hbase-0.20.0 which is not released yet. I guess we should
include a prebuilt version in our lib directory and ship that until hbase has an official
release to use?

> Online Classification using HBase
> ---------------------------------
>                 Key: MAHOUT-124
>                 URL:
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>            Assignee: Isabel Drost
>             Fix For: 0.2
>         Attachments: MAHOUT-124-August-2.patch, MAHOUT-124-July-13.patch, MAHOUT-124-July-23.patch,
MAHOUT-124-July-6.patch, MAHOUT-124-June-23.patch
> #       Batch classification of flat file documents and flat file model:
> #       Storing the model in HBase and the end of Model Building Map/Reduce stages
> #       Using the model stored in HBase create an interface (both command line and web
service) to classify a give document
> #       Using the model stored in HBase, batch classify documents stored on the HDFS

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message