mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Annotation based vectorizer
Date Mon, 03 Feb 2014 23:12:16 GMT
Looks nice.

Where is the dictionary injected?

Would type inferencing of the sort used in Guava Lists.newArrayList() help
the verbosity?
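
A minimal sketch of the factory-method idea, in the spirit of Guava's Lists.newArrayList(). Vectorizers, SimpleVectorizer and forClass are hypothetical names used only for illustration; they are not part of Frank's branch or of Mahout, and the stand-in class does no actual vectorizing.

```java
// Hypothetical stand-in for the vectorizer discussed in this thread; not Mahout code.
final class Vectorizers {

    /** Minimal stand-in that only remembers the class it was built for. */
    static final class SimpleVectorizer<T> {
        private final Class<T> type;

        private SimpleVectorizer(Class<T> type) {
            this.type = type;
        }

        Class<T> getType() {
            return type;
        }
    }

    private Vectorizers() {
        // Static factory holder; not meant to be instantiated.
    }

    /** Static generic factory: the compiler infers T from the argument. */
    static <T> SimpleVectorizer<T> forClass(Class<T> type) {
        return new SimpleVectorizer<T>(type);
    }
}
```

With a factory like this the type name appears twice instead of three times at the call site, e.g. Vectorizers.SimpleVectorizer<String> v = Vectorizers.forClass(String.class);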

What is the type reference used for?
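
For context, an empty anonymous subclass such as new TypeReference<NewsgroupPost>(){} usually signals the "super type token" idiom, which recovers a generic type argument at runtime despite erasure. The sketch below shows that idiom with a hypothetical TypeToken class; whether the TypeReference in Frank's branch works this way is an assumption.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

// Hypothetical illustration of the super type token idiom; not code from the branch.
final class TypeTokenExample {

    /** Captures T by reading the generic superclass of an anonymous subclass. */
    abstract static class TypeToken<T> {
        private final Type type;

        protected TypeToken() {
            // For "new TypeToken<String>() {}" the generic superclass is TypeToken<String>.
            ParameterizedType superclass =
                (ParameterizedType) getClass().getGenericSuperclass();
            this.type = superclass.getActualTypeArguments()[0];
        }

        Type getType() {
            return type;
        }
    }

    /** Returns the type argument captured by an anonymous TypeToken<String>. */
    static Type demo() {
        return new TypeToken<String>() {}.getType();
    }
}
```

A plain Class<T> argument carries the same information for non-generic types such as NewsgroupPost; the token form only becomes necessary for parameterized types like List<String>.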

What if the POJO has a Vector in it?  Is there a way to deal with that?

How can I vectorize a second (test) data set compatibly with the first?
(That is, how do I pass the Dictionary to the second case?)
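
One possible shape for this, sketched under assumptions: the dictionary is created by the caller and handed to both encoders, so a label seen in training maps to the same integer in the test set. LabelDictionary and TargetEncoder are hypothetical names, not classes from Frank's branch or from Mahout.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from the branch or from Mahout.
final class SharedDictionaryExample {

    /** Assigns each distinct label the next free integer code. */
    static final class LabelDictionary {
        private final Map<String, Integer> codes = new LinkedHashMap<String, Integer>();

        int intern(String label) {
            Integer code = codes.get(label);
            if (code == null) {
                code = codes.size();
                codes.put(label, code);
            }
            return code;
        }
    }

    /** Encodes targets through a dictionary supplied by the caller. */
    static final class TargetEncoder {
        private final LabelDictionary dictionary;

        TargetEncoder(LabelDictionary dictionary) {
            this.dictionary = dictionary;
        }

        int getTarget(String label) {
            return dictionary.intern(label);
        }
    }

    /** Encodes two training labels, then one test label, against a shared dictionary. */
    static int[] demo() {
        LabelDictionary shared = new LabelDictionary();
        TargetEncoder train = new TargetEncoder(shared);
        TargetEncoder test = new TargetEncoder(shared);

        int first = train.getTarget("comp.graphics");
        int second = train.getTarget("sci.space");
        int again = test.getTarget("sci.space"); // same code as in training
        return new int[] {first, second, again};
    }
}
```

Injecting the dictionary through the constructor keeps the two encoders independent objects while guaranteeing a consistent label-to-integer mapping.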




On Mon, Feb 3, 2014 at 1:53 PM, Frank Scholten <frank@frankscholten.nl> wrote:

> The second field of Newsgroup should be called bodyText of course.
>
>
> On Mon, Feb 3, 2014 at 10:52 PM, Frank Scholten <frank@frankscholten.nl> wrote:
>
> > Hi all,
> >
> > I put together a utility which vectorizes plain old Java objects
> > annotated with @Feature and @Target via Mahout's vector encoders.
> >
> > See my Github branch:
> > https://github.com/frankscholten/mahout/tree/annotation-based-vectorizer
> >
> > and the unit test:
> >
> > https://github.com/frankscholten/mahout/blob/annotation-based-vectorizer/core/src/test/java/org/apache/mahout/classifier/sgd/AnnotationBasedVectorizerTest.java
> >
> > Use it like this:
> >
> > class NewsgroupPost {
> >
> >   @Target
> >   private String newsgroup;
> >
> >   @Feature(encoder = TextValueEncoder.class)
> >   private String newsgroup;
> >
> >   // Getters & setters
> >
> > }
> >
> > AnnotationBasedVectorizer<NewsgroupPost> vectorizer =
> >     new AnnotationBasedVectorizer<NewsgroupPost>(new TypeReference<NewsgroupPost>(){});
> >
> > Here the vectorizer scans the NewsgroupPost's annotations. Then you can
> > do this:
> >
> > NewsgroupPost post = ...
> >
> > Vector vector = vectorizer.vectorize(post);
> > int target = vectorizer.getTarget(post);
> > int numFeatures = vectorizer.getNumberOfFeatures();
> >
> > Note that the vectorize() and getTarget() methods are generically typed,
> > and due to the type token passed in the constructor we can enforce that
> > only NewsgroupPosts are accepted.
> >
> > The vectorizer uses a Dictionary for encoding the target.
> >
> > Thoughts?
> >
> > Cheers,
> >
> > Frank
> >
>
