mahout-dev mailing list archives

From "Gokhan Capan (JIRA)" <>
Subject [jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
Date Mon, 14 Apr 2014 07:58:19 GMT


Gokhan Capan commented on MAHOUT-1178:

Well, I can add this, but considering the current status of the project, I don't think
people are interested in it any longer.
What do you say, [~ssc]: should we mark it 'Won't Fix' or commit it?

> GSOC 2013: Improve Lucene support in Mahout
> -------------------------------------------
>                 Key: MAHOUT-1178
>                 URL:
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Dan Filimon
>            Assignee: Gokhan Capan
>              Labels: gsoc2013, mentor
>             Fix For: 1.0
>         Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document, one document per line, and converting that result using
> standard Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields, that ought to work sensibly as well.
>  Two options are to dump multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  This might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.
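
For readers skimming the thread, here is a rough sketch of what "documents as rows" could look like, focusing on constraints a) and d) from the description above. It is plain Java with no Lucene or Mahout dependency, and every name in it (IndexAsMatrixSketch, addDocument, docIdForRow, columnOf) is hypothetical, not taken from the attached patches. It assumes each document has already been reduced to a term -> frequency map, as a dumped term vector would be.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch only: this is not the Lucene or Mahout API. */
class IndexAsMatrixSketch {

    /** Term -> column index, assigned in order of first appearance. */
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();
    /** Row index -> unique document id, so a row can be traced back (constraint d). */
    private final List<String> rowToDocId = new ArrayList<>();
    /** Sparse rows: column index -> term frequency. */
    private final List<Map<Integer, Double>> rows = new ArrayList<>();

    /** Appends one document's term frequencies as a sparse row and returns its row index. */
    int addDocument(String docId, Map<String, Integer> termFreqs) {
        Map<Integer, Double> row = new TreeMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Integer col = dictionary.get(e.getKey());
            if (col == null) {
                // Unseen term: give it the next free column.
                col = dictionary.size();
                dictionary.put(e.getKey(), col);
            }
            row.put(col, e.getValue().doubleValue());
        }
        rows.add(row);
        rowToDocId.add(docId);
        return rows.size() - 1;
    }

    /** Cell value; terms absent from a document read as 0.0. */
    double get(int row, int col) {
        Double v = rows.get(row).get(col);
        return v == null ? 0.0 : v;
    }

    /** Maps a matrix row back to the document it came from. */
    String docIdForRow(int row) {
        return rowToDocId.get(row);
    }

    /** Column index of a term, or -1 if the term has not been seen. */
    int columnOf(String term) {
        Integer c = dictionary.get(term);
        return c == null ? -1 : c;
    }

    int numRows() {
        return rows.size();
    }

    int numCols() {
        return dictionary.size();
    }

    public static void main(String[] args) {
        IndexAsMatrixSketch m = new IndexAsMatrixSketch();
        Map<String, Integer> doc = new LinkedHashMap<>();
        doc.put("lucene", 2);
        doc.put("index", 1);
        int row = m.addDocument("doc-a", doc);
        System.out.println(m.docIdForRow(row) + " has " + m.numCols() + " columns");
    }
}
```

Numeric fields, multiple text fields, and named vectors (constraints b, c, and e) are left out of the sketch on purpose; how to handle them is exactly the design discussion the description defers.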

This message was sent by Atlassian JIRA