Mailing-List: contact dev-help@mahout.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@mahout.apache.org
Date: Wed, 10 Apr 2013 22:57:16 +0000 (UTC)
From: "Ted Dunning (JIRA)" <jira@apache.org>
To: dev@mahout.apache.org
Message-ID: <JIRA.12639795.1364555387355.146125.1365634636842@arcas>
In-Reply-To: <JIRA.12639795.1364555387355@arcas>
References: <JIRA.12639795.1364555387355@arcas>
Subject: [jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support
 in Mahout
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628397#comment-13628397 ] 

Ted Dunning commented on MAHOUT-1178:
-------------------------------------

{quote}
Ted, do you think this should load the entire index into memory as a matrix? Or should it query the index each time a get request is made? (And if so, should the set methods also update the Lucene index itself?)
{quote}

My own interests would be:

a) a flexible schema (you satisfied that with your proposed implementation)

b) a fast iterator that gives me one sparse vector per document in the index, in index order.  If I get multiple iterators, one for each matrix view of the index, that is just fine.
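
As an illustration of (b), here is a minimal sketch of an iterator that yields one sparse row per document in index order, with a separate iterator per field view. Everything in it is hypothetical: the class IndexMatrixSketch, its methods, and the toy in-memory "index" stand in for a real Lucene index, and a plain map of column number to count stands in for a Mahout sparse vector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a toy in-memory "index", not the Lucene or Mahout API.
final class IndexMatrixSketch {
    // Documents in index order; each maps a field name to its tokens.
    private final List<Map<String, List<String>>> docs = new ArrayList<>();
    // One term dictionary (term -> column number) per field, i.e. per matrix view.
    private final Map<String, Map<String, Integer>> dictionaries = new HashMap<>();

    void addDocument(Map<String, List<String>> doc) {
        docs.add(doc);
        for (Map.Entry<String, List<String>> field : doc.entrySet()) {
            Map<String, Integer> dict =
                dictionaries.computeIfAbsent(field.getKey(), k -> new LinkedHashMap<>());
            for (String term : field.getValue()) {
                dict.putIfAbsent(term, dict.size()); // next free column for a new term
            }
        }
    }

    // Returns a fresh iterator over one field's view. Rows come back in index
    // order; each row is a sparse vector of column number -> term count.
    Iterator<Map<Integer, Double>> rows(final String field) {
        final Map<String, Integer> dict =
            dictionaries.getOrDefault(field, Collections.<String, Integer>emptyMap());
        final Iterator<Map<String, List<String>>> it = docs.iterator();
        return new Iterator<Map<Integer, Double>>() {
            @Override public boolean hasNext() { return it.hasNext(); }

            @Override public Map<Integer, Double> next() {
                Map<Integer, Double> row = new TreeMap<>();
                List<String> tokens =
                    it.next().getOrDefault(field, Collections.<String>emptyList());
                for (String term : tokens) {
                    row.merge(dict.get(term), 1.0, Double::sum);
                }
                return row;
            }
        };
    }

    public static void main(String[] args) {
        IndexMatrixSketch index = new IndexMatrixSketch();
        Map<String, List<String>> d0 = new HashMap<>();
        d0.put("title", Arrays.asList("x"));
        d0.put("body", Arrays.asList("a", "b", "a"));
        index.addDocument(d0);
        Map<String, List<String>> d1 = new HashMap<>();
        d1.put("title", Arrays.asList("y", "x"));
        d1.put("body", Arrays.asList("b", "c"));
        index.addDocument(d1);

        // Two independent iterators, one per field view of the same index.
        for (String field : Arrays.asList("title", "body")) {
            Iterator<Map<Integer, Double>> rows = index.rows(field);
            while (rows.hasNext()) {
                System.out.println(field + " " + rows.next());
            }
        }
    }
}
```

Each call to rows(field) is independent of the others, which matches the "one iterator per matrix view" idea. How columns would actually be assigned from a real index is left open here.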

You have proposed some additional potential operations:

c) getRow(int rowNumber /* not doc id */) and get(int rowNumber, int colNumber)

d) putRow(int rowNumber, Vector doc)

I am not sure how valuable (c) is, since the rowNumber has little external meaning.

I think that (d) is pretty much impossible given the difficulty of reverse engineering the vector back into a document.  I could be wrong, and that would be intriguing.  We would also need the ability to update different matrix views of the index independently in order to update different fields.  If it is possible, it would be kind of cool.
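
One plausible reading of the difficulty in (d), offered here as an assumption rather than as the author's stated reasoning, is that a term-count vector discards information such as token order, so the original text cannot be recovered from it. The sketch below uses a hypothetical whitespace tokenizer (termCounts is not a Lucene or Mahout method) to show two different texts collapsing to the same vector.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: shows that a bag-of-words vector is not invertible.
final class PutRowSketch {
    // Counts lower-cased, whitespace-separated tokens (a stand-in for an analyzer).
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = termCounts("dog bites man");
        Map<String, Integer> b = termCounts("man bites dog");
        // Different texts, identical vectors: a putRow would not know which to store.
        System.out.println(a.equals(b));
    }
}
```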

One addition that I would find *very* interesting and helpful would be a variant of (b) that provides the same thing, but for a query result rather than the entire index.
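
A rough sketch of that variant, under stated assumptions: the query result is modeled as a plain set of matching row numbers, and restrictTo is a hypothetical helper that wraps any full-index row iterator. Nothing here is Lucene or Mahout API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Set;

// Hypothetical sketch: restrict an index-order row iterator to a set of hits.
final class QueryRowsSketch {
    // Yields only the rows whose row number is in hits, still in index order.
    static <V> Iterator<V> restrictTo(final Iterator<V> allRows, final Set<Integer> hits) {
        return new Iterator<V>() {
            private int position = 0;    // row number of the next row from allRows
            private V pending;           // next matching row, if already found
            private boolean hasPending;

            // Scans forward until a matching row is buffered or the input ends.
            private void advance() {
                while (!hasPending && allRows.hasNext()) {
                    V row = allRows.next();
                    if (hits.contains(position++)) {
                        pending = row;
                        hasPending = true;
                    }
                }
            }

            @Override public boolean hasNext() {
                advance();
                return hasPending;
            }

            @Override public V next() {
                advance();
                if (!hasPending) {
                    throw new NoSuchElementException();
                }
                V row = pending;
                pending = null;
                hasPending = false;
                return row;
            }
        };
    }

    public static void main(String[] args) {
        Iterator<String> all = Arrays.asList("row0", "row1", "row2", "row3").iterator();
        Set<Integer> hits = new HashSet<>(Arrays.asList(1, 3)); // pretend query result
        Iterator<String> matched = restrictTo(all, hits);
        while (matched.hasNext()) {
            System.out.println(matched.next());
        }
    }
}
```

This version still scans every row and skips the misses. A real implementation would presumably want to seek directly to the matching documents instead.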
                
> GSOC 2013: Improve Lucene support in Mahout
> -------------------------------------------
>
>                 Key: MAHOUT-1178
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1178
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Dan Filimon
>              Labels: gsoc2013, mentor
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for the documents, one per line, and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields, that ought to work sensibly as well.
>  Two options are dumping multiple matrices or converting the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  This might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.
