mahout-dev mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAHOUT-7) Lucene indexes should act as matrix factories
Date Tue, 17 Nov 2009 08:18:39 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-7?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778763#action_12778763 ]

Sean Owen commented on MAHOUT-7:
--------------------------------

Same, no activity in almost 2 years, obsolete?

> Lucene indexes should act as matrix factories
> ---------------------------------------------
>
>                 Key: MAHOUT-7
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-7
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>
> It would be highly desirable to be able to extract virtual matrices from Lucene indexes.
> The factory methods that I know of would include:
> a) the factory would accept the name of a single field, and the resulting matrix would
> use document IDs as row labels and terms as column labels.  The values would be the term
> counts in the document (if available), or 1 if the term is in the document but the term
> frequency is not available.  This implies that TermVectors could be viewed as rows of this
> matrix.  Columns could be extracted by boolean retrieval from the index.  Retrieval from
> that field could be considered a form of matrix-vector multiplication where the vector
> encodes a query using the values as term boosts and the result wraps a hit structure as a
> sparse matrix.  Matrix-matrix arithmetic with pairs of this kind of matrix should yield a
> matrix as in (b).
> b) the factory would accept a linear combination of terms and the resulting matrix would
> have rows which are linear combinations of the underlying term-vectors (could this be done
> latently so computation is only on access?  would that help?).  Column access would be a
> form of retrieval (but what would the semantics be?).  Matrix-vector product could again be
> viewed as retrieval, but it would probably be most useful to view the original coefficients
> as boosts for the Lucene scoring mechanism rather than computing a linear combination of
> scores.
> c) the factory would produce a matrix in which the rows are all documents and the columns
> are all terms from all fields, each labeled with field and term name (probably using Lucene
> query syntax).  Rows would be the concatenation of all term vectors; columns would represent
> retrieval on a single term.  Matrix-vector multiplication would be general Lucene retrieval.
> Matrix-matrix operations between Lucene indexes should do something interesting (A' A, for
> instance, might compute term co-occurrence), but that seems pretty hairy to specify.
> Matrix-matrix operations with ordinary matrices on the right might best be considered as
> multiple retrievals using each column of the right-hand matrix as a query.
> d) as with (c), but with only a defined list of fields with the rest of the fields not
> being expressed as columns.
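
To make factory (a) concrete, here is a minimal sketch in plain Python. It does not use Lucene or Mahout at all: the class name FieldMatrix, its methods, and the in-memory postings dictionary are all hypothetical stand-ins for an index over one field. It only illustrates the shape of the idea: rows as term vectors, columns as single-term retrieval, matrix-vector multiplication as boosted retrieval, and A' A as term co-occurrence.

```python
from collections import Counter, defaultdict


class FieldMatrix:
    """Hypothetical toy index over one field, viewed as a sparse doc-by-term matrix."""

    def __init__(self, docs):
        # docs maps a document id to the list of tokens in the chosen field.
        # Each row is that document's term vector (term -> count).
        self.rows = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
        # Postings lists (term -> {doc_id: count}) play the role of columns.
        self.postings = defaultdict(dict)
        for doc_id, counts in self.rows.items():
            for term, count in counts.items():
                self.postings[term][doc_id] = count

    def row(self, doc_id):
        """Row access: the term vector of one document."""
        return dict(self.rows[doc_id])

    def column(self, term):
        """Column access: single-term retrieval over the field."""
        return dict(self.postings.get(term, {}))

    def times(self, query):
        """Matrix-vector product; query maps term -> boost, result maps doc_id -> score."""
        scores = defaultdict(float)
        for term, boost in query.items():
            for doc_id, count in self.postings.get(term, {}).items():
                scores[doc_id] += boost * count
        return dict(scores)

    def cooccurrence(self):
        """A' A: entry (t, u) sums count(t) * count(u) over all documents."""
        counts = defaultdict(int)
        for row in self.rows.values():
            for t, a in row.items():
                for u, b in row.items():
                    counts[(t, u)] += a * b
        return dict(counts)


docs = {
    1: ["apache", "mahout", "mahout"],
    2: ["apache", "lucene"],
    3: ["lucene", "index", "lucene"],
}
matrix = FieldMatrix(docs)
print(matrix.column("lucene"))                       # {2: 1, 3: 2}
print(matrix.times({"lucene": 2.0, "apache": 1.0}))  # doc 1 -> 1.0, doc 2 -> 3.0, doc 3 -> 4.0
```

The scores here are raw boost-times-count sums, not Lucene's scoring; the issue text itself suggests that a real implementation would more usefully pass the coefficients to Lucene as boosts.
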
> Issues with this API mostly center on how efficiently expressions involving indexes can
> be handled (should operations be eager or lazy?) and on whether the use of multiplication
> as retrieval is too controversial.  An alternative might be to add a query operation to the
> API just for indexes.
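
The eager-versus-lazy question can be sketched in a few lines of Python. This assumes one possible reading of (b), namely that the rows are weighted sums of per-field term vectors; the names LazyCombination and eager_combination, and the dictionary-backed "fields", are hypothetical and not part of any Mahout or Lucene API. The lazy variant computes a row only when it is accessed, while the eager variant materializes every row up front.

```python
class LazyCombination:
    """Hypothetical lazy matrix whose rows are weighted sums of underlying term vectors."""

    def __init__(self, parts):
        # parts is a list of (weight, lookup) pairs, where lookup(doc_id)
        # returns that document's term vector as a {term: value} dict.
        self.parts = parts
        self.evaluations = 0  # counts how many rows were actually computed

    def row(self, doc_id):
        # Work happens here, on access, not at construction time.
        self.evaluations += 1
        combined = {}
        for weight, lookup in self.parts:
            for term, value in lookup(doc_id).items():
                combined[term] = combined.get(term, 0.0) + weight * value
        return combined


def eager_combination(parts, doc_ids):
    """Eager counterpart: compute and store every row immediately."""
    lazy = LazyCombination(parts)
    return {doc_id: lazy.row(doc_id) for doc_id in doc_ids}


# Two toy "fields", each mapping doc_id -> term vector.
title = {1: {"mahout": 1}, 2: {"lucene": 1}}
body = {1: {"mahout": 2, "apache": 1}, 2: {"lucene": 1}}
parts = [
    (2.0, lambda doc_id: title.get(doc_id, {})),
    (1.0, lambda doc_id: body.get(doc_id, {})),
]

lazy = LazyCombination(parts)
print(lazy.row(1))        # {'mahout': 4.0, 'apache': 1.0}
print(lazy.evaluations)   # 1
```

The lazy form does no work for rows that are never read, at the cost of recomputing a row on every access unless a cache is added; which trade-off suits real index-backed matrices is exactly the open question raised in the issue.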

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

