mahout-dev mailing list archives

From "Suneel Marthi (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAHOUT-1292) lucene2seq should validate the 'id' field
Date Sat, 16 Nov 2013 19:13:24 GMT

     [ https://issues.apache.org/jira/browse/MAHOUT-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suneel Marthi updated MAHOUT-1292:
----------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

> lucene2seq should validate the 'id' field
> -----------------------------------------
>
>                 Key: MAHOUT-1292
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1292
>             Project: Mahout
>          Issue Type: Bug
>          Components: Integration
>    Affects Versions: 0.8
>            Reporter: Liz Merkhofer
>            Assignee: Suneel Marthi
>              Labels: cvb, lucene, solr
>             Fix For: 0.9
>
>         Attachments: MAHOUT-1292.patch
>
>
> Lucene2seq creates only one sequencefile, rather than a file for each document in the
> index.
> Running lucene2seq on my Solr (4.3) index produces a file with a header and, it seems,
> the field I specified from the index, concatenated for all the documents. After running this
> through seq2sparse and rowid (to prepare for cvb), the resulting matrix has only one row,
> though it should create one row per document.
> This issue prevents, at the least, data from a Lucene index from being easily used as input
> for cvb. lucene.vector is also currently inadequate: the keys of its sequence files are
> LongWritable, and rowid converts only Text keys to IntWritable, as is necessary for the
> keys in cvb.
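
The kind of check the issue title asks for can be pictured with a short Python sketch. This is not the attached patch and not Mahout code: the function name, the dict-based documents, and the field names are hypothetical, and it assumes the intended validation is that every document carries a non-empty value in the chosen id field before it is emitted as a key/value pair.

```python
from typing import Dict, Iterable, Iterator, Tuple


def keyed_documents(
    docs: Iterable[Dict[str, str]], id_field: str, text_field: str
) -> Iterator[Tuple[str, str]]:
    """Yield one (key, text) pair per document, failing fast on a bad id field.

    Hypothetical illustration only; real lucene2seq reads a Lucene index
    and writes Hadoop sequence files.
    """
    for position, doc in enumerate(docs):
        key = doc.get(id_field)
        # Reject documents whose id field is absent or blank instead of
        # silently emitting records that cannot be told apart downstream.
        if key is None or str(key).strip() == "":
            raise ValueError(
                f"document at position {position} has no usable "
                f"'{id_field}' field"
            )
        yield str(key), str(doc.get(text_field, ""))


if __name__ == "__main__":
    sample = [
        {"id": "doc-1", "body": "first document"},
        {"id": "doc-2", "body": "second document"},
    ]
    for key, text in keyed_documents(sample, "id", "body"):
        print(key, text)
```

Failing early on a missing id, rather than continuing, is a design choice made here for illustration; whether the actual fix raises an error or reports the problem another way is not stated in this message.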



--
This message was sent by Atlassian JIRA
(v6.1#6144)
