mahout-dev mailing list archives

From "Lance Norskog (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility
Date Mon, 13 Feb 2012 04:30:59 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206670#comment-13206670 ]

Lance Norskog commented on MAHOUT-944:
--------------------------------------

This is a Lucene query. It's already sorted! So the sequential algorithm should already do
this. It would be helpful if the sequential version could split its output across multiple
files. That would let the subsequent map/reduce jobs run more efficiently.

Text search applications (Solr, Elasticsearch, IndexTank, Katta) support splitting large indexes
into "shards" across multiple computers. If this were a map/reduce job, it could handle index
shards from multiple computers and set target disk file sizes.
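
To make the "split the output across multiple files" idea concrete, here is a minimal sketch and not the patch's actual code. RollingPartWriter and its targetBytes parameter are hypothetical names, and plain text files stand in for Hadoop SequenceFiles so the example has no dependencies. It starts a new part file once the current one has reached a target size.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Mahout or the MAHOUT-944 patch.
// Plain text files stand in for Hadoop SequenceFiles (an assumption made
// to keep the sketch dependency-free).
class RollingPartWriter implements AutoCloseable {
    private final Path outputDir;
    private final long targetBytes;   // roll once the current part has reached this size
    private BufferedWriter current;   // writer for the part file currently being filled
    private long bytesInCurrent;      // UTF-8 bytes written to the current part
    private int partsOpened;          // number of part files created so far

    RollingPartWriter(Path outputDir, long targetBytes) throws IOException {
        if (targetBytes <= 0) {
            throw new IllegalArgumentException("targetBytes must be positive");
        }
        this.outputDir = Files.createDirectories(outputDir);
        this.targetBytes = targetBytes;
    }

    // Appends one key/value record, opening a new part file first if there is
    // no open part or the open part has already reached the target size.
    void write(String key, String value) throws IOException {
        if (current == null || bytesInCurrent >= targetBytes) {
            roll();
        }
        String line = key + "\t" + value + "\n";
        current.write(line);
        bytesInCurrent += line.getBytes(StandardCharsets.UTF_8).length;
    }

    // Closes the current part (if any) and opens part-00000, part-00001, ...
    private void roll() throws IOException {
        if (current != null) {
            current.close();
        }
        Path part = outputDir.resolve(String.format("part-%05d", partsOpened++));
        current = Files.newBufferedWriter(part, StandardCharsets.UTF_8);
        bytesInCurrent = 0;
    }

    int partsOpened() {
        return partsOpened;
    }

    @Override
    public void close() throws IOException {
        if (current != null) {
            current.close();
            current = null;
        }
    }
}
```

The size check happens before each write, so a part can overshoot the target by at most one record. In a real version the text writer would presumably be swapped for a SequenceFile writer, and the directory of part files handed to the later jobs as a single input path.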




                
> LuceneIndexToSequenceFiles (lucene2seq) utility
> -----------------------------------------------
>
>                 Key: MAHOUT-944
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-944
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Integration
>    Affects Versions: 0.5
>            Reporter: Frank Scholten
>            Priority: Minor
>             Fix For: 0.7
>
>         Attachments: MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch
>
>
> Here is a lucene2seq tool I used in a project. It creates sequence files based on the
> stored fields of a Lucene index.
> The output from this tool can then be fed into seq2sparse, and from there you can do text
> clustering.
> Comes with Java bean configuration.
> Let me know what you think. Some CLI code can be added later on. I used this for a small-scale
> project of roughly 100,000 docs. Is an MR version useful or is that overkill?
> See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and review comments
> from Simon Willnauer (Thanks Simon!)
> or the attached patch.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
