Mailing-List: contact dev-help@mahout.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@mahout.apache.org
Date: Sun, 18 May 2014 21:53:04 +0000 (UTC)
From: "Richard Scharrer (JIRA)" <jira@apache.org>
To: dev@mahout.apache.org
Message-ID: <JIRA.12712765.1399442123303.804.1400449984264@arcas>
In-Reply-To: <JIRA.12712765.1399442123303@arcas>
References: <JIRA.12712765.1399442123303@arcas>
Subject: [jira] [Commented] (MAHOUT-1549) Extracting tfidf-vectors by key
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/MAHOUT-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001224#comment-14001224 ] 

Richard Scharrer commented on MAHOUT-1549:
------------------------------------------

Yes! https://github.com/kevinweil/elephant-bird/issues/389 has the solution.

> Extracting tfidf-vectors by key
> -------------------------------
>
>                 Key: MAHOUT-1549
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1549
>             Project: Mahout
>          Issue Type: Question
>          Components: Classification
>    Affects Versions: 0.7, 0.8, 0.9
>            Reporter: Richard Scharrer
>              Labels: documentation, features, newbie
>             Fix For: 0.7, 0.8, 0.9
>
>
> Hi,
> I have about 200,000 tfidf-vectors and I need to extract 500 of them, for which I have the keys. Is there some kind of magical option that lets me take the output of mahout seqdumper and transform it back into a sequencefile that I can use for trainnb/testnb? The tfidf sequencefiles use the Text class for the keys and the VectorWritable class for the values. I tried
> https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/store/SequenceFileStorage.java
> with different settings, but the output always uses the Text class for both key and value, which can't be used in trainnb and testnb.
> I posted this question on:
> http://stackoverflow.com/questions/23502362/extracting-tfidf-vectors-by-key-without-destroying-the-fileformat
> I am asking here because I've seen similar questions on Stack Overflow that were asked last year and still haven't received an answer.
> I really need this information, so if you know anything, please tell me.
> Regards,
> Richard
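
For reference, the selection step described in the question (keep only the records whose key is in a known set, and pass the values through untouched) can be sketched as below. This is only an illustration under stated assumptions, not the solution from the linked elephant-bird issue: the class name KeyFilterSketch, the method selectByKey, and the sample keys are hypothetical, and a plain double[] stands in for the VectorWritable value so the sketch runs without Hadoop or Mahout on the classpath.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: filter key/value records by key without touching the value type.
final class KeyFilterSketch {

    // Returns the records whose key is in wantedKeys, in input order.
    // The value is passed through as-is, so its type is preserved.
    static <V> Map<String, V> selectByKey(Iterable<Map.Entry<String, V>> records,
                                          Set<String> wantedKeys) {
        Map<String, V> selected = new LinkedHashMap<>();
        for (Map.Entry<String, V> record : records) {
            // Assumption: against a real sequencefile, this loop would read with
            // SequenceFile.Reader.next(key, value) and this branch would call
            // SequenceFile.Writer.append(key, value), with a writer created for
            // Text keys and VectorWritable values.
            if (wantedKeys.contains(record.getKey())) {
                selected.put(record.getKey(), record.getValue());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up sample records; double[] stands in for a tfidf vector.
        List<Map.Entry<String, double[]>> records = new ArrayList<>();
        records.add(new AbstractMap.SimpleEntry<>("/docs/a", new double[] {0.1, 0.0}));
        records.add(new AbstractMap.SimpleEntry<>("/docs/b", new double[] {0.0, 0.7}));
        records.add(new AbstractMap.SimpleEntry<>("/docs/c", new double[] {0.3, 0.2}));

        Set<String> wanted = new HashSet<>(Arrays.asList("/docs/a", "/docs/c"));
        Map<String, double[]> selected = selectByKey(records, wanted);
        System.out.println(selected.keySet());
    }
}
```

Because the values are never converted to text and back, the output keeps the same key and value classes as the input, which is the property trainnb and testnb depend on.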

