Reply-To: dev@mahout.apache.org
Date: Sun, 18 May 2014 21:53:04 +0000 (UTC)
From: "Richard Scharrer (JIRA)"
To: dev@mahout.apache.org
Subject: [jira] [Commented] (MAHOUT-1549) Extracting tfidf-vectors by key

    [ https://issues.apache.org/jira/browse/MAHOUT-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001224#comment-14001224 ]

Richard Scharrer commented on MAHOUT-1549:
------------------------------------------

Yes! https://github.com/kevinweil/elephant-bird/issues/389 has the solution.

> Extracting tfidf-vectors by key
> -------------------------------
>
>                 Key: MAHOUT-1549
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1549
>             Project: Mahout
>          Issue Type: Question
>          Components: Classification
>    Affects Versions: 0.7, 0.8, 0.9
>            Reporter: Richard Scharrer
>              Labels: documentation, features, newbie
>             Fix For: 0.7, 0.8, 0.9
>
>
> Hi,
> I have about 200,000 tfidf-vectors and I need to extract 500 of them, for which I have the keys. Is there some kind of magical option that lets me take the output of mahout seqdumper and transform it back into a SequenceFile that I can use for trainnb/testnb? The tfidf SequenceFiles use the Text class for the keys and the VectorWritable class for the values. I tried
> https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/store/SequenceFileStorage.java
> with different settings, but the output always uses the Text class for both key and value, which can't be used in trainnb and testnb.
> I posted this question on:
> http://stackoverflow.com/questions/23502362/extracting-tfidf-vectors-by-key-without-destroying-the-fileformat
> I am asking here because I've seen similar questions on Stack Overflow that were asked last year and still haven't received an answer.
> I really need this information, so if you know anything, please tell me.
> Regards,
> Richard

--
This message was sent by Atlassian JIRA
(v6.2#6252)