From: Grant Ingersoll
To: user@mahout.apache.org
Subject: Re: vectors from pre-tokenized terms
Date: Wed, 14 Sep 2011 11:18:19 -0400

I think createDictionaryChunks is the first thing that runs inside of createTermFrequencyVectors. It takes the input from DocumentProcessor.tokenizeDocuments, which outputs Text, StringTuple. So, I would suspect you would need Text, StringTuple as inputs. See SequenceFileTokenizerMapper.java.

On Sep 13, 2011, at 10:52 AM, Jack Tanner wrote:

> Ping? Please help if you can.
> Maybe I was unclear the first time; let me try again.
>
> I have input like this:
>
> term_id,doc_id
> 55,1
> 61,1
> 29,2
> 98,3
>
> I want to do clustering, so (I think) I need to transform that into a bunch of SequenceFile objects.
>
> key:1,value:<55,61>
> key:2,value:<29>
> key:3,value:<98>
>
> What's the format of the SequenceFile value? IntTuple? IntegerTuple? Something else?
>
> The next step would be to use DictionaryVectorizer.createTermFrequencyVectors and TFIDFConverter.processTfIdf, just like in SparseVectorsFromSequenceFiles.
>
> On 9/9/2011 12:17 PM, Jack Tanner wrote:
>> Hi all. I've got some documents described by binary features with
>> integer ids, and I want to read them into sparse Mahout vectors to do
>> tfidf weighting and clustering. I do not want to paste them back
>> together and run a Lucene tokenizer. What's the clean way to do this?
>>
>> I'm thinking that I need to write out SequenceFile objects, with a
>> document id key and a value that's an IntTuple. Is that right?
>> Should I use an IntegerTuple instead? It feels wrong to use either,
>> actually, because these tuples claim to be ordered, but my features are
>> not ordered.
>>
>> I would then use DictionaryVectorizer.createTermFrequencyVectors and
>> TFIDFConverter.processTfIdf, just like in SparseVectorsFromSequenceFiles.
>>
>> Am I on the right track?

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com
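
A minimal sketch of the grouping step described in the quoted message, assuming the input arrives as plain "term_id,doc_id" lines. The class and method names (PreTokenizedDocs, groupTermsByDoc) are hypothetical and not part of Mahout. It uses only the Java standard library and keeps the term ids as strings, since the reply above suggests a Text key with a StringTuple value; actually writing the records to a SequenceFile is left out here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Mahout class.
final class PreTokenizedDocs {

    /**
     * Groups "term_id,doc_id" rows by doc_id, preserving first-seen order.
     * Each map entry corresponds to one record: the doc id would become the
     * Text key and the term ids the StringTuple entries (an assumption based
     * on the suggestion in the thread, not verified against Mahout here).
     */
    static Map<String, List<String>> groupTermsByDoc(List<String> rows) {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        for (String row : rows) {
            String line = row.trim();
            // Skip blank lines and the "term_id,doc_id" header row.
            if (line.isEmpty() || line.startsWith("term_id")) {
                continue;
            }
            String[] parts = line.split(",");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected term_id,doc_id but got: " + row);
            }
            String termId = parts[0].trim();
            String docId = parts[1].trim();
            docs.computeIfAbsent(docId, k -> new ArrayList<>()).add(termId);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("term_id,doc_id", "55,1", "61,1", "29,2", "98,3");
        // Print one line per document in the key/value layout from the thread.
        for (Map.Entry<String, List<String>> doc : groupTermsByDoc(rows).entrySet()) {
            System.out.println("key:" + doc.getKey() + ",value:" + doc.getValue());
        }
    }
}
```

Because a term repeated for the same document is kept as a repeated entry rather than deduplicated, a later term-frequency pass would still be able to count occurrences.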