Mailing-List: contact user-help@mahout.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@mahout.apache.org
From: Grant Ingersoll <gsingers@apache.org>
Mime-Version: 1.0 (Apple Message framework v1244.3)
Content-Type: multipart/alternative;
 boundary="Apple-Mail=_FF072E7C-8726-4D74-A15A-D76000581FAC"
Subject: Re: vectors from pre-tokenized terms
Date: Wed, 14 Sep 2011 11:18:19 -0400
In-Reply-To: <BLU0-SMTP143F457D33CA67B6D894B4BCA050@phx.gbl>
To: user@mahout.apache.org
References: <BLU0-SMTP264A0137C86DF1AB5C40F1CA010@phx.gbl>
 <BLU0-SMTP143F457D33CA67B6D894B4BCA050@phx.gbl>
Message-Id: <D6092748-E07E-459B-ABCC-570F0F41139A@apache.org>

--Apple-Mail=_FF072E7C-8726-4D74-A15A-D76000581FAC
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=iso-8859-1

I think createDictionaryChunks is the first thing that runs inside
createTermFrequencyVectors. It takes its input from
DocumentProcessor.tokenizeDocuments, which outputs (Text, StringTuple)
pairs.

So I suspect you need Text keys and StringTuple values as input. See
SequenceFileTokenizerMapper.java.
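
A minimal sketch of the grouping step for the term_id,doc_id input
quoted below, in plain Java with no Mahout or Hadoop dependencies. The
class and method names (PreTokenizedDocs, groupByDoc) are made up for
illustration. It assumes the term ids can be passed through as string
tokens; writing each entry out as a Text key with a StringTuple value is
only indicated in a comment, not implemented here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout.
class PreTokenizedDocs {

    // Groups "term_id,doc_id" rows into doc_id -> list of term ids (as strings),
    // keeping documents in the order they first appear.
    static Map<String, List<String>> groupByDoc(List<String> rows) {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        for (String row : rows) {
            String line = row.trim();
            // Skip blank lines and the header row.
            if (line.isEmpty() || line.startsWith("term_id")) {
                continue;
            }
            String[] parts = line.split(",");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected term_id,doc_id but got: " + row);
            }
            String termId = parts[0].trim();
            String docId = parts[1].trim();
            docs.computeIfAbsent(docId, k -> new ArrayList<>()).add(termId);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("term_id,doc_id", "55,1", "61,1", "29,2", "98,3");
        for (Map.Entry<String, List<String>> e : groupByDoc(rows).entrySet()) {
            // Assumption: each entry would become one SequenceFile record, with the
            // doc id as a Text key and the term ids collected into a StringTuple.
            System.out.println("key:" + e.getKey() + ",value:" + e.getValue());
        }
    }
}
```

Each map entry corresponds to one document, so the write step would
emit one record per entry.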


On Sep 13, 2011, at 10:52 AM, Jack Tanner wrote:

> Ping? Please help if you can. Maybe I was unclear the first time; let
> me try again.
>
> I have input like this:
>
> term_id,doc_id
> 55,1
> 61,1
> 29,2
> 98,3
>
> I want to do clustering, so (I think) I need to transform that into a
> bunch of SequenceFile objects.
>
> key:1,value:<55,61>
> key:2,value:<29>
> key:3,value:<98>
>
> What's the format of the SequenceFile value? IntTuple? IntegerTuple?
> Something else?
>
> The next step would be to use
> DictionaryVectorizer.createTermFrequencyVectors and
> TFIDFConverter.processTfIdf, just like in
> SparseVectorsFromSequenceFiles.
>
> On 9/9/2011 12:17 PM, Jack Tanner wrote:
>> Hi all. I've got some documents described by binary features with
>> integer ids, and I want to read them into sparse Mahout vectors to do
>> TF-IDF weighting and clustering. I do not want to paste them back
>> together and run a Lucene tokenizer. What's the clean way to do this?
>>
>> I'm thinking that I need to write out SequenceFile objects, with a
>> document id key and a value that's an IntTuple. Is that right?
>> Should I use an IntegerTuple instead? It feels wrong to use either,
>> actually, because these tuples claim to be ordered, but my features
>> are not ordered.
>>
>> I would then use DictionaryVectorizer.createTermFrequencyVectors and
>> TFIDFConverter.processTfIdf, just like in
>> SparseVectorsFromSequenceFiles.
>>
>> Am I on the right track?
>>
>>
>

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com


--Apple-Mail=_FF072E7C-8726-4D74-A15A-D76000581FAC--