mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Email and Collab. Filtering
Date Wed, 31 Aug 2011 15:21:32 GMT

On Aug 22, 2011, at 12:14 PM, Sean Owen wrote:

> Here are two ideas:
> 
> Recommend threads to users.
> Users are people, items are threads. This might suggest discussions
> you should be a party to, or may be of interest since it concerns
> people you often share a thread with. I think it has slightly more
> potential to be useful, but, probably a non-starter in practice as
> it's not generally true that you're welcome to see a thread you
> weren't copied on.

This is the one I am doing.  But it brings up an interesting question about how best to
convert the input to ids.

To do this, I need to convert the strings (message id, from) to ids.  Thus, I more or less
modeled the code after what DictionaryVectorizer does.  Creating the dictionaries is pretty
straightforward and we likely now have an opportunity to make a general purpose tool that
does it in an M/R way.
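
As a rough illustration of the dictionary step, here is a minimal single-machine sketch in Python. It is not Mahout code and not an M/R job: `build_dictionary` and the sample `records` are hypothetical names and data, and the only assumption is that each distinct string gets the next unused integer id.

```python
def build_dictionary(values):
    """Assign a dense integer id to each distinct string, in first-seen order."""
    dictionary = {}
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
    return dictionary


# Hypothetical (message id, from) pairs standing in for the parsed mail archive.
records = [
    ("<msg-1@example.org>", "alice@example.org"),
    ("<msg-2@example.org>", "bob@example.org"),
    ("<msg-1@example.org>", "bob@example.org"),
]

# One dictionary per field, as described above.
message_ids = build_dictionary(msg for msg, _ in records)
from_ids = build_dictionary(sender for _, sender in records)
```

A distributed version would have to coordinate id assignment across tasks, which this sketch does not attempt.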

Digging in a bit more, I am now working on doing the actual matrix creation.  In my case,
I have two dictionaries:  message ids and from emails.  In DictionaryVectorizer (used to take
text to sparse vectors, which is comparable to what I need to do), it creates the matrix by
running:

for each dictionary chunk
	for each piece of text  //i.e. the input sequence file, handled by Hadoop
		create the  (partial) vector
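
The loop structure above can be sketched in memory as follows. This is an illustration of the chunk-at-a-time idea only, not the actual DictionaryVectorizer implementation: `partial_vectors`, `vectorize`, and the sample data are hypothetical, and each call to `partial_vectors` stands in for what would be a full Hadoop pass over the input.

```python
from collections import Counter, defaultdict


def partial_vectors(dictionary_chunk, documents):
    """One pass over the input, looking up terms in a single dictionary chunk."""
    vectors = {}
    for doc_id, tokens in documents.items():
        counts = Counter(
            dictionary_chunk[t] for t in tokens if t in dictionary_chunk
        )
        vectors[doc_id] = dict(counts)
    return vectors


def vectorize(dictionary_chunks, documents):
    """Run one pass per chunk and merge the partial vectors per document."""
    merged = defaultdict(dict)
    for chunk in dictionary_chunks:
        for doc_id, partial in partial_vectors(chunk, documents).items():
            # Chunks hold disjoint term ids, so the partials never collide.
            merged[doc_id].update(partial)
    return dict(merged)


# Hypothetical dictionary split into two chunks, and two tiny documents.
chunks = [{"mahout": 0, "hadoop": 1}, {"vector": 2}]
docs = {"d1": ["mahout", "vector", "mahout"], "d2": ["hadoop"]}
vectors = vectorize(chunks, docs)
```

Only one chunk is consulted at a time, at the cost of one full pass over the input per chunk.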

My initial thoughts for my case are to do:

for each from id dictionary chunk
	for each message id dictionary chunk
		for each piece of text //i.e. the input seq. file, handled by Hadoop
			create the vector

The output would be, for each "from", a list of the message ids that the person interacted with
(initiated or replied to). It's likely that some of this is moot because there will only ever
be one or two chunks, especially for the "froms".
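
A minimal in-memory sketch of the proposed two-dictionary loop, assuming the input is a list of (message id, from) string pairs and that both dictionaries are already split into chunks. The function name `interactions` and the sample data are hypothetical; in practice the innermost loop would be a Hadoop pass over the sequence file.

```python
from collections import defaultdict


def interactions(from_chunks, message_chunks, records):
    """For each "from" id, collect the message ids that sender appears on."""
    result = defaultdict(set)
    for from_chunk in from_chunks:
        for message_chunk in message_chunks:
            # Full pass over the input for every pair of chunks.
            for message, sender in records:
                if sender in from_chunk and message in message_chunk:
                    result[from_chunk[sender]].add(message_chunk[message])
    return {from_id: sorted(ids) for from_id, ids in result.items()}


# Hypothetical chunked dictionaries and input records.
from_chunks = [{"alice": 0}, {"bob": 1}]
message_chunks = [{"m1": 0}, {"m2": 1}]
records = [("m1", "alice"), ("m2", "bob"), ("m1", "bob")]

by_sender = interactions(from_chunks, message_chunks, records)
```

With F "from" chunks and M message id chunks, the input is scanned F x M times, and most records match nothing in a given pair of chunks.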

As you can no doubt see, that's a lot of loops, and on top of that the hit ratio is going to be
pretty sparse.  I believe the reason we do this in DictionaryVectorizer is so that we
can use a predictable amount of memory in dealing with the dictionaries.
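
To make the memory argument concrete, here is a small sketch of splitting a dictionary into bounded chunks. Capping by entry count (`max_entries`) is a simplifying assumption standing in for whatever memory limit a real job would use, and `chunk_dictionary` is a hypothetical helper, not a Mahout API.

```python
from itertools import islice


def chunk_dictionary(dictionary, max_entries):
    """Yield successive sub-dictionaries holding at most max_entries items."""
    items = iter(dictionary.items())
    while True:
        chunk = dict(islice(items, max_entries))
        if not chunk:
            return
        yield chunk


# Hypothetical dictionary of three senders, at most two entries per chunk.
from_ids = {"alice": 0, "bob": 1, "carol": 2}
from_chunks = list(chunk_dictionary(from_ids, 2))
```

Each pass then needs to hold only one chunk, so peak memory depends on the cap rather than on the total dictionary size.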

Is there a better way of doing this?  

-Grant