mahout-user mailing list archives

From Grant Ingersoll <>
Subject Re: Mail thread detection [was Email and Collab. Filtering]
Date Thu, 25 Aug 2011 12:40:33 GMT
I think on Lucid's mail site we use a combination of message id, subject, and a few other heuristics.
The whole problem gets even more fun when you consider that people can essentially
reopen a thread at any point in the future (even years later).

Ironically, this very thread will likely cause problems, since it has the same message id
even though the subject line was partially changed.
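
A minimal sketch of the "message id, subject and a few other heuristics" idea, for illustration only. It is not Lucid's implementation: the message dictionaries, the field names (`id`, `in_reply_to`, `references`, `subject`) and the helpers `normalize_subject` and `assign_threads` are all assumptions. Messages are assumed to arrive in date order. Header links are tried first; the subject is used only as a fallback for replies whose headers were lost.

```python
import re


def normalize_subject(subject):
    """Strip "[was ...]" markers and leading Re:/Fwd: prefixes (hypothetical rule set)."""
    s = re.sub(r"\[was:?[^\]]*\]", "", subject, flags=re.IGNORECASE)
    s = re.sub(r"^\s*((re|fwd?)\s*:\s*)+", "", s, flags=re.IGNORECASE)
    return " ".join(s.split()).lower()


def is_reply(subject):
    """True when the subject starts with a reply prefix."""
    return re.match(r"\s*re\s*:", subject, flags=re.IGNORECASE) is not None


def assign_threads(messages):
    """Map each message id to a thread id (the id of the thread's first message).

    `messages` is assumed to be a date-ordered list of dicts.
    """
    thread_of = {}   # message id -> thread id
    by_subject = {}  # normalized subject -> first thread seen with that subject
    for m in messages:
        # 1. Follow explicit header links when they point at a known message.
        parents = list(m.get("references", []))
        if m.get("in_reply_to"):
            parents.append(m["in_reply_to"])
        tid = next((thread_of[p] for p in parents if p in thread_of), None)

        # 2. Fall back to the subject, but only for replies, so that a new
        #    message reusing an old subject is not merged into the old thread.
        subj = normalize_subject(m["subject"])
        if tid is None and is_reply(m["subject"]):
            tid = by_subject.get(subj)

        # 3. Otherwise the message starts a new thread.
        if tid is None:
            tid = m["id"]

        thread_of[m["id"]] = tid
        by_subject.setdefault(subj, tid)
    return thread_of
```

Restricting the subject fallback to replies is a design choice, not a fix: it still cannot separate hijacked threads, cross-posted branches, or a thread reopened years later under a reused subject.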

On Aug 24, 2011, at 3:15 PM, Lukáš Vlček wrote:

> It is, but since other developers have obviously already been dealing with
> this mess (especially thread identification in mail lists), I was hoping that
> there would be some knowledge gathered ... maybe it would be worth the
> effort to put something together, because this is an important piece of
> knowledge that can influence search results, but people (users of search
> interfaces) do not usually think about it in detail.
> On Wed, Aug 24, 2011 at 11:57 PM, Ted Dunning <> wrote:
>> The short conclusion is "people and language are involved, therefore it is
>> a bit of a mess".
>> On Wed, Aug 24, 2011 at 2:49 PM, Lukáš Vlček <>
>> wrote:
>>> Yes, it is not always reliable (especially if people reply to the email from
>>> desktop email clients and not from the web forum page). But there are more
>>> complex problems than this. The two most common problems are thread
>>> hijacking and something I call a non-linear mail thread, that is, a case
>>> where the email is also sent to a different mail list. For example, the
>>> thread starts in Lucene but at some point someone adds the Solr mail
>>> list to the To or Cc as well. From this point the thread has two parallel
>>> branches (and this is still the simple case).
>>> Experimenting with the mail Subject text is another option, but again one would
>>> not believe what kinds of cases and exceptions can be found until one tries
>>> it.
>>> I have seen mails with the same subject, in the same mail list, in about
>>> the same time window, involving the same author and the same reply-from
>>> person, and they were not in the same thread.
>>> IMHO there is no perfect solution to this problem. Doing a
>>> lot of experiments is probably a good way to catch the most common
>>> exceptions, but in general it is very hard to avoid these problems. And once
>>> you (as a user of a search interface) experience these issues, it can be
>>> quite challenging to build trust that things like thread grouping or
>>> recommendation work well enough.
>>> On Wed, Aug 24, 2011 at 11:15 PM, Ted Dunning <>
>>> wrote:
>>>> In the olden days, it was possible to thread together message ids in email
>>>> threads.
>>>> In the modern world of many mailing list portals that don't really do email
>>>> in the official ways, this is more difficult than it should be.
>>>> Have you tried and failed with message ids?
>>>> On Wed, Aug 24, 2011 at 1:06 PM, Lukáš Vlček <>
>>>> wrote:
>>>>> Hi,
>>>>> I would love to hear more about how exactly you detect (or define) threads
>>>>> for emails (for example for the Lucene or Solr public mail lists).
>>>>> As far as I can tell this is quite a complex problem, and based on my
>>>>> experience with many web search tools for mail lists it is still not
>>>>> solved. For thread-based recommendations, important information can be
>>>>> missed if the thread is not detected correctly.
>>>>> If this has already been solved, then please do not hesitate to point me to
>>>>> any references.
>>>>> Regards,
>>>>> Lukas
>>>>> On Mon, Aug 22, 2011 at 4:48 PM, Grant Ingersoll <> wrote:
>>>>>> I'm working on an example (well, examples) of using Mahout with the ASF
>>>>>> Public Data Set up on Amazon (
>>>>>> and I wanted to show how
>>>>>> to use the 3 "C's" (collab filtering, clustering, classification) with the
>>>>>> data set.  Clustering and classification are pretty straightforward, but
>>>>>> I'm wondering about the setup around collaborative filtering.
>>>>>> The motivation for recommendations is pretty straightforward: provide
>>>>>> people recs on emails that they might find useful based on what other people
>>>>>> have interacted with.  The tricky part is I am not totally sure of a valid
>>>>>> setup of the problem.  My current thinking is that I build up the rec.
>>>>>> matrix based on whether someone has interacted with (initiated/replied to) a
>>>>>> thread or not.  Thus, the columns are the thread ids and the rows are the
>>>>>> users.  Each cell contains the count of the number of times user X has
>>>>>> interacted with thread Y.  This feels to me like a stand-in for that
>>>>>> user's preference, in that if they are replying multiple times, they have an
>>>>>> interest in that topic.  I have no idea if this will be effective or not,
>>>>>> but it seems like it could be interesting.  Does it sound reasonable?  I
>>>>>> worry that even in a really large data set like the one above it will simply
>>>>>> be too sparse.
>>>>>> Is there a better way to think about this from a strict collaborative
>>>>>> filtering context?  In other words, I know I could do content-based
>>>>>> recommendations but that is not what I am after here.
>>>>>> -Grant
>>>>>> --------------------------------------------
>>>>>> Grant Ingersoll
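
The user-by-thread count matrix described in the quoted message could be sketched roughly as follows. This is only an illustration under assumptions: the input is taken to be one `(user, thread_id)` pair per message sent, and the helper names `build_preferences`, `to_triples` and `density` are hypothetical, not part of Mahout. The `density` helper gives a quick check on the sparsity worry raised above.

```python
from collections import Counter, defaultdict


def build_preferences(interactions):
    """Count how many times each user posted in each thread.

    `interactions` is assumed to be an iterable of (user, thread_id) pairs,
    one pair per message the user sent (initial post or reply).
    Rows are users, columns are threads; only non-zero cells are stored.
    """
    counts = defaultdict(Counter)
    for user, thread_id in interactions:
        counts[user][thread_id] += 1
    return counts


def to_triples(counts):
    """Flatten the sparse matrix into sorted (user, thread_id, count) triples."""
    return sorted(
        (user, thread_id, n)
        for user, row in counts.items()
        for thread_id, n in row.items()
    )


def density(counts):
    """Fraction of user/thread cells that are non-zero."""
    users = len(counts)
    threads = len({t for row in counts.values() for t in row})
    filled = sum(len(row) for row in counts.values())
    return filled / (users * threads) if users and threads else 0.0
```

The triples follow the common user, item, preference layout; mapping user names and thread ids to numeric ids for a particular recommender is left out of this sketch.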

Grant Ingersoll
