mahout-user mailing list archives

From Lukáš Vlček <lukas.vl...@gmail.com>
Subject Re: Mail thread detection [was Email and Collab. Filtering]
Date Thu, 25 Aug 2011 12:47:20 GMT
On Thu, Aug 25, 2011 at 2:40 PM, Grant Ingersoll <gsingers@apache.org> wrote:

> I think on Lucid's mail site, we use a combination of message id, subject
> and a few other heuristics.  The whole problem gets even more fun when you
> think about the fact that people can essentially reopen a thread at any
> point in the future (even years later).
>
> Ironically, this very thread will likely cause problems since it has the
> same message id, even though the subject line was partially changed.
>

Exactly, done on purpose :-)
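
For illustration, a minimal sketch of the "message id plus subject" idea. It is an assumption about how such a heuristic could look, not a description of what Lucid's mail site does. The function names are hypothetical, and it assumes every message carries a Message-ID header and that messages arrive oldest first. It links a reply to its parent through In-Reply-To/References and falls back to the normalized subject only when no known parent is found.

```python
import re
from email import message_from_string

def normalize_subject(subject):
    """Strip leading 'Re:' / 'Fwd:' prefixes and lowercase the rest."""
    stripped = re.sub(r"^\s*((re|fwd?)\s*:\s*)+", "", subject or "",
                      flags=re.IGNORECASE)
    return stripped.strip().lower()

def assign_threads(messages):
    """Map each Message-ID to the Message-ID of its thread root.

    messages: iterable of email.message.Message objects, oldest first.
    Assumes every message has a Message-ID header.
    """
    root_of = {}          # Message-ID -> root Message-ID
    root_by_subject = {}  # normalized subject -> root Message-ID
    for msg in messages:
        msg_id = msg["Message-ID"].strip()
        # Parent candidates: In-Reply-To first, then References, newest first.
        parents = (msg.get("In-Reply-To", "").split()
                   + msg.get("References", "").split()[::-1])
        root = next((root_of[p] for p in parents if p in root_of), None)
        subject = normalize_subject(msg.get("Subject"))
        if root is None:
            # Fallback heuristic: same normalized subject, else a new thread.
            root = root_by_subject.get(subject, msg_id)
        root_by_subject.setdefault(subject, root)
        root_of[msg_id] = root
    return root_of

raw = [
    "Message-ID: <a@x>\nSubject: Email and Collab. Filtering\n\n",
    "Message-ID: <b@x>\nIn-Reply-To: <a@x>\n"
    "Subject: Re: Mail thread detection [was Email and Collab. Filtering]\n\n",
    "Message-ID: <c@x>\nSubject: Re: Email and Collab. Filtering\n\n",
]
threads = assign_threads(message_from_string(r) for r in raw)
# All three messages map to the root '<a@x>'.
```

The second message has a changed subject but is still attached through its In-Reply-To header, which is the situation this very thread creates. The third has no linking headers and is attached by subject alone, which is exactly the kind of guess that hijacked or same-subject threads can get wrong.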


>
> On Aug 24, 2011, at 3:15 PM, Lukáš Vlček wrote:
>
> > It is, but since other developers have obviously already been dealing with
> > this mess (especially thread identification in mail lists) I was hoping that
> > there would be some knowledge gathered ... maybe it would be worth the
> > effort to put something together, because this is an important piece of
> > knowledge that can influence search results, but people (users of search
> > interfaces) do not usually think about it in detail.
> >
> > On Wed, Aug 24, 2011 at 11:57 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >
> >> The short conclusion is "people and language are involved, therefore it is
> >> a bit of a mess".
> >>
> >>
> >>
> >> On Wed, Aug 24, 2011 at 2:49 PM, Lukáš Vlček <lukas.vlcek@gmail.com>
> >> wrote:
> >>
> >>> Yes, it is not always reliable (especially if people reply to the email from
> >>> desktop email clients and not from the web forum page). But there are more
> >>> complex problems than this. The two most common are thread hijacking and
> >>> something I call a non-linear mail thread, that is, a case where the email
> >>> is also sent to a different mail list. For example, the thread starts in
> >>> Lucene but at some point someone adds the Solr mail list to the To or Cc
> >>> as well. From this point the thread has two parallel branches (and this is
> >>> still the simple case).
> >>>
> >>> Experimenting with the mail Subject text is another option, but again one
> >>> would not believe what kinds of cases and exceptions can be found until
> >>> one tries it. I have seen mails with the same subject, in the same mail
> >>> list, in about the same time window, involving the same author and the
> >>> same reply-from person, and they were not in the same thread.
> >>>
> >>> IMHO there is no perfect solution to this problem. Doing a lot of
> >>> experiments is probably a good way to catch the most common exceptions,
> >>> but in general it is very hard to avoid these problems. And once you (as
> >>> a user of a search interface) experience these issues, it can be quite
> >>> challenging to build trust that things like thread grouping or
> >>> recommendation work well enough.
> >>>
> >>> On Wed, Aug 24, 2011 at 11:15 PM, Ted Dunning <ted.dunning@gmail.com>
> >>> wrote:
> >>>
> >>>> In the olden days, it was possible to thread together message id's in email
> >>>> threads.
> >>>>
> >>>> In the modern world of many mailing list portals that don't really do email
> >>>> in the official ways, this is more difficult than it should be.
> >>>>
> >>>> Have you tried and failed with message id's?
> >>>>
> >>>> On Wed, Aug 24, 2011 at 1:06 PM, Lukáš Vlček <lukas.vlcek@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I would love to hear more about how exactly you detect (or define) threads
> >>>>> for emails (for example for the Lucene or Solr public mail lists).
> >>>>>
> >>>>> As far as I can tell this is quite a complex problem, and based on my
> >>>>> experience with many web search tools for mail lists it is still not
> >>>>> solved. As for thread-based recommendations, important information can be
> >>>>> missed if the thread is not detected correctly.
> >>>>> If this has already been solved then please do not hesitate to point me to
> >>>>> any references.
> >>>>>
> >>>>> Regards,
> >>>>> Lukas
> >>>>>
> >>>>> On Mon, Aug 22, 2011 at 4:48 PM, Grant Ingersoll <gsingers@apache.org> wrote:
> >>>>>
> >>>>>> I'm working on an example (well, examples) of using Mahout with the ASF
> >>>>>> Public Data Set up on Amazon (http://aws.amazon.com/datasets/7791434387204566)
> >>>>>> and I wanted to show how to use the 3 "C's" (collab filtering, clustering,
> >>>>>> classification) with the data set.  Clustering and classification are pretty
> >>>>>> straightforward, but I'm wondering about the setup around collaborative
> >>>>>> filtering.
> >>>>>>
> >>>>>> The motivation for recommendations is pretty straightforward: provide
> >>>>>> people recs on emails that they might find useful based on what other people
> >>>>>> have interacted with.  The tricky part is I am not totally sure on a valid
> >>>>>> setup of the problem.  My current thinking is that I build up the rec.
> >>>>>> matrix based on whether someone has interacted with (initiated/replied) a
> >>>>>> thread or not.  Thus, the columns are the thread ids and the rows are the
> >>>>>> users.  Each cell contains the count of the number of times user X has
> >>>>>> interacted with thread Y.  This feels to me like it is a stand in for that
> >>>>>> user's preference in that if they are replying multiple times, they have an
> >>>>>> interest in that topic.  I have no idea if this will be effective or not,
> >>>>>> but it seems like it could be interesting.  Does it sound reasonable?  I
> >>>>>> worry that even in a really large data set as above it will simply be too
> >>>>>> sparse.
> >>>>>>
> >>>>>> Is there a better way to think about this from a strict collaborative
> >>>>>> filtering context?  In other words, I know I could do content-based
> >>>>>> recommendations but that is not what I am after here.
> >>>>>>
> >>>>>> -Grant
> >>>>>>
> >>>>>> --------------------------------------------
> >>>>>> Grant Ingersoll
> >>>>>> http://www.lucidimagination.com
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
>
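
To make the proposed setup from the quoted original message concrete (rows are users, columns are thread ids, each cell counts how often the user initiated or replied in the thread), here is a rough sketch in plain Python. The function names and the sample data are made up for illustration, and it does not use Mahout itself. It also measures how sparse the resulting matrix is, since sparsity is the stated worry.

```python
from collections import Counter

def build_preferences(interactions):
    """Count how often each user initiated or replied in each thread.

    interactions: iterable of (user, thread_id) pairs, one per message sent.
    Returns a Counter keyed by (user, thread_id). It is a sparse view of the
    users-by-threads matrix: a missing key means a count of zero.
    """
    return Counter(interactions)

def density(prefs):
    """Fraction of cells in the users-by-threads matrix that are non-zero."""
    users = {user for user, _ in prefs}
    threads = {thread for _, thread in prefs}
    if not users or not threads:
        return 0.0
    return len(prefs) / (len(users) * len(threads))

# Hypothetical sample: one pair per message a user sent to a thread.
interactions = [
    ("grant", "t1"), ("lukas", "t1"), ("ted", "t1"),
    ("lukas", "t1"), ("ted", "t2"),
]
prefs = build_preferences(interactions)
print(prefs[("lukas", "t1")])   # 2: lukas wrote twice in thread t1
print(density(prefs))           # 4 non-zero cells out of 3 users x 2 threads
```

Note that the thread ids fed into such a matrix are only as good as the thread detection behind them, so threading errors end up directly in the counts.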
