lucene-solr-user mailing list archives

From Otis Gospodnetic <>
Subject Re: Word Locations & Search Components
Date Tue, 17 Feb 2009 04:35:48 GMT

Wouldn't this be as easy as:
- split email into "paragraphs"
- for each paragraph compute signature (MD5 or something fuzzier, like in SOLR-799)
- for each signature look for other emails with this signature
- when you find an email with an identical signature, you know you've found the "banner"
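
The steps above could be sketched roughly as follows. This is only a minimal illustration, assuming the emails are already available as a dict of id to body text and that paragraphs are separated by blank lines. The function names (`paragraphs`, `signature`, `find_banners`) and the `min_emails` threshold are made up for this example. It uses plain MD5 over normalized text, so it only catches exact repeats, not the fuzzier matching discussed in SOLR-799.

```python
import hashlib
import re
from collections import defaultdict


def paragraphs(body):
    """Split an email body into paragraphs on blank lines (an assumption)."""
    return [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]


def signature(paragraph):
    """MD5 of the lowercased, whitespace-normalized paragraph (exact match only)."""
    normalized = " ".join(paragraph.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def find_banners(emails, min_emails=2):
    """Return {signature: sample paragraph} for paragraphs that occur in at
    least `min_emails` different emails. `min_emails` is a hypothetical knob."""
    seen_in = defaultdict(set)   # signature -> ids of emails containing it
    sample = {}                  # signature -> first paragraph text seen
    for email_id, body in emails.items():
        for p in paragraphs(body):
            sig = signature(p)
            seen_in[sig].add(email_id)
            sample.setdefault(sig, p)
    return {sig: sample[sig] for sig, ids in seen_in.items()
            if len(ids) >= min_emails}
```

With roughly 240,000 emails the dictionaries stay small enough to hold in memory, since only one sample paragraph per distinct signature is kept. In practice the threshold would probably need to be much higher than 2, otherwise ordinary quoted replies would also be flagged as banners.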

I'd do this in a pre-processing phase.  You may have to add special logic for ">" and
other email-quoting characters.  Perhaps you can make use of the assumption that banners always
come at the end of emails.  Perhaps you can also make use of situations where the banner appears
multiple times in a single email (one with lots of back-and-forth replies, for example).
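
As a rough sketch of the quoting and repeated-banner ideas: the snippet below strips leading ">" markers and then reports paragraphs that occur more than once within a single email. The names `strip_quotes` and `repeated_paragraphs` are invented for illustration, and the regex only handles ">"-style quoting, not other quoting conventions.

```python
import re
from collections import Counter

# One or more ">" markers (optionally separated by spaces) at line start.
QUOTE_PREFIX = re.compile(r"^(?:\s*>)+\s?")


def strip_quotes(body):
    """Remove leading '>' quote markers from every line of the body."""
    return "\n".join(QUOTE_PREFIX.sub("", line) for line in body.splitlines())


def repeated_paragraphs(body):
    """Return normalized paragraphs that occur more than once in one email,
    after quote markers are stripped. These are banner candidates."""
    text = strip_quotes(body)
    paras = [" ".join(p.lower().split())
             for p in re.split(r"\n\s*\n", text) if p.strip()]
    counts = Counter(paras)
    return [p for p, n in counts.items() if n > 1]
```

A paragraph repeated inside one long reply chain is only a candidate. It would still make sense to confirm it against the signatures seen across other emails before removing it.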

This is similar to MoreLikeThis at the paragraph level.

Sematext -- Lucene - Solr - Nutch

From: Johnny X <>
Sent: Monday, February 16, 2009 11:05:40 PM
Subject: Re: Word Locations & Search Components

Basically I'm working on the Enron dataset, and I've already de-duplicated
the collection and applied a spam filter. All the e-mails after this have
been parsed to XML and each field (so To, From, Date etc) has been
separated, along with one large field for the remaining e-mail content
(called Content). 

So yes, to answer your question. Bear in mind, though, that this still
represents around 240,000ish files to process.

I have no idea about Solr analyzers/search components, but my theory was
that I'd need an analyzer to remove 'banner-like' content from being indexed
and a search component to identify 'corporate-like' information in the
content of the e-mails.

What is a business logic solution, and how would that work?


zayhen wrote:
> I would go for a business logic solution and not a Solr customization in
> this case, as you need to filter information that you would actually like
> to see in different fields in your index.
> Have you already tried splitting the email into several fields like subject,
> from, to, content, signature, etc.?
> 2009/2/16 Johnny X <>
>> Hi there,
>> I was told before that I'd need to create a custom search component to do
>> what I want to do, but I'm thinking it might actually be a custom
>> analyzer.
>> Basically, I'm indexing e-mail in XML in Solr and searching the 'content'
>> field which is parsed as 'text'.
>> I want to ignore certain elements of the e-mail (i.e. corporate banners),
>> but also identify the actual content of those e-mails including corporate
>> information.
>> To identify the banners I need something a little more developed than a
>> stop word list. I need to evaluate the frequency of certain words around
>> words like 'privileged' and 'corporate' within a word window of about
>> 100ish words to determine whether they're banners and then remove them
>> from being indexed.
>> I need to do the opposite during the same time to identify, in a similar
>> manner, which e-mails include corporate information in their actual
>> content.
>> I suppose if I'm doing this I don't want what's processed to be indexed
>> as what's returned in a search, because then presumably it won't be the
>> full e-mail, so do I need to store some kind of copy field that keeps the
>> full e-mail and is fully indexed to be returned instead?
>> Can what I'm suggesting be done and can anyone direct me to a guide?
>> On another note, is there an easy way to destroy an index...any custom
>> code?
>> Thanks for any help!
>> --
>> View this message in context:
>> Sent from the Solr - User mailing list archive at
> -- 
> Alexander Ramos Jardim
> -----
> RPG da Ilha 
