lucene-solr-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Word Locations & Search Components
Date Mon, 16 Feb 2009 13:20:30 GMT

On Feb 15, 2009, at 10:33 PM, Johnny X wrote:

>
> Hi there,
>
>
> I was told before that I'd need to create a custom search component to do
> what I want to do, but I'm thinking it might actually be a custom analyzer.
>
> Basically, I'm indexing e-mail in XML in Solr and searching the 'content'
> field which is parsed as 'text'.
>
> I want to ignore certain elements of the e-mail (i.e. corporate banners),
> but also identify the actual content of those e-mails including corporate
> information.
>
> To identify the banners I need something a little more developed than a stop
> word list. I need to evaluate the frequency of certain words around words
> like 'privileged' and 'corporate' within a word window of about 100ish words
> to determine whether they're banners and then remove them from being
> indexed.
>
> I need to do the opposite during the same time to identify, in a similar
> manner, which e-mails include corporate information in their actual content.
>
> I suppose if I'm doing this I don't want what's processed to be indexed as
> what's returned in a search, because then presumably it won't be the full
> e-mail, so do I need to store some kind of copy field that keeps the full
> e-mail and is fully indexed to be returned instead?

Storage and indexing are separate things in Lucene/Solr: a stored
Field keeps the original text regardless of what the analyzer does to
the indexed tokens, so there is no need for a copy field for this
particular issue.

>
>
> Can what I'm suggesting be done and can anyone direct me to a guide?

Hmm, this kind of stuff may be better off as part of preprocessing,  
but it could be done as an analyzer, I suppose. How are you  
determining the words to evaluate?  Is it based on collection  
statistics or just within a document?  Or do you just have a list of  
"marker" words that indicate the areas of interest?  Do you need to  
keep track of anything beyond the life of one document being analyzed?

If you were doing this as an analyzer, you would need to buffer the
tokens internally so that you could examine them in a window, and then
make a decision as to which tokens to output.  I believe the
RemoveDuplicatesTokenFilter demonstrates how to do this.  Basically,
you just need a List to hold the tokens once you see certain
conditions met.
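
As a rough illustration of that buffering idea, here is a minimal sketch
written as a standalone preprocessing step in Python rather than as a
Lucene TokenFilter. The function name, the threshold, and the use of
fixed, non-overlapping windows are all assumptions made for the example;
only the marker words and the ~100-word window come from the question.

```python
from typing import Iterable, Iterator, List, Set

# Marker words taken from the question; a real list would be longer.
MARKER_WORDS: Set[str] = {"privileged", "corporate"}


def drop_banner_windows(
    tokens: Iterable[str],
    window: int = 100,      # window size suggested in the question
    threshold: int = 2,     # hypothetical cut-off, tune on real mail
    markers: Set[str] = MARKER_WORDS,
) -> Iterator[str]:
    """Buffer tokens in fixed windows and drop windows dense in markers."""
    buffer: List[str] = []

    def flush() -> Iterator[str]:
        # Decide the fate of the whole buffered window at once.
        hits = sum(1 for t in buffer if t.lower() in markers)
        if hits < threshold:
            yield from buffer   # looks like real content: emit it
        buffer.clear()          # otherwise the window is silently dropped

    for token in tokens:
        buffer.append(token)
        if len(buffer) == window:
            yield from flush()
    yield from flush()          # handle the final, possibly short, window


if __name__ == "__main__":
    body = ["please", "review", "the", "report"]
    banner = ["privileged", "corporate", "confidential", "notice"]
    print(list(drop_banner_windows(body + banner, window=4, threshold=2)))
```

Inside an analyzer the same list-based buffer would live in the filter
and be drained one token at a time. Inverting the test (keeping only the
dense windows) would give the "opposite" case from the question.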



>
>
>
> On another note, is there an easy way to destroy an index...any custom code?

Send in a delete by query command with the *:* query, followed by a
commit so that the deletion takes effect.
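
For example, a minimal sketch in Python of posting that command to the
XML update handler. The URL below is an assumption (a default
single-core install on localhost), and the helper names are made up for
the example; adjust both to your setup.

```python
import urllib.request
from typing import List

# Assumed location of the update handler; change to match your install.
UPDATE_URL = "http://localhost:8983/solr/update"


def build_delete_all_requests(update_url: str) -> List[urllib.request.Request]:
    """Build the two POSTs: delete everything, then commit."""
    bodies = ["<delete><query>*:*</query></delete>", "<commit/>"]
    return [
        urllib.request.Request(
            update_url,
            data=body.encode("utf-8"),
            headers={"Content-Type": "text/xml; charset=utf-8"},
            method="POST",
        )
        for body in bodies
    ]


def delete_all(update_url: str = UPDATE_URL) -> None:
    """Send the requests in order; raises on an HTTP error status."""
    for request in build_delete_all_requests(update_url):
        with urllib.request.urlopen(request) as response:
            response.read()

# Call delete_all() only against an index you really want to empty.
```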

>
>
>
> Thanks for any help!
>
>
>
> -- 
> View this message in context: http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22031139.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

