lucene-java-user mailing list archives

From Grant Ingersoll <>
Subject Re: Negative Filtering (such as for profanity)
Date Thu, 08 Mar 2007 19:56:54 GMT
I _think_ Lucene 2.1 (or is it trunk?, I lose track) has the ability  
to delete all documents containing a term.  So, every time you update  
your profanity list, you could iterate over it and remove all  
documents that contain the terms.
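
The purge loop described above can be sketched in plain Java. This is a toy model, not Lucene code: `ProfanityPurge`, `addDocument`, `deleteDocuments` and `purge` are hypothetical names over an in-memory map of terms to document ids. In a real index, the per-term delete would be whatever delete-by-term call your Lucene version provides (check its Javadoc).

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy in-memory index (hypothetical, NOT Lucene): term -> ids of documents containing it.
class ProfanityPurge {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final Set<Integer> liveDocs = new TreeSet<>();

    // Index a document by splitting its text on whitespace.
    void addDocument(int docId, String text) {
        liveDocs.add(docId);
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new HashSet<>()).add(docId);
            }
        }
    }

    // Mark every live document containing the term as deleted; returns how many were removed.
    int deleteDocuments(String term) {
        int removed = 0;
        for (int docId : postings.getOrDefault(term.toLowerCase(), Collections.<Integer>emptySet())) {
            if (liveDocs.remove(docId)) {
                removed++;
            }
        }
        return removed;
    }

    // Iterate over the profanity list and delete by term, as suggested in the message.
    int purge(Collection<String> profanityList) {
        int removed = 0;
        for (String term : profanityList) {
            removed += deleteDocuments(term);
        }
        return removed;
    }

    Set<Integer> liveDocs() {
        return Collections.unmodifiableSet(liveDocs);
    }

    public static void main(String[] args) {
        ProfanityPurge index = new ProfanityPurge();
        index.addDocument(0, "a perfectly clean document");
        index.addDocument(1, "this one says darn");
        index.addDocument(2, "darn and heck together");
        int removed = index.purge(Arrays.asList("darn", "heck"));
        System.out.println(removed + " removed, live docs: " + index.liveDocs());
    }
}
```

Re-running `purge` after each update to the list only touches documents that contain the newly added terms, so nothing has to be reindexed.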

If a user can never get these documents via a query, then I don't see  
any reason to allow them in the index to begin with.

Also, I don't use QueryFilters much, but I'm curious as to how they
perform on that many docs.

On Mar 7, 2007, at 5:38 PM, Greg Gershman wrote:

> I thought about this, as I think overall the resources required  
> would be less than creating a filter.  Ultimately I decided against  
> it for a few reasons:
> 1) I'm working with an existing index of ~50 million documents, I  
> don't want to reindex the whole thing, or even just the documents  
> that contain profanity, if I can avoid it.
> 2) Filtering at indexing time means I can't effectively add new  
> words to the profanity list without reindexing.
> Good suggestion, though, I appreciate it.
> Greg
> ----- Original Message ----
> From: Grant Ingersoll <>
> To:
> Sent: Wednesday, March 7, 2007 2:07:38 PM
> Subject: Re: Negative Filtering (such as for profanity)
> Not sure if this is helpful given your proposed solution, but could you
> do something on the indexing side, such as:
> 1.  Remove the profanity from the token stream, much like a
> stopword.  This would also mean stripping it from the display text.
> 2. If your TokenFilter comes across a profanity, somehow mark the
> document as containing a profanity via a "profanity" Field (not sure
> if there is a way, in Lucene, to add another Field while you are in
> the analysis phase, but you could also have it update a table in a db
> or something.)  Then on search, you could just say (regular query)
> +profanity:false
> HTH,
> Grant
> On Mar 7, 2007, at 10:07 AM, Greg Gershman wrote:
>> I'm attempting to create a profanity filter.  I thought to use a
>> QueryFilter created with a Query of (-$#!+ AND -@#$% AND etc).  The
>> problem I have run into is that, as a pure negative query is not
>> supported (a query for (-term) DOES NOT return the inverse of a
>> query for (term)), I believe the bit set returned by a purely
>> negative QueryFilter is empty, so no matter how many results
>> returned by the initial query, the result after filtering is always
>> zero documents.
>> I was wondering if anyone had suggestions as to how else to do
>> this.  I've considered simply amending the query string submitted
>> by the user to include a pre-generated String that would exclude
>> the query terms, but I consider this a non-elegant solution.  I had
>> also thought about creating a new sub-class of QueryFilter,
>> NegativeQueryFilter.  Basically, it would work just like a
>> QueryFilter, taking a positive query (so, I would pass it an OR'ed
>> list of profane words), then the resulting bits are simply
>> flipped.  I think this would work, unless I'm missing something.
>> I'm going to experiment with it, I'd appreciate anyone's thoughts
>> on this.
>> Thanks,
>> Greg
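
The bit-flipping idea in the message above can be modelled with `java.util.BitSet` alone. This is a rough sketch under assumptions, not the QueryFilter API: `NegativeFilterSketch`, `negate` and `applyFilter` are hypothetical names, and `maxDoc` stands for the number of document ids in the index. It also shows why an empty filter bit set always leaves zero results.

```java
import java.util.BitSet;

// Plain-Java model of the proposed NegativeQueryFilter idea; no Lucene classes are used.
class NegativeFilterSketch {

    // Turn a positive match set (docs containing any profane term) into an
    // "allowed" set: every doc id in [0, maxDoc) that is NOT a positive match.
    // Equivalent to flipping the bits over that range.
    static BitSet negate(BitSet positiveMatches, int maxDoc) {
        BitSet allowed = new BitSet(maxDoc);
        allowed.set(0, maxDoc);          // start with all documents allowed
        allowed.andNot(positiveMatches); // clear the ones that matched
        return allowed;
    }

    // Keep only the query hits whose bit is set in the filter.
    static BitSet applyFilter(BitSet queryHits, BitSet filterBits) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(filterBits);
        return result;
    }

    public static void main(String[] args) {
        int maxDoc = 6;

        BitSet profane = new BitSet(maxDoc); // docs matching the OR'ed profanity query
        profane.set(1);
        profane.set(4);

        BitSet hits = new BitSet(maxDoc);    // docs matching the user's query
        hits.set(0);
        hits.set(1);
        hits.set(2);
        hits.set(4);

        // An empty filter bit set removes every hit.
        System.out.println("empty filter:   " + applyFilter(hits, new BitSet(maxDoc)));

        // The negated positive filter keeps only the clean hits.
        System.out.println("negated filter: " + applyFilter(hits, negate(profane, maxDoc)));
    }
}
```

One thing the sketch ignores: the ids of deleted documents would also end up set after the flip, so a real filter may need to account for how the searcher treats deletions.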
> --------------------------
> Grant Ingersoll
> Center for Natural Language Processing

Grant Ingersoll
