lucene-solr-user mailing list archives

From "Davis, Daniel (NIH/NLM) [C]" <daniel.da...@nih.gov>
Subject RE: Profanity
Date Mon, 08 Jan 2018 22:11:44 GMT
Fun topic. It raises the same complicated issues as normal search:

Multilingual support? Is "Merde" profanity everywhere, or just in French?
Multi-word synonyms? Does "God Damn" become "goddamn", or do you treat "Damn" and "God
damn" the same because you drop "God"? Is "Merde Alors" the same as "Merde", or is that
again a case of multi-word synonyms?
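
One way to handle the multi-word question is to collapse known phrases to a canonical form before
checking the word list. The following is only a minimal sketch of that idea in plain Java: the
class name PhraseNormalizer and the phrase mappings are hypothetical, and per-language lists would
still need to be chosen separately.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps multi-word variants to one canonical token.
class PhraseNormalizer {
    // Insertion order is kept so that phrases are applied in the order they were added.
    private final Map<String, String> phraseToCanonical = new LinkedHashMap<>();

    PhraseNormalizer add(String phrase, String canonical) {
        phraseToCanonical.put(phrase.toLowerCase(Locale.ROOT), canonical);
        return this;
    }

    String normalize(String text) {
        // Lowercase and collapse runs of whitespace so "God  Damn" matches "god damn".
        String result = text.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
        for (Map.Entry<String, String> e : phraseToCanonical.entrySet()) {
            // Word boundaries keep the phrase from matching inside longer words.
            String regex = "\\b" + Pattern.quote(e.getKey()) + "\\b";
            result = result.replaceAll(regex, Matcher.quoteReplacement(e.getValue()));
        }
        return result;
    }
}
```

With the mappings "god damn" -> "goddamn" and "merde alors" -> "merde", the word list only needs
the canonical entries.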

-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
Sent: Monday, January 8, 2018 4:42 PM
To: solr-user@lucene.apache.org
Subject: RE: Profanity

Yes, an UpdateRequestProcessor is the API to implement for these sorts of requirements. In
the URP you have access to a SolrInputDocument object that carries the input data. You can inspect
the fields, and add, remove or modify fields if you want, or discard the input altogether.

So, check your text input field for profanity and set a separate boolean field according to
whether it matches. Whether you use a list of words, an SVM, or another machine learning
algorithm to detect profanity is up to you.
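
As a rough illustration of the word-list variant, here is a minimal sketch in plain Java with no
Solr dependency. The class name ProfanityFlagger and the field names "text" and "has_profanity"
are hypothetical, and a Map stands in for the input document. In an actual URP the same check
would presumably be called from processAdd(AddUpdateCommand), reading and setting fields on
cmd.getSolrInputDocument().

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: flags a document when any token is on a predefined list.
class ProfanityFlagger {
    // Entries are expected to be lowercase single tokens.
    private final Set<String> blocklist;

    ProfanityFlagger(Set<String> blocklist) {
        this.blocklist = blocklist;
    }

    /** Returns true if any token of the text is on the blocklist. */
    boolean containsProfanity(String text) {
        if (text == null) {
            return false;
        }
        // Split on anything that is not a letter or digit; whole tokens only,
        // so "damnation" does not match "damn".
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (blocklist.contains(token)) {
                return true;
            }
        }
        return false;
    }

    /** Sets flagField on the document to true or false based on textField. */
    void flag(Map<String, Object> doc, String textField, String flagField) {
        Object value = doc.get(textField);
        doc.put(flagField, containsProfanity(value == null ? null : value.toString()));
    }

    public static void main(String[] args) {
        ProfanityFlagger flagger = new ProfanityFlagger(Set.of("damn", "merde"));
        Map<String, Object> doc = new HashMap<>();
        doc.put("text", "Well, damn.");
        flagger.flag(doc, "text", "has_profanity");
        System.out.println(doc.get("has_profanity")); // prints "true"
    }
}
```

Matching whole tokens keeps the behaviour close to a stopword list; a classifier could replace
containsProfanity without changing how the flag field is set.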

Cheers,
Markus
 
-----Original message-----
> From:Sadiki Latty <slatty@uottawa.ca>
> Sent: Monday 8th January 2018 22:12
> To: solr-user@lucene.apache.org
> Subject: Profanity
> 
> Hey
> 
> I would like to find a solution to flag profanity at index time. Ideally, it would function
like stopwords, in the sense that I can have a predefined list that is read, and if a token is
on the list, the document is 'flagged' in a different field. Does anyone know of an existing
solution (other than building my own)? If none exists and I end up building my own, would I be
doing this in the update processor phase? I am still fairly new to Solr, but from what I've
read, that seems to be the best place to look.
> 
> 
> Thanks,
> 
> Sid
> 