lucene-solr-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Improving Readability of Hit Highlighting
Date Mon, 12 Jan 2009 20:38:05 GMT
I'm not sure if I have a good suggestion, but I have a question. :)  What is considered "junk"?
 Would it be possible to eliminate the junk before it even goes into the index in order to
avoid GIGO (Garbage In Garbage Out)?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Terence Gannon <butzi0112@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Monday, January 12, 2009 11:00:31 AM
> Subject: Improving Readability of Hit Highlighting
> 
> I'm indexing text from an OCR of an old document.  Many words get read
> perfectly, but they're typically embedded in a lot of junk.  I would
> like the hit highlighting to show only the 'good' words, in the order
> in which they appeared in the original document.  Is it possible to
> use the output of the filter classes as the text used in hit highlighting?
> Or do you have to do all the text cleanup outside of Solr and present it
> with two fields to index, one with the original text, and one with the
> cleaned-up text?  The objective of the hit highlighting is to give the
> user a *sense* of the original context, even if it's not provided
> verbatim from the original document.  Thanks in advance.
> 
> TerryG
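
For illustration, the "clean up outside of Solr and index two fields" option raised above might look roughly like the sketch below. It is only a sketch under stated assumptions: the junk heuristic, the function names, and the field names (text_original, text_clean) are all hypothetical and do not come from the thread or from any Solr schema. A real OCR corpus would probably need a dictionary check or something stronger.

```python
# Hypothetical heuristic: a token counts as "good" if, after stripping
# surrounding punctuation, it is purely alphabetic, contains a vowel
# (y included), and is either longer than one letter or a one-letter word.
ONE_LETTER_WORDS = {"a", "i"}
VOWELS = set("aeiouy")


def is_good_word(token: str) -> bool:
    core = token.strip(".,;:!?\"'()[]")
    if not core.isalpha():
        return False
    lower = core.lower()
    if len(lower) == 1:
        return lower in ONE_LETTER_WORDS
    return any(ch in VOWELS for ch in lower)


def clean_ocr_text(raw: str) -> str:
    """Keep only the tokens that pass is_good_word, in their original order."""
    return " ".join(tok for tok in raw.split() if is_good_word(tok))


def build_document(doc_id: str, raw: str) -> dict:
    # Field names are assumptions for this sketch, not a real schema.
    return {
        "id": doc_id,
        "text_original": raw,
        "text_clean": clean_ocr_text(raw),
    }


if __name__ == "__main__":
    sample = "Th3 ~~ old |||| mill x#q stood by the r1ver, xx .."
    print(build_document("doc-1", sample))
```

With documents built this way, highlighting could be requested on the cleaned field while the original field is kept for display or reference. Solr builds highlight snippets from a field's stored content, so the cleaned field would need to be stored as well as indexed.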

