lucene-java-user mailing list archives

From Steve Rowe <sar...@gmail.com>
Subject Re: Highlighting html pages
Date Tue, 06 Nov 2012 08:29:53 GMT
Hi Scott,

HTMLStripCharFilter doesn't require that its input be valid HTML - there is no assumption
of balanced tags.  
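
To illustrate what "no assumption of balanced tags" means in practice, here is a toy sketch in plain Java. It is not Lucene's HTMLStripCharFilter implementation (the real filter handles much more, such as entities and comments); the class name ToyTagStripper and its fields are made up for this example. It drops anything between '<' and '>' without ever pairing opening and closing tags, and records where each kept character came from so offsets can be mapped back to the original input.

```java
import java.util.Arrays;

// Toy illustration only: NOT Lucene's HTMLStripCharFilter.
// ToyTagStripper, text and origOffset are hypothetical names.
final class ToyTagStripper {
    final String text;       // the input with everything inside <...> removed
    final int[] origOffset;  // origOffset[i] = index in the input of text.charAt(i)

    ToyTagStripper(String html) {
        StringBuilder out = new StringBuilder();
        int[] map = new int[html.length()];
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                // Skip tag content; tags are never matched against each other.
                if (c == '>') inTag = false;
            } else if (c == '<') {
                inTag = true;
            } else {
                map[out.length()] = i;  // remember the original position
                out.append(c);
            }
        }
        text = out.toString();
        origOffset = Arrays.copyOf(map, out.length());
    }

    public static void main(String[] args) {
        // Deliberately mismatched tags: <b> is "closed" by </i>.
        ToyTagStripper s = new ToyTagStripper("a <b>bold</i> c");
        System.out.println(s.text);
        System.out.println(s.origOffset[2]);
    }
}
```

The offset array is the part that matters for highlighting: a match found in the stripped text can be translated back to a position in the original markup.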

Also, highlighted sections could span tags, e.g. if you highlight "this phrase", and the original
HTML looks like:

	… this<span>phrase</span> …

the highlighting code would have to know to emit multiple tag pairs to keep the markup well-formed,
maybe something like:

	… <b>this</b><span><b>phrase</b></span> … 
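
One way to get that result, as a rough sketch rather than anything that exists in Lucene: given the original HTML and a highlight range expressed as offsets into that original HTML, wrap each run of text between tags separately. The class TagAwareHighlighter and its method are hypothetical, and the tag detection is naive (it assumes '<' and '>' only ever delimit tags, so comments, scripts, and '>' inside attribute values would confuse it).

```java
// Hypothetical helper, not part of Lucene: wraps a highlight range in <b>
// tags, closing and reopening the <b> around any tag inside the range.
final class TagAwareHighlighter {

    // start (inclusive) and end (exclusive) are offsets into the original html.
    static String highlight(String html, int start, int end) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false;  // currently between '<' and '>'
        boolean open = false;   // a <b> we emitted is currently open
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (!inTag && c == '<') {
                // Close our <b> before any existing tag begins.
                if (open) { out.append("</b>"); open = false; }
                inTag = true;
            }
            if (!inTag) {
                boolean inRange = i >= start && i < end;
                if (inRange && !open) { out.append("<b>"); open = true; }
                else if (!inRange && open) { out.append("</b>"); open = false; }
            }
            out.append(c);
            if (inTag && c == '>') inTag = false;
        }
        if (open) out.append("</b>");
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "this<span>phrase</span>";
        // Highlight from the start of "this" to the end of "phrase".
        int end = html.indexOf("phrase") + "phrase".length();
        System.out.println(highlight(html, 0, end));
        // prints <b>this</b><span><b>phrase</b></span>
    }
}
```

Because the <b> is always closed before an existing tag starts and reopened after it ends, the inserted tags never straddle the document's own markup, even when the end offset of a highlight lands after a closing tag.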

If you do develop a solution here, it would be great if you could share it with the community.

Also, I think it would be useful to have an XML-specific stripping char filter - it's on my
long term to-do list :).

Steve

On Nov 6, 2012, at 1:06 AM, Scott Smith <ssmith@mainstreamdata.com> wrote:

> Since no one answered this, I decided I'd answer it myself (in case anyone else wanted
> the answer).
> 
> First, there are two types of filters you can use in an Analyzer -- character filters
> and token filters.  Character filters get applied before tokenization and token filters get
> applied after tokenization.
> 
> So, my question was really nonsensical.  The HTMLStripCharFilter is a character filter
> and therefore gets applied to the html data before it goes to the tokenizer.  You can then
> apply any tokenizer you wish (including StandardTokenizer).
> 
> There is one caveat you might want to be aware of when using the HTMLStripCharFilter
> and then highlighting search terms.  Assume you strip the html characters with the
> HTMLStripCharFilter and then use the standard tokenizer.  Now you run it through the
> highlighter.  If there were other html tags (besides whatever you are using for
> highlighting - <b> by default), then you can have cases where your tags won't be
> properly nested.
> 
> For example you could end up with:
> 
> 	Now is <span class="underline">the <b>time</span></b> for all good men to come...
> 
> Note that the <b> isn't properly nested between the beginning and ending span.
> For straight html, I would assume the browser will work it out.  However, if you are using
> xml, the document will become invalid.  The problem is that the html highlight code appears
> to place the ending tag (the </b>) before the next word after the highlight term instead
> of after the marked word ("time").  This means that if there are any html tags that the
> HTMLStripCharFilter eliminated, the closing </b> will come after those tags instead of
> before them.
> 
> Admittedly, you can make up cases where the highlighter will get it right, but it appears
> to me that that only happens with phrases.  For single words (the more likely case), the
> closing highlighting sequence (</b>) should be after the highlighted word.  Regardless, it's
> impossible for the highlighter to get it right all the time and you may have to write code
> that goes in and fixes stuff up if you're using xml or you're really anal about tags being
> properly nested.
> 
> Cheers
> 
> Scott
> 
> -----Original Message-----
> From: Scott Smith [mailto:ssmith@mainstreamdata.com] 
> Sent: Thursday, November 01, 2012 7:16 PM
> To: Michael Sokolov; java-user@lucene.apache.org
> Subject: RE: Highlighting html pages
> 
> I was trying to play with this.  Am I correct in assuming that this isn't going to work
> with the StandardTokenizer (since it appears to strip angle brackets among other things)?
> Does HTMLStripCharFilter expect a WhitespaceTokenizer or a CharacterTokenizer or ??
> 
> If I want to get rid of punctuation (commas, periods, semicolons, etc.) after the HTML
> stripping, is there a filter?  Essentially, I want to get it back to what StandardTokenizer
> would give me after I've stripped the HTML.
> 
> Suggestions?
> 
> Scott
> 
> -----Original Message-----
> From: Michael Sokolov [mailto:sokolov@ifactory.com] 
> Sent: Tuesday, October 23, 2012 9:04 PM
> To: java-user@lucene.apache.org
> Cc: Scott Smith
> Subject: Re: Highlighting html pages
> 
> If you use HTMLStripCharFilter, it extracts the text only, leaving tags out, and remembering
> the word positions so that highlighting works properly.  Should do exactly what you want out
> of the box...
> 
> 
> On 10/23/2012 8:00 PM, Scott Smith wrote:
>> I need to take an html page that I retrieve from my lucene search and highlight
>> all of the terms that are part of the search.  I need to skip over any html tags since I don't
>> want any words in tags which happen to match the search to be highlighted.
>> 
>> Note that I don't want sections of the document.  I need to highlight all terms in
>> the document (with a <span> or something similar) and get back the entire document (with
>> the new <span>s) so it can be displayed in its entirety with the search terms highlighted.
>> 
>> Last time I did this (in the days of 1.4.2 - so a while ago), I had to write a custom
>> tokenizer that skipped over the html tokens so that I didn't accidentally highlight them.
>> I'm hoping that there is an easier way to do this now.
>> 
>> Suggestions?
>> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 

