lucene-java-user mailing list archives

From Scott Smith <ssm...@mainstreamdata.com>
Subject RE: Highlighting html pages
Date Tue, 06 Nov 2012 00:06:40 GMT
Since no one answered this, I decided I'd answer it myself (in case anyone else wanted the
answer).

First, there are two types of filters you can use in an Analyzer: character filters and
token filters.  Character filters are applied before tokenization, and token filters are applied
after tokenization.

So, my question didn't really make sense.  The HTMLStripCharFilter is a character filter and
therefore gets applied to the HTML data before it goes to the tokenizer.  You can then apply
any tokenizer you wish (including StandardTokenizer).
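
To make the ordering concrete, here is a rough sketch in plain Java (no Lucene dependency) of
a character-filter stage followed by a tokenizer stage.  It is not how HTMLStripCharFilter is
implemented; the names StripThenTokenize, Token and tokenize are made up for illustration, and
the tag handling is deliberately naive (no entities, comments or script blocks).  The point is
only that tags are dropped before tokenizing, while each token still remembers where it sat in
the original HTML.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: mimics "char filter, then tokenizer"; not Lucene's implementation. */
final class StripThenTokenize {

    /** A token plus its start/end offsets in the ORIGINAL (unstripped) text. */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    static List<Token> tokenize(String html) {
        // Stage 1, the "character filter": drop everything between '<' and '>',
        // recording for every kept character where it came from in the original.
        StringBuilder stripped = new StringBuilder();
        List<Integer> origOffset = new ArrayList<>();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                stripped.append(c);
                origOffset.add(i);
            }
        }

        // Stage 2, the "tokenizer": split the stripped text on anything that is
        // not a letter or digit, and map token boundaries back to the original.
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < stripped.length()) {
            if (!Character.isLetterOrDigit(stripped.charAt(i))) {
                i++;
                continue;
            }
            int begin = i;
            while (i < stripped.length() && Character.isLetterOrDigit(stripped.charAt(i))) {
                i++;
            }
            tokens.add(new Token(stripped.substring(begin, i),
                                 origOffset.get(begin),
                                 origOffset.get(i - 1) + 1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String html = "Now is <span class=\"underline\">the time</span> for";
        for (Token t : tokenize(html)) {
            System.out.println(t.text + " [" + t.start + "," + t.end + ")");
        }
    }
}
```

Because the tokenizer only ever sees the stripped text, words inside tags (such as "underline"
in the class attribute) never become tokens, so they can never be highlighted by accident.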

There is one caveat you might want to be aware of when using the HTMLStripCharFilter and then
highlighting search terms.  Assume you strip the HTML with the HTMLStripCharFilter, use the
StandardTokenizer, and then run the document through the highlighter.  If the document
contains other HTML tags (besides whatever you are using for highlighting - <b> by default),
then you can have cases where the highlight tags aren't properly nested.

For example you could end up with:

	Now is <span class="underline">the <b>time</span></b> for all good men to come...

Note that the <b> element isn't properly nested inside the span: it opens inside the span but
closes after it.  For plain HTML, I would assume the browser will work it out.  However, if you
are using XML, the document will no longer be well-formed.  The problem is that the highlighter
appears to place the ending tag (the </b>) just before the next word after the highlighted term
instead of directly after the marked word ("time").  This means that if the HTMLStripCharFilter
eliminated any HTML tags between those two points, the closing </b> will come after those tags
instead of before.

Admittedly, you can make up cases where the highlighter will get it right, but it appears
to me that this only happens with phrases.  For single words (the more likely case), the closing
highlighting sequence (</b>) should come directly after the highlighted word.  Regardless, it's
impossible for the highlighter to get it right all the time, and you may have to write code
that goes in and fixes things up if you're using XML or you're really particular about tags being
properly nested.
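
As a rough idea of what such fix-up code might look like, here is a small plain-Java sketch
(again no Lucene dependency).  It only handles the one case described above: a closing </b>
that landed after one or more other tags instead of directly after the highlighted word.  The
class name HighlightTagFixer and the method fix are hypothetical, the highlight tag is assumed
to be a bare <b> with no attributes, and a regex like this is a heuristic, not a real HTML or
XML parser.

```java
import java.util.regex.Pattern;

/** Heuristic sketch: pulls a misplaced </b> back to directly after the highlighted text. */
final class HighlightTagFixer {

    // Group 1: "<b>" plus the highlighted text (no tags inside).
    // Group 2: one or more tags other than <b> or </b>, with optional whitespace around them.
    // Then the "</b>" that ended up after those tags.
    private static final Pattern MISPLACED_CLOSE =
            Pattern.compile("(<b>[^<]*?)((?:\\s*<(?!/?b>)[^>]+>)+\\s*)</b>");

    /** Moves each misplaced closing highlight tag in front of the tags it overran. */
    static String fix(String highlighted) {
        return MISPLACED_CLOSE.matcher(highlighted).replaceAll("$1</b>$2");
    }

    public static void main(String[] args) {
        String in = "Now is <span class=\"underline\">the <b>time</span></b> for all good men to come...";
        System.out.println(fix(in));
    }
}
```

For the example above this turns <b>time</span></b> into <b>time</b></span>, and it leaves
text that is already properly nested alone.  It does not handle a highlight that opens before
a tag and closes inside it; that case would need a real parser or a second pass.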

Cheers

Scott

-----Original Message-----
From: Scott Smith [mailto:ssmith@mainstreamdata.com] 
Sent: Thursday, November 01, 2012 7:16 PM
To: Michael Sokolov; java-user@lucene.apache.org
Subject: RE: Highlighting html pages

I was trying to play with this.  Am I correct in assuming that this isn't going to work with
the StandardTokenizer (since it appears to strip angle brackets among other things)?  Does
HTMLStripCharFilter expect a WhitespaceTokenizer or a CharacterTokenizer or ??

If I want to get rid of punctuation (commas, periods, semicolons, etc.) after the HTML stripping,
is there a filter?  Essentially, I want to get it back to what StandardTokenizer would give
me after I've stripped the HTML.

Suggestions?

Scott

-----Original Message-----
From: Michael Sokolov [mailto:sokolov@ifactory.com] 
Sent: Tuesday, October 23, 2012 9:04 PM
To: java-user@lucene.apache.org
Cc: Scott Smith
Subject: Re: Highlighting html pages

If you use HTMLStripCharFilter, it extracts the text only, leaving tags out, and remembering
the word positions so that highlighting works properly.  Should do exactly what you want out
of the box...


On 10/23/2012 8:00 PM, Scott Smith wrote:
> I need to take an html page that I retrieve from my lucene search and highlight all
> of the terms that are part of the search.  I need to skip over any html tags since I don't
> want any words in tags which happen to match the search to be highlighted.
>
> Note that I don't want sections of the document.  I need to highlight all terms in the
> document (with a <span> or something similar) and get back the entire document (with
> the new <span>s) so it can be displayed in its entirety with the search terms highlighted.
>
> Last time I did this (in the days of 1.4.2 - so a while ago), I had to write a custom
> tokenizer that skipped over the html tokens so that I didn't accidentally highlight them.
> I'm hoping that there is an easier way to do this now.
>
> Suggestions?
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

