lucene-java-user mailing list archives

From Fred Toth <ft...@synernet.com>
Subject Preserving original HTML file offsets for highlighting, need HTMLTokenizer?
Date Mon, 30 May 2005 18:03:11 GMT
Hi all,

Those of you who have read and responded to my recent posts know
that we are working on highlighting the entire document after a search.
(Not fragments in a results list.)

It appears that one of the key tools to assist with this is the ability of
Lucene to store file offsets of terms as part of the TermVector 
(TermVectorOffsetInfo).
However, this is not simple to do when indexing HTML files. Here's why:

If you pass HTML to the StandardAnalyzer, you will get file offsets that are
exactly correct. However, the StandardAnalyzer doesn't know anything
about HTML, so you get tokens like "html", "head", "title", "body", etc.
This may or may not be what you want. In our case, it's not what we want.
If we search for "body", we don't want to get a hit on every single HTML file
in the index because we've indexed the <body> tag.
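To make the tradeoff concrete, here's a minimal sketch (plain Java, no Lucene dependency; the class and method names are my own, not any real API) of a naive word tokenizer run over raw HTML. The offsets it reports are exactly right, but the tag names come through as ordinary terms:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration: tokenize raw HTML the way a plain word
// tokenizer would, recording start/end offsets into the original string.
public class RawTokenDemo {
    // Each token is rendered as "term start end", e.g. "body 1 5".
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = Pattern.compile("[A-Za-z]+").matcher(text);
        while (m.find()) {
            tokens.add(m.group() + " " + m.start() + " " + m.end());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String html = "<body>Hello world</body>";
        for (String t : tokenize(html)) {
            System.out.println(t);
        }
        // "body" shows up twice as a term, even though the visible
        // document text never contains the word "body".
    }
}
```

The offsets are trustworthy because we never transformed the input, which is exactly why this approach is tempting; the false hits on tag names are the price.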

The traditional solution to this is to employ an HTML parser in the indexing
phase. Depending on the parser used (such as the demo parser supplied
with Lucene or NekoHTML or HTMLParser, etc.), you have methods to
obtain the text from the HTML, minus the tags. You hand this text stream to
Lucene and all is well.

Except for the offsets. The offsets are now relative to this new stream and
no longer match the original file.
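The drift is easy to demonstrate with a toy example (again plain Java; the names are mine): strip the tags, locate a word in the stripped text, and the resulting offset no longer points at that word in the original file.

```java
// Hypothetical illustration of the offset-drift problem: offsets computed
// against tag-stripped text do not line up with the original HTML.
public class OffsetDriftDemo {
    // Crude tag stripper, good enough for a demonstration.
    public static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<body>Hello world</body>";
        String text = stripTags(html);              // "Hello world"
        int strippedOffset = text.indexOf("world"); // 6
        int originalOffset = html.indexOf("world"); // 12
        // A highlighter that applied strippedOffset to the original
        // file would start highlighting in the middle of "Hello".
        System.out.println(strippedOffset + " vs " + originalOffset);
    }
}
```

Every tag shifts all subsequent offsets by its own length, so the error grows as you move through the file.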

I'm hoping others have confronted this issue and have some good ideas.

I'm thinking we need something like "HTMLTokenizer" which bridges the
gap between StandardAnalyzer and an external HTML parser. Since so
many of us are dealing with HTML, I would think this would be generally
useful for many problems. It could work this way:

Given this input:

<html><head><title>Howdy there</title></head><body>Hello
world</body></html>

An HTMLTokenizer would deliver something like this token stream
(the numbers are the start/end offsets of each token in the original file):

TAG, <html>, 0, 6
TAG, <head>, 6, 12
TAG, <title>, 12, 19
WORD, Howdy, 19, 24
WORD, there, 25, 30
TAG, </title>, 30, 38
etc.

Given the above, a filter could then strip out the HTML, but pass the WORDs on
to Lucene, preserving the offsets in the source file. These would be used later
during highlighting. Clever filters could be selective about what gets
stripped and what gets passed on.
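Here's a rough sketch of what such a tokenizer plus a tag-stripping filter might look like (plain Java, no Lucene types; HtmlToken, HTMLTokenizerSketch, and the TAG/WORD kinds are all assumptions of mine, not an existing API). The essential point is that every token carries offsets into the ORIGINAL html string, so stripping TAG tokens afterwards doesn't disturb the WORD offsets:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: emit TAG and WORD tokens, each carrying its
// start/end offsets into the original HTML string, then filter out
// the TAGs so only WORDs (with original offsets) reach the indexer.
public class HTMLTokenizerSketch {
    public static final class HtmlToken {
        public final String kind;    // "TAG" or "WORD"
        public final String text;
        public final int start, end; // offsets into the original input
        HtmlToken(String kind, String text, int start, int end) {
            this.kind = kind; this.text = text;
            this.start = start; this.end = end;
        }
        public String toString() {
            return kind + ", " + text + ", " + start + ", " + end;
        }
    }

    public static List<HtmlToken> tokenize(String html) {
        List<HtmlToken> out = new ArrayList<HtmlToken>();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<') {                       // a tag runs to the next '>'
                int close = html.indexOf('>', i);
                if (close < 0) break;             // malformed; give up in this sketch
                out.add(new HtmlToken("TAG", html.substring(i, close + 1), i, close + 1));
                i = close + 1;
            } else if (Character.isLetterOrDigit(c)) {
                int j = i;
                while (j < html.length() && Character.isLetterOrDigit(html.charAt(j))) j++;
                out.add(new HtmlToken("WORD", html.substring(i, j), i, j));
                i = j;
            } else {
                i++;                              // skip whitespace and punctuation
            }
        }
        return out;
    }

    // The "clever filter": drop TAGs, keep WORDs with original offsets.
    public static List<HtmlToken> wordsOnly(List<HtmlToken> tokens) {
        List<HtmlToken> out = new ArrayList<HtmlToken>();
        for (HtmlToken t : tokens) {
            if (t.kind.equals("WORD")) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Howdy there</title></head>"
                    + "<body>Hello world</body></html>";
        for (HtmlToken t : wordsOnly(tokenize(html))) {
            System.out.println(t);
        }
    }
}
```

A real version would also need to handle comments, CDATA, attributes containing '>', and entity references, which is where delegating to a proper parser (NekoHTML etc.) while keeping this offset bookkeeping gets harder.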

Has anyone else hit this problem (and hopefully solved it)? Any suggestions
appreciated, as always.

Thanks,

Fred Toth

