lucene-java-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Preserving original HTML file offsets for highlighting, need HTMLTokenizer?
Date Fri, 03 Jun 2005 17:06:25 GMT
Fred Toth wrote:
> I'm thinking we need something like "HTMLTokenizer" which bridges the
> gap between StandardAnalyzer and an external HTML parser. Since so
> many of us are dealing with HTML, I would think this would be generally
> useful for many problems. It could work this way:
> 
> Given this input:
> 
> <html><head><title>Howdy there</title></head><body>Hello world</body></html>
> 
> An HTMLTokenizer would deliver something like this sort of token stream
> (the numbers represent the start/end offsets for the token):
> 
> TAG, <html>, 0, 6
> TAG, <head>, 6, 12
> TAG, <title>, 12, 19
> WORD, Howdy, 19, 24
> WORD, there, 25, 30
> TAG, </title>, 30, 38
> etc.
> 
> Given the above, a filter could then strip out the HTML, but pass the
> WORDs on to Lucene, preserving the offsets in the source file. These
> would be used later during highlighting. Clever filters could be
> selective about what gets stripped and what gets passed on.

For what it's worth, I think that's a good design and would love to see 
this as a contribution.

Doug
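
A minimal sketch of the idea discussed above, in plain Java with no Lucene dependency. Everything here is hypothetical: HtmlOffsetSketch, Token, tokenize and stripTags are illustrative names, not Lucene classes or APIs. Offsets are assumed to be half-open [start, end) character offsets into the original string. The scanner is deliberately naive: it treats anything between '<' and the next '>' as a tag and does not handle comments, scripts, entities, or '>' inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not part of Lucene. All names are illustrative.
class HtmlOffsetSketch {

    /** A token plus its half-open [start, end) offsets in the source text. */
    static final class Token {
        final String type;   // "TAG" or "WORD"
        final String text;
        final int start;
        final int end;

        Token(String type, String text, int start, int end) {
            this.type = type;
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + ", " + text + ", " + start + ", " + end;
        }
    }

    /** Tokenizer stage: emits TAG and WORD tokens with source offsets. */
    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<') {
                // Naive assumption: a tag runs to the next '>' (or end of input).
                int close = html.indexOf('>', i);
                int end = (close < 0) ? html.length() : close + 1;
                tokens.add(new Token("TAG", html.substring(i, end), i, end));
                i = end;
            } else if (Character.isLetterOrDigit(c)) {
                int end = i;
                while (end < html.length()
                        && Character.isLetterOrDigit(html.charAt(end))) {
                    end++;
                }
                tokens.add(new Token("WORD", html.substring(i, end), i, end));
                i = end;
            } else {
                i++; // skip whitespace and punctuation
            }
        }
        return tokens;
    }

    /** Filter stage: drops TAG tokens; WORD offsets are left untouched. */
    static List<Token> stripTags(List<Token> tokens) {
        List<Token> words = new ArrayList<>();
        for (Token t : tokens) {
            if (t.type.equals("WORD")) {
                words.add(t);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Howdy there</title></head>"
                + "<body>Hello world</body></html>";
        for (Token t : stripTags(tokenize(html))) {
            // Each offset pair indexes into the original HTML, so a
            // highlighter could locate the word in the source file.
            System.out.println(t + " -> " + html.substring(t.start, t.end));
        }
    }
}
```

Keeping the tag-stripping step separate from the tokenizer is what would let a "clever filter" choose which tags to drop or pass on; because only the token list is filtered and the source string is never rewritten, the surviving WORD offsets still point into the original file.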
