lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: HTML tag filter...
Date Sat, 10 Jan 2004 20:31:49 GMT
On Jan 10, 2004, at 1:43 PM, ambiesense@gmx.de wrote:
> would it be possible to implement a Analyser who filters HTML code out 
> of a
> HTML page. As a result I would have only the text free of any tagging.

The dilemma is that in a general sense there are multiple fields in 
HTML.  At least "title" and "body", and perhaps others from metadata.  
An Analyzer operates only on a single field at a time so cannot split 
its input into multiple fields.

But, yes, it would be possible to strip bracketed text out.  My 
recommendation for such would be to implement a Tokenizer to do this 
that could be used to feed an analysis process.

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message