lucene-solr-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com>
Subject Re: best practice handling html content
Date Mon, 19 Apr 2010 16:19:52 GMT

> we want to index and search in our intranet documents.
> the field "body" contains html-tags.
> 
> in our schema.xml we have a fieldType text_de (see at the
> end of this mail) which uses charFilter
> solr.HTMLStripCharFilterFactory with index. 
> so this is no problem. the text is put into the index
> without any html. i can do search over this field, also html
> entities like &auml; for a german umlaut (ä) do work,
> &nbsp; are filtered out correctly, support for german
> language etc.
> 
> so now comes the problem. the field body is defined like
> 
> <field name="body" type="text_de" indexed="true"
> stored="true" />
> 
> so we do index it and also store the content. on the result
> page when we are printing body or the highlighting on body we
> have all the html tags back. sounds correct, as the
> HTML-Filter only works on the indexing...
> 
> so my question is, how is the best way to handle this case?
> strip out all html before adding the document to the index.

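Yes, I think stripping the HTML before you add the document to the index is the best way
if you want to display html-stripped content. The stored value is always the original input;
the char filter only changes what goes into the indexed terms. By stripping first you will
save disk space too, since the tags are no longer stored.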
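
Stripping before indexing could look roughly like the sketch below. It is only an illustration, not what Solr's HTMLStripCharFilter does internally: it uses Python's standard html.parser, and the names strip_html and _TextExtractor, as well as the sample document dict, are made up for this example. It assumes you control the code that builds the documents before they are posted to Solr.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes, drops tags, <script> and <style>."""

    _SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &auml; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join with a space so text from adjacent block elements does not run
        # together, then collapse all whitespace (including the no-break space
        # produced by &nbsp;). Caveat: a tag inside a word also yields a space.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the plain text of an HTML fragment (sketch, not Solr's filter)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Assumed usage: clean the body before the document is sent to Solr.
raw_body = "<p>Gr&uuml;&szlig;e&nbsp;aus <b>Berlin</b></p><script>x=1</script>"
doc = {"id": "1", "body": strip_html(raw_body)}
```

With this approach the stored field and the highlighting fragments contain plain text only. If you also need the original markup somewhere, it would have to be kept in a separate stored field.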

Similar discussion: http://search-lucene.com/m/hyKqg1MJEDL



      
