lucene-solr-user mailing list archives

From Reece <liquidp...@gmail.com>
Subject Re: Indexing content, storing html
Date Fri, 22 Feb 2008 19:02:42 GMT
I did this as well, but ran into problems when searching (tags between
words made searching a nightmare).  I recommend stripping out all the
tags at index time, using the HTMLTokenFilterFactory or your own regex,
and storing the actual HTML in a separate database.
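
As a rough illustration of the "strip the tags before indexing" idea,
here is a minimal client-side sketch in Python.  It is not the Solr
filter itself; it uses the standard library's html.parser in place of a
regex, and the names TextExtractor and strip_tags are made up for this
example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html: str) -> str:
    """Return the text of `html` with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Assumption: text chunks are joined with a space so that block-level
    # tags do not glue neighbouring words together. A tag inside a word
    # would therefore split that word in two.
    return " ".join(" ".join(parser.parts).split())
```

The stripped text would go into the indexed field, while the original
HTML is kept elsewhere and looked up by id.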

If you really want to store the HTML though, you can use a CDATA section
in the XML like this:

<?xml version="1.0" encoding="UTF-8" ?>
<add>
    <doc>
        <field name="id">123</field>
        <field name="title"><![CDATA[yourbightmlstring]]></field>
    </doc>
</add>

The CDATA section basically says that anything between its delimiters
is taken literally as the field value.  It only breaks if your HTML
string has a "]]>" in it, which ends the CDATA section early.
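
If a stray "]]>" is a concern, one common workaround is to split it
across two adjacent CDATA sections, which an XML parser reads back as
the original text.  Below is a minimal Python sketch of that idea; the
helper names to_cdata and build_add_xml are made up for illustration,
and the field names simply mirror the example above.

```python
from xml.sax.saxutils import escape


def to_cdata(text: str) -> str:
    """Wrap `text` in CDATA, splitting any "]]>" across two sections."""
    # "]]>" becomes "]]" + end of section + start of new section + ">".
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def build_add_xml(doc_id: str, html: str) -> str:
    """Build an <add> message with the HTML carried in a CDATA section."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<add><doc>"
        f'<field name="id">{escape(doc_id)}</field>'
        f'<field name="title">{to_cdata(html)}</field>'
        "</doc></add>"
    )
```

For example, build_add_xml("123", "<p>a ]]> b</p>") yields a
well-formed document whose title field parses back to the original
HTML string.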

-Reece



On Fri, Feb 22, 2008 at 12:19 PM, Paul deGrandis
<paul.degrandis@gmail.com> wrote:
> Hi all,
>
>  I'm working on a solr app that pulls HTML from an embedded JavaScript
>  WYSIWYG editor, and I need to index on the content, but store and
>  reproduce the HTML.  The problem I have is when I try to add and
>  commit, the HTML gets interpreted as XML.  Is the way to do this
>  properly to create an HTMLTokenFilterFactory?  And if so, is there a
>  collection of plugins (like filters and such) that someone can point
>  me to?
>
>  Regards,
>  Paul
>
