lucene-solr-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: how to present html content in browse
Date Fri, 04 May 2012 17:27:36 GMT
1. The raw html field (call it, "text_html") would be a "string" type field 
that is "stored" but not "indexed". This is the field you direct DIH to 
output to. This is the field you would return in your search results with 
the HTML to be displayed.

2. The stripped field (call it, "text_stripped") would be a "text" type 
field (where "text" is a field type you add that uses the HTML strip char 
filter as shown below) that is not "stored" but is "indexed". Add a copyField 
to your schema that copies from the raw html field to the stripped field 
(say, "text_html" to "text_stripped").
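
As a rough sketch, the two fields and the copy rule described above might look 
like this in schema.xml. This assumes the stock "string" field type is defined 
and that the stripping field type is named "text"; adjust the type name if you 
call it something else.

```xml
<!-- Raw HTML: stored so it can be returned in results, not indexed -->
<field name="text_html" type="string" indexed="false" stored="true"/>
<!-- Stripped text: indexed for searching, not stored -->
<field name="text_stripped" type="text" indexed="true" stored="false"/>

<!-- Feed the raw HTML into the stripped field at index time -->
<copyField source="text_html" dest="text_stripped"/>
```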

For reference on HTML strip (HTMLStripCharFilterFactory), see:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Which has:

<fieldtype name="text" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>

Although, you might want to call that field type "text_stripped" to avoid 
confusion with a simple text field.

You can add HTMLStripCharFilterFactory to some other field type that you 
might want to use, but this "charFilter" needs to be before the "tokenizer". 
The "text" field type above is just an example.
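
For instance, a minimal field type of your own with the char filter in the 
right place could look something like the following. The name "text_stripped" 
and the choice of tokenizer and filter here are only assumptions for 
illustration, not a recommendation.

```xml
<fieldType name="text_stripped" class="solr.TextField">
  <analyzer>
    <!-- charFilters work on the raw character stream, so they come first -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```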

-- Jack Krupansky

-----Original Message----- 
From: okayndc
Sent: Friday, May 04, 2012 1:01 PM
To: solr-user@lucene.apache.org
Subject: Re: how to present html content in browse

Hello,

I'm having a hard time understanding this, and I had this same question.

When using DIH should the HTML field be stored in the raw HTML string field
or the stripped field?
Also what source field(s) need to be copied and to what destination?

Thanks


On Thu, May 3, 2012 at 10:15 PM, Lance Norskog <goksron@gmail.com> wrote:

> Make two fields, one which stores the stripped HTML and another that
> stores the parsed HTML. You can use <copyField> so that you do not
> have to submit the html page twice.
>
> You would mark the stripped field 'indexed=true stored=false' and the
> full text field the other way around. The full text field should be a
> String type.
>
> On Thu, May 3, 2012 at 1:04 PM, srini <softtech88@gmail.com> wrote:
> > I am indexing records from database using DIH. The content of my record
> > is in html format. When I use browse I would like to show the content in
> > html format, not in text format. Any ideas?
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/how-to-present-html-content-in-browse-tp3960327.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
> 

