lucene-solr-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com>
Subject Re: Displaying highlights in formatted HTML document
Date Thu, 09 Jun 2011 06:55:49 GMT


--- On Thu, 6/9/11, Bryan Loofbourrow <bloofbourrow@knowledgemosaic.com> wrote:

> From: Bryan Loofbourrow <bloofbourrow@knowledgemosaic.com>
> Subject: Displaying highlights in formatted HTML document
> To: solr-user@lucene.apache.org
> Date: Thursday, June 9, 2011, 2:14 AM
> Here is my use case:
> 
> 
> 
> I have a large number of HTML documents, sizes in the
> 0.5K-50M range, most
> around, say, 10M.
> 
> 
> 
> I want to be able to present the user with the formatted
> HTML document, with
> the hits tagged, so that he may iterate through them, and
> see them in the
> context of the document, with the document looking as it
> would be presented
> by a browser; that is, fully formatted, with its tables and
> italics and font
> sizes and all.
> 
> 
> 
> This is something that the user would explicitly request
> from within a set
> of search results, not something I’d expect to have
> returned from an initial
> search – the initial search merely returns the snippets
> around the hits. But
> if the user wants to dive into one of the returned results
> and see them in
> context, I need to be able to go get that.
> 
> 
> 
> We are currently solving this problem by using an entirely
> separate search
> engine (dtSearch), which performs the tagging of the hits
> in the HTML just
> fine. But the solution is unsatisfactory because there are
> Solr searches
> that dtSearch’s capabilities cannot reasonably match.
> 
> 
> 
> Can anyone suggest a good way to use Solr/Lucene for this
> instead? I’m
> thinking a separate core for this purpose might make sense,
> so as not to
> burden the primary search core with the full contents of
> the document. But
> after that, I’m stuck. How can I get Solr to express the
> highlighting in the
> context of the formatted HTML document?
> 
> 
> 
> If Solr does not do this currently, and anyone can suggest
> ways to add the
> feature, any tips on how this might best be incorporated
> into the
> implementation would be welcome.

I am doing the same thing (solr trunk) using the following field type:

<fieldType name="HTMLText" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
    <!-- Strips the HTML markup before tokenizing; it takes no mapping attribute -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index.txt" ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  </analyzer>
</fieldType>

In your separate core, which is queried when the user wants to dive into one of the
returned results, feed your HTML files into this field.
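
As a rough sketch, the field in that core's schema.xml could be declared along these lines. The field names "id" and "content" and the "string" type are assumptions for illustration, not taken from my setup. The field needs to be stored, because the highlighter works on the stored value (here, the original HTML).

```xml
<!-- Hypothetical field names; "string" is assumed to be defined elsewhere in schema.xml -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<!-- stored="true" keeps the raw HTML so the highlighter can tag hits in it -->
<field name="content" type="HTMLText" indexed="true" stored="true"/>
```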

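You may want to increase hl.maxAnalyzedChars too. By default the highlighter only analyzes the
beginning of the field, so hits deep inside large documents would not be tagged.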
<int name="hl.maxAnalyzedChars">147483647</int>
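
For illustration only, a request handler in solrconfig.xml that returns the whole document with the hits tagged might look roughly like this. The handler name "/hitdoc", the field name "content", and the span markup are assumptions; hl.fragsize=0 asks the highlighter to return the entire field value instead of snippets.

```xml
<!-- Hypothetical handler; adjust the name and field to your own core -->
<requestHandler name="/hitdoc" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <!-- "content" is an assumed field of type HTMLText -->
    <str name="hl.fl">content</str>
    <!-- 0 = no fragmenting, highlight the whole field value -->
    <int name="hl.fragsize">0</int>
    <int name="hl.maxAnalyzedChars">147483647</int>
    <!-- Markup wrapped around each hit; escaped because it lives inside XML -->
    <str name="hl.simple.pre">&lt;span class="hit"&gt;</str>
    <str name="hl.simple.post">&lt;/span&gt;</str>
  </lst>
</requestHandler>
```

You would then query this handler with the user's original query plus a filter on the document id, and render the highlighted field value in the browser.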
