From "Mike Klaas" <>
Subject Re: highlighting/summarizing and solr
Date Thu, 22 Jun 2006 22:32:53 GMT
On 6/22/06, Chris Hostetter <> wrote:

> : It does seem like it would be easier for clients to parse document
> : associated data if it is included directly in the <doc> element.
> I actually like the idea that it's included separately ... it's really
> not that much harder to get at than if it's in the individual documents,
> and it makes it really easy to differentiate between "stored fields" of the
> document and "highlighted" info about the document .. especially if
> highlighting can be applied to non-stored fields using TermVectors.

I'm inclined to agree.  Note: term vectors are not sufficient in
themselves to produce a highlit fragment.  Highlighter does not have
the support.  It could be added, but as they do not include punctuation
or whitespace, and the tokens they produce aren't always
aesthetically pleasing (e.g. they may be stemmed words), the summaries
may look a little strange.

More useful would be to emit a list of document offsets rather than
summaries; these can be used by an external application to extract
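
To make the offsets idea concrete, here is a rough sketch of what a
client could do with them.  Everything in it is an assumption for
illustration: the Match class, the extract() method, and the <em>
markers are hypothetical names, not anything Solr or Highlighter
provides, and the offsets are taken to be [start, end) character
positions into the stored field text.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetFragments {
    /** Hypothetical holder for one match: [start, end) character offsets. */
    public static final class Match {
        final int start;
        final int end;

        public Match(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Builds one fragment per match: up to `context` characters on each
     * side of the match, with the matched text wrapped in pre/post markers.
     */
    public static List<String> extract(String text, List<Match> matches,
                                       int context, String pre, String post) {
        List<String> fragments = new ArrayList<>();
        for (Match m : matches) {
            // Clamp the context window to the bounds of the text.
            int from = Math.max(0, m.start - context);
            int to = Math.min(text.length(), m.end + context);
            fragments.add(text.substring(from, m.start)
                    + pre + text.substring(m.start, m.end) + post
                    + text.substring(m.end, to));
        }
        return fragments;
    }

    public static void main(String[] args) {
        String text = "Solr highlights matching terms.";
        int start = text.indexOf("matching");
        List<Match> matches = new ArrayList<>();
        matches.add(new Match(start, start + "matching".length()));
        for (String fragment : extract(text, matches, 5, "<em>", "</em>")) {
            System.out.println(fragment);
        }
    }
}
```

Since the client works from the original stored text, punctuation and
unstemmed words survive, and the markup is entirely the client's choice.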

> It also allows the highlighting section of the response to include a lot
> of extra data about the highlighted snippets, that would be cumbersome to
> try and fit into the <doc>.  I started hypothesizing down this road in
> this old message...
> ...but didn't really get to some of the crazier things you could do with
> it (like reporting back where in the document a snippet starts)

Something along these lines seems reasonable (that we came up with
near-identical schemas reinforces that).  I originally had a list per
field for multiple fragments as well, though scrapped it for

Does breaking down the highlit segments give significantly more power
to the user over simply allowing a custom Formatter?

> : I'm not sure if this is really the property of a field.
> : Another possibility is using init params in the request handler
> : defined in solrconfig.xml, with the possibility of overriding them in
> : a request.
> I agree with Yonik .. it might be useful if there was a "suggested
> highlighter configuration" at the Field/FieldType level ...  but this
> really seems like a RequestHandler config option to me (where the
> RequestHandler can decide whether to have a query time option to override
> its behavior).  That way you can have one instance of the
> XyzRequestHandler which does highlighting on the "title" field, and
> another instance with different init params that does highlighting on both
> the "title" and "summary" fields, and another with different init params
> that does summarizing/highlighting across the title/summary and body
> fields only returning the most relevant snippets (where there can be
> snippet weighting based on field importance or something)
> those should all be up to the person configuring the way the queries work
> -- not the guy designing the schema.

Not unreasonable.  Any objections to augmenting StandardRequestHandler
with the ability to store config-time param defaults (as DisMax does
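
For the sake of discussion, the override behaviour could look roughly
like the sketch below.  This is not the actual handler code: the
HighlightParams class, the resolve() method, and the parameter names
("highlightFields", "maxSnippets") are all made up for illustration.
It only shows config-time defaults being overridden per request.

```java
import java.util.HashMap;
import java.util.Map;

public class HighlightParams {
    /**
     * Merges config-time defaults with request-time params.
     * A value supplied in the request wins over the configured default;
     * anything the request leaves out falls back to the default.
     */
    public static Map<String, String> resolve(Map<String, String> defaults,
                                              Map<String, String> request) {
        Map<String, String> merged = new HashMap<>(defaults);
        merged.putAll(request);
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical defaults, as they might be read from init params.
        Map<String, String> defaults = new HashMap<>();
        defaults.put("highlightFields", "title");
        defaults.put("maxSnippets", "1");

        // Hypothetical request that overrides only the field list.
        Map<String, String> request = new HashMap<>();
        request.put("highlightFields", "title,summary");

        Map<String, String> effective = resolve(defaults, request);
        System.out.println(effective.get("highlightFields"));
        System.out.println(effective.get("maxSnippets"));
    }
}
```

Two handler instances with different defaults would then behave
differently without any change to the schema.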

> Assuming that's an invariant, you could add an option to the request
> handler to use a custom analyzer for the purposes of highlighting stored
> fields (independent of the field type) ... that doesn't really help the
> TermVectors situation, but assuming that invariant the only thing that
> can help you here is using an indexing analyzer that doesn't produce
> multiple tokens at the same position.

It's actually less of a problem with term vectors as their use by
Highlighter chooses only one token among the possibilities.  I'll see
if I can get that fixed in Lucene.

Should I submit a patch as a starting point?

