lucene-solr-user mailing list archives

From Arturas Mazeika <maze...@gmail.com>
Subject Re: querying vs. highlighting: complete freedom?
Date Mon, 26 Mar 2018 07:12:44 GMT
Hi Erick,

Adding a field qualifier to the hl.q parameter solved the issue. My
excitement is going through the roof! What a thorough answer: it explains
how Solr behaves and how it tries to interpret what I mean when I supply
a keyword without the field qualifier. Very impressive.
Would you care to (re)post this answer to Stack Overflow? If that is too
much of a hassle, I'll do it myself in a couple of days on your behalf.
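
For the record, a minimal sketch of the request that worked for me, with the field qualifier on hl.q. The host, core and field names are the ones from my earlier mails in this thread; the helper name select_url is just made up for illustration:

```python
from urllib.parse import urlencode

def select_url(base, core, query, hl_field, hl_term, snippets=3, rows=1):
    """Build a /select URL whose hl.q is qualified with the highlight field."""
    params = {
        "q": query,
        "hl": "true",
        "hl.fl": hl_field,
        # Field-qualified: hl.q=trans:Kündigung rather than hl.q=Kündigung,
        # so the term is analyzed with the chain of hl_field, not of "df".
        "hl.q": f"{hl_field}:{hl_term}",
        "hl.snippets": snippets,
        "wt": "xml",
        "rows": rows,
    }
    return f"{base}/{core}/select?{urlencode(params)}"

url = select_url("http://localhost:8983/solr", "trans",
                 "trans:Zeit", "trans", "Kündigung")
print(url)
```

urlencode percent-encodes the colon and the umlaut, which Solr decodes again on its side.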

I am impressed by how well, thoroughly, quickly and fully the question was
answered.

Stefan's hint pushed me further in this direction: he suggested using the
query part of Solr to filter and sort out the relevant answers in the 1st
step, and in the 2nd step he'd highlight all the keywords using CTRL+F (in
the browser or some alternative viewer). This brought me to the next
question:

How can one match query terms against documents that have been run through
the analysis chain, in an efficient and distributed manner? My current
understanding of how to achieve this is the following:

1. Get the list of ids (and contents) of the documents that match the query
2. Use the analysis screen at http://localhost:8983/solr/#/trans/analysis to
re-analyze the document and the query
3. Map the query terms to substrings of the original text by following the
tokens through to the last tokenizer/filter of the analysis chain
4. Emulate CTRL+F highlighting
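
Step 4 could look roughly like the sketch below. It assumes the character offsets of the matching tokens have already been pulled out of the analysis response (that parsing is not shown); here they are simply looked up in the text to keep the example self-contained. The function name highlight and the <em> markers are my own choices:

```python
def highlight(text, spans, pre="<em>", post="</em>"):
    """Wrap each (start, end) character span of text in pre/post markers."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor:
            continue  # skip spans overlapping one already emitted
        pieces.append(text[cursor:start])
        pieces.append(pre + text[start:end] + post)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)

text = "who developed the theory of relativity"
# Stand-in for offsets that would come from the analysis step.
spans = []
for word in ("theory", "relativity"):
    start = text.index(word)
    spans.append((start, start + len(word)))

highlighted = highlight(text, spans)
print(highlighted)
```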

The Solr web interface offers quite a bit to advance towards this goal. If
one sends a request with these parameters:

* analysis.fieldvalue=Albert Einstein (14 March 1879 – 18 April 1955) was a
German-born theoretical physicist[5] who developed the theory of
relativity, one of the two pillars of modern physics (alongside quantum
mechanics).&
* analysis.query=reletivity theory

to one of the cores of solr, one gets the steps 1-3 done:

http://localhost:8983/solr/trans_shard1_replica_n1/analysis/field?wt=xml&analysis.showmatch=true&analysis.fieldvalue=Albert%20Einstein%20(14%20March%201879%20%E2%80%93%2018%20April%201955)%20was%20a%20German-born%20theoretical%20physicist[5]%20who%20developed%20the%20theory%20of%20relativity,%20one%20of%20the%20two%20pillars%20of%20modern%20physics%20(alongside%20quantum%20mechanics).&analysis.query=reletivity%20theory&analysis.fieldtype=text_en
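
The same URL can be assembled programmatically. This is only a sketch: the parameter names are the ones from the request above, the field value is shortened, and the helper name analysis_url is made up. It only builds the URL and does not send it:

```python
from urllib.parse import urlencode, quote

def analysis_url(base, core, field_type, field_value, query):
    """Build a field-analysis URL for one core (no request is sent)."""
    params = {
        "wt": "xml",
        "analysis.showmatch": "true",
        "analysis.fieldvalue": field_value,
        "analysis.query": query,
        "analysis.fieldtype": field_type,
    }
    # quote_via=quote encodes spaces as %20, as in the URL above.
    query_string = urlencode(params, quote_via=quote)
    return f"{base}/{core}/analysis/field?{query_string}"

url = analysis_url(
    "http://localhost:8983/solr",
    "trans_shard1_replica_n1",
    "text_en",
    "who developed the theory of relativity",  # shortened field value
    "reletivity theory",
)
print(url)
```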

Questions:

1. Is there a way to "load-balance" this? In the above URL, I need to
specify a particular core. Is it possible to generalize it, so that the core
that receives the request is not necessarily the one that processes it? Or
is this already distributed, in the sense that the receiving core and the
processing cores are never the same?

2. The document has already been run through the analysis chain. Is it
possible to store this information so one does not need to re-analyze it?

Cheers
Arturas

On Fri, Mar 23, 2018 at 9:15 PM, Erick Erickson <erickerickson@gmail.com>
wrote:

> Arturas:
>
> Try to field-qualify your hl.q parameter. That looks like:
>
> hl.q=trans:Kundigung
> or
> hl.q=trans:Kündigung
>
> I saw the exact behavior you describe when I did _not_ specify the
> field in the hl.q parameter, i.e.
>
> hl.q=Kundigung
> or
> hl.q=Kündigung
>
> didn't show all highlights.
>
> But when I did specify the field, it worked.
>
> Here's what I think is happening: Solr uses the default search
> field when parsing an un-field-qualified query. I.e.
>
> q=something
>
> is parsed as
>
> q=default_search_field:something.
>
> The default field is controlled in solrconfig.xml with the "df"
> parameter, you'll see entries like:
> <str name="df">my_field</str>
>
> Also when I changed the "df" parameter to the field I was highlighting
> on, I didn't need to specify the field on the hl.q parameter.
>
> hl.q=Kundigung
> or
> hl.q=Kündigung
>
> The default field is usually "text", which knows nothing about
> the German-specific filters you've applied unless you changed it.
>
> So in the absence of a field-qualification for the hl.q parameter Solr
> was parsing the query according to the analysis chain specified
> in your default field, and probably passed ü through without
> transforming it. Since your indexing analysis chain for that field
> folded ü to just plain u, it wasn't found or highlighted.
>
> On the surface, this does seem like something that should be
> changed, I'll go ahead and ping the dev list.
>
> NOTE: I was trying this on Solr 7.1
>
> Best,
> Erick
>
> On Fri, Mar 23, 2018 at 12:03 PM, Arturas Mazeika <mazeika@gmail.com>
> wrote:
> > Hi Erick,
> >
> > Thanks for the update and the info. Your post brought quite a bit of light
> > into the picture, and now I understand quite a bit more about what you are
> > saying. Your explanation makes sense and can be quite useful in certain
> > scenarios.
> >
> > What struck me in your description is that you are saying that the
> > analyzer chain needs to be applied to the highlighting queries as well.
> > The tragedy is that I am not able to get this for a German collection: if
> > the query is set (no explicit highlighting query), the highlighting is
> > correct. It is also correct if I replace the umlauts with the
> > corresponding Latin characters. Getting the analyzer chain applied to the
> > highlighting terms remains the challenge.
> >
> > Could you have a look at the following Stack Overflow link? Maybe
> > something comes to mind...
> >
> > https://stackoverflow.com/questions/49276093/solr-highlighting-terms-with-umlaut-not-found-not-highlighted
> >
> > Cheers,
> >
> > Arturas
> > On Fri, Mar 23, 2018, 17:43 Erick Erickson <erickerickson@gmail.com> wrote:
> >
> >> bq: this is not a typical case that one searches for a keyword but
> >> highlights something else
> >>
> >> This isn't really an unusual case; apparently I misled you.
> >>
> >> What I was trying to convey is that the analysis chain used is firmly
> >> attached to a particular _field_. There's no way to say "use one
> >> analysis chain for the query and another for highlighting on the
> >> _same_ field".
> >>
> >> You can use two different fields with different analysis chains, one
> >> for each purpose. So something like
> >>
> >> q=f1:something&hl.fl=f2,f3&hl.q=other
> >>
> >> is certainly reasonable. It'll search for "something" in f1, and
> >> highlight "other" in f2 and f3
> >>
> >> Each field processes its input with the analysis chain defined in the
> >> schema.
> >>
> >> The rest about stored="true" can be ignored, it's just me wandering
> >> off into the weeds about an optimization that only stores the data
> >> once rather than redundantly in multiple fields.
> >>
> >> Best,
> >> Erick
> >>
> >> On Fri, Mar 23, 2018 at 4:37 AM, Arturas Mazeika <mazeika@gmail.com>
> >> wrote:
> >> > Hi Mathesis (Stefan),
> >> >
> >> > Thanks for the questions. This made me look at the problem from a
> >> > distance and re-frame the situation. Good questions indeed.
> >> >
> >> > To approach it from another angle: consider a user who describes herself
> >> > as a BMW fan and is convinced (for the sake of argument) that every BMW
> >> > needs to be the blackest color possible. She would like to search and
> >> > later browse the entries in a discussion forum (not everything, of
> >> > course, only BMWs of the blackest color), and what interests her are the
> >> > snippets that contain "understood", "craziest" or the like as keywords
> >> > (because she is looking for a dozen discussions that she saw before).
> >> >
> >> > What I have not been able to achieve so far is: (i) combining query
> >> > terms for filtering and highlighting, and (ii) using the analyzer chain
> >> > from the attribute to rewrite the highlight query (or defining one in
> >> > the search).
> >> >
> >> > The CTRL+F technique is a very powerful one, indeed. It works most of
> >> > the time. The difficulties with it are query rewriting, enriching, etc.
> >> >
> >> > Cheers,
> >> > Arturas
> >> >
> >> > On Fri, Mar 23, 2018 at 11:29 AM, Stefan Matheis <matheis.stefan@gmail.com>
> >> > wrote:
> >> >
> >> >> Perhaps we try it the other way round... what's your use case for this?
> >> >> I'm trying to think of a situation where I'd need this as a user.
> >> >>
> >> >> The only reason I see myself doing this is CTRL+F in a page when the
> >> >> search result is not immediately visible to me ;)
> >> >>
> >> >> On Mar 23, 2018 9:41 AM, "Arturas Mazeika" <mazeika@gmail.com> wrote:
> >> >>
> >> >> > Hi Erick et al,
> >> >> >
> >> >> > From your answer I understand that this is not a typical case, that one
> >> >> > searches for a keyword but highlights something else. Since we have two
> >> >> > parameters (q vs. hl.q), I thought they were freely combinable. From your
> >> >> > answer I understand that this is not really the case. My current
> >> >> > understanding came from [1], which says:
> >> >> >
> >> >> > hl.q
> >> >> >
> >> >> > A query to use for highlighting. This parameter allows you to highlight
> >> >> > different terms than those being used to retrieve documents.
> >> >> >
> >> >> > What I hear from you is something different, i.e., that it is not enough
> >> >> > just to combine q with hl.q, and that there are caveats to achieving the
> >> >> > task (multiple fields, FastVectorHighlighter).
> >> >> >
> >> >> > Your infos are very helpful.
> >> >> >
> >> >> > Cheers,
> >> >> > Arturas
> >> >> >
> >> >> > [1]  https://lucene.apache.org/solr/guide/7_2/highlighting.html
> >> >> >
> >> >> > On Thu, Mar 22, 2018 at 4:07 PM, Erick Erickson <erickerickson@gmail.com>
> >> >> > wrote:
> >> >> >
> >> >> > > Basically you need to use a copyField, but in several variants:
> >> >> > >
> >> >> > > If you use the field _exclusively_ for highlighting then store the raw
> >> >> > > content there and have the field use whatever analyzer you want. You
> >> >> > > do _not_ need to have indexed="true" set for the field if you're
> >> >> > > highlighting on the fly. So you're searching against field1 (which has
> >> >> > > indexed="true" stored="false" set) but highlighting against field2
> >> >> > > (which has indexed="false" stored="true" set). Of course any time you
> >> >> > > want to return the contents in a doc your fl needs to specify
> >> >> > > field2...
> >> >> > >
> >> >> > > The above does not bloat your index at all since the cost of
> >> >> > > stored="true" indexed="true" is the same as if you use two fields,
> >> >> > > each with only one option turned on.
> >> >> > >
> >> >> > > The second approach if you want to use FastVectorHighlighter or the
> >> >> > > like is simply to index both fields.
> >> >> > >
> >> >> > > Best,
> >> >> > > Erick
> >> >> > >
> >> >> > > On Thu, Mar 22, 2018 at 2:18 AM, Arturas Mazeika <mazeika@gmail.com>
> >> >> > > wrote:
> >> >> > > > Hi Solr-Users,
> >> >> > > >
> >> >> > > > I've been playing with a German collection of documents, where I tried
> >> >> > > > to search for one word (q=Tag) and highlight another (hl.q=Kundigung).
> >> >> > > > Is this a "legal" use case? My key question is: how can I tell Solr
> >> >> > > > which query analyzer to use for highlighting? Strictly speaking, I
> >> >> > > > should use hl.q=Kündigung to conceptually look for relevant
> >> >> > > > information, but in this case no highlighting is returned (as all
> >> >> > > > umlauts are left out in the index).
> >> >> > > >
> >> >> > > > Additional infos:
> >> >> > > >
> >> >> > > > solr version: 7.2
> >> >> > > > urls to query:
> >> >> > > >
> >> >> > > > http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=Kundigung&hl.snippets=3&wt=xml&rows=1
> >> >> > > >
> >> >> > > > http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=K%C3%BCndigung&hl.snippets=3&wt=xml&rows=1
> >> >> > > >
> >> >> > > > Managed-schema:
> >> >> > > >
> >> >> > > >   <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
> >> >> > > >     <analyzer>
> >> >> > > >       <tokenizer class="solr.StandardTokenizerFactory"/>
> >> >> > > >       <filter class="solr.LowerCaseFilterFactory"/>
> >> >> > > >       <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
> >> >> > > >       <filter class="solr.GermanNormalizationFilterFactory"/>
> >> >> > > >       <filter class="solr.GermanLightStemFilterFactory"/>
> >> >> > > >     </analyzer>
> >> >> > > >   </fieldType>
> >> >> > > >
> >> >> > > >
> >> >> > > > Other additional infos:
> >> >> > > > https://stackoverflow.com/questions/49276093/solr-highlighting-terms-with-umlaut-not-found-not-highlighted
> >> >> > > >
> >> >> > > > Cheers,
> >> >> > > > Arturas
> >> >> > >
> >> >> >
> >> >>
> >>
>
