lucene-solr-user mailing list archives

From Teague James <teag...@insystechinc.com>
Subject Re: Solr Basic Configuration - Highlight - Begginer
Date Thu, 17 Dec 2015 08:11:21 GMT
Erick's comments notwithstanding, there are some gaps in my understanding
of your precise situation. Here are a few things that weren't necessarily
obvious to me when I took my first try with Solr.

Highlighting is the end result of a good hit. It is essentially formatting
applied to your hit. It is possible to get a hit without a highlight if
certain conditions exist.

First, start by making sure you are indexing your target (a PDF file?)
correctly. Assuming you are indexing PDFs, are you extracting metadata
only, or are you parsing the document with Tika? If you want hits on the
contents of your PDF, then you have to parse it at index time and store
that. That was why I suggested just running some queries through the
Admin interface and the URL to see what Solr actually captured from your
indexed PDF before worrying about how it looks on the screen.
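
As a rough sketch of that kind of check, here is how such a query URL could
be assembled in Python. The helper name build_select_url is made up for
illustration, and the host, core name, term and id are simply the ones from
Evert's example; adjust them for your setup.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, term, doc_id=None):
    """Build a /select URL that returns what Solr stored for a search term."""
    params = [("q", term), ("wt", "json"), ("indent", "true")]
    if doc_id is not None:
        # fq narrows the result to one document, like fq=id:pdf1 in the thread
        params.append(("fq", f"id:{doc_id}"))
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Values taken from the example in this thread (assumptions for any other setup)
url = build_select_url("http://localhost:8983", "techproducts", "nietava", "pdf1")
print(url)
```

Paste the printed URL into a browser and check which fields actually contain
the text of the PDF.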

Next, you should look carefully at the Analyzer's output. Notice the
abbreviations to the left of the columns? Hover over those to see which
filter factory each one is. When a word is split into multiple columns at
one of those points, it indicates that the filter factory broke the word
apart while analyzing it. Do a search for the filter factories that you
find and read up on them. In my case "1a" was being split into 4 by a word
delimiter filter factory - "1a", "1", "a", "1a" - which caused highlighting
to fail while still getting a hit. It also caused erroneous hits
elsewhere. Adding some switches to the schema is all it took to correct
that for me. However, every case is different based on your needs. That is
why it is important to go through the analyzer and see if Solr's indexing
and querying are doing what you expect.
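
To make the splitting concrete, here is a toy Python model of what I saw. It
is not Solr's WordDelimiterFilterFactory and does not reproduce its real
rules; the function name and the flags are hypothetical stand-ins that only
mimic splitting a token at letter/digit boundaries.

```python
import re


def toy_word_delimiter(token, generate_parts=True, preserve_original=True,
                       catenate_all=True):
    """Toy model only: split a token into runs of letters and runs of digits."""
    parts = re.findall(r"[A-Za-z]+|[0-9]+", token)
    if len(parts) <= 1:
        # Nothing to split, so the token passes through unchanged
        return [token]
    tokens = []
    if preserve_original:
        tokens.append(token)            # keep the token as typed
    if generate_parts:
        tokens.extend(parts)            # emit each letter run and digit run
    if catenate_all:
        tokens.append("".join(parts))   # emit all the parts glued together
    return tokens


print(toy_word_delimiter("1a"))
print(toy_word_delimiter("1a", generate_parts=False))
print(toy_word_delimiter("nietava"))
```

With the parts switched on, "1a" turns into four tokens; with them switched
off, the separate "1" and "a" tokens go away, which is the kind of change
that fixed highlighting for me.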

If that looks good and you've got solid hits all the way down, then it is
time to start looking at your highlighter implementation alongside the
index and query analyzers that you are using. My original issue of not
being able to highlight phrases with one set of tags required me to switch
to the fast vector highlighter, which had its own requirements for certain
parameters to be set. Here again, going to the Solr docs and reading up on
the various highlighters will be helpful in most cases.
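
As a minimal sketch, assuming the field you search is also the field you
highlight, a highlighting request could be put together like this in Python.
The helper name build_highlight_url is hypothetical; the hl.* parameters are
the ones from the config I posted earlier in this thread, and the fast vector
highlighter also needs termVectors, termPositions and termOffsets on the
field.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_highlight_url(base_url, core, field, term, use_fvh=False):
    """Search one field and ask Solr to highlight that same field."""
    params = {
        "q": f"{field}:{term}",   # fielded search
        "wt": "json",
        "hl": "true",
        "hl.fl": field,           # highlight the field that was searched
    }
    if use_fvh:
        # Fast vector highlighter switches, as in the config quoted in the thread
        params["hl.useFastVectorHighlighter"] = "true"
        params["hl.tag.pre"] = "<b>"
        params["hl.tag.post"] = "</b>"
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Host, core, field and term are the ones from Evert's example
url = build_highlight_url("http://localhost:8983", "techproducts", "text",
                          "nietava", use_fvh=True)
query = parse_qs(urlsplit(url).query)
print(url)
print(query["hl.fl"], query["hl.tag.pre"])
```

urlencode takes care of escaping the angle brackets in the tags, so the URL
can be pasted straight into the address bar.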

Solr has a very steep learning curve. I've been using it for several years
and I still consider myself a noob. It can be a deep dive, but don't be
discouraged. Keep at it. Cheers!

-Teague

On Wed, Dec 16, 2015 at 8:54 PM, Evert R. <evert.ramos@gmail.com> wrote:

> Hi Erick and Teague,
>
>
> I found that when using the field 'text' it shows the pdf file result
> id:pdf1 in this case, like:
>
> http://localhost:8983/solr/techproducts/select?fq=id:pdf1&q=nietava
>
> but when highlight, using the text field...nothing comes up...
>
>
> http://localhost:8983/solr/techproducts/select?q=text:nietava&fq=id:pdf1&wt=json&indent=true&hl=true&hl.fl=text&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E
>
> or even with the option
>
> f.text.hl.snippets=2 under the hl.fl field.
>
>
> I tried as well with the standard configuration, did it all over, reindexed
> a couple times... and still did not work.
>
> Also,
>
> Using the Analysis, it brings below information:
>
> ST
> text     raw_bytes               start  end  positionLength  type        position
> nietava  [6e 69 65 74 61 76 61]  0      7    1               <ALPHANUM>  1
> SF
> text     raw_bytes               start  end  positionLength  type        position
> nietava  [6e 69 65 74 61 76 61]  0      7    1               <ALPHANUM>  1
> LCF
> text     raw_bytes               start  end  positionLength  type        position
> nietava  [6e 69 65 74 61 76 61]  0      7    1               <ALPHANUM>  1
>
>
> Alphanumeric I think... so, it's 'string', right? Would that be a problem?
> Should there be some other indication?
>
>
> Thanks again!
>
>
> *Evert*
>
> 2015-12-16 21:09 GMT-02:00 Erick Erickson <erickerickson@gmail.com>:
>
> > I think you're still missing the critical bit. Highlighting is
> > completely separate from searching. In other words, you can search on
> > one field and highlight another. What field is searched is governed by
> > the "qf" parameter when using edismax and by the "df" parameter
> > configured in your request handler in solrconfig.xml. These defaults
> > are overridden when you do a "fielded search" like
> >
> > q=content:nietava
> >
> > So this: q=content:nietava&hl=true&hl.fl=content
> > is searching the "content" field. The word you're looking for isn't in
> > the content field so naturally no docs are returned. And no
> > highlighting either.
> >
> > This: q=nietava&hl=true&hl.fl=content
> >
> > is searching somewhere else, thus getting the hit. We already know
> > that "nietava" is not in the content field because the first search
> > failed. You need to find out what field is being matched (probably
> > something like "text") and then try highlighting on _that_ field. Try
> > adding "debug=query" to the URL and look at the "parsed_query" section
> > of the return and you'll see what field(s) is/are actually being
> > searched against.
> >
> > NOTE: The field you highlight on _must_ have stored="true" in schema.xml.
> >
> > As to why "nietava" isn't being found in the content field, probably
> > you have some kind of analysis chain configured for that field that
> > isn't searching as you expect. See the admin/analysis page for some
> > insight into why that would be. The most frequent reason is that the
> > field is a "string" type which is not broken up into words. Another
> > possibility is that your analysis chain is leaving in the quotes or
> > something similar. As James says, looking at admin/analysis is a good
> > way to figure this out.
> >
> > I still strongly recommend you go from the stock techproducts example
> > and get familiar with how Solr (and highlighting) work before jumping
> > in and changing things. There are a number of ways things can be
> > mis-configured and trying to change several things at once is a fine
> > way to go mad. The admin UI>>schema browser is another way you can see
> > what kind of terms are _actually_ in your index in a particular field.
> >
> > Best,
> > Erick
> >
> >
> >
> >
> > On Wed, Dec 16, 2015 at 12:26 PM, Teague James <teaguej@insystechinc.com>
> > wrote:
> > > Sorry to hear that didn't work! Let me ask a couple of questions...
> > >
> > > Have you tried the analyzer inside of the Admin Interface? It has helped
> > > me sort out a number of highlighting issues in the past. To access it, go
> > > to your Admin interface, select your core, then select Analysis from the
> > > list of options on the left. In the analyzer, enter the term you are
> > > indexing (in other words the term in the document you are indexing that
> > > you expect to get a hit on) in the top left and right input fields. Select
> > > the field that it is destined for (in your case that would be 'content'),
> > > then hit analyze. Helps if you have a big screen!
> > >
> > > This will show you the impact of the various filter factories that you
> > > have engaged and their effect on whether or not a 'hit' is being generated.
> > > Hits are identified by a very faint highlight. (PSST... Developers... It
> > > would be really cool if the highlight color were more visible or
> > > customizable... Thanks y'all) If it looks like you're getting hits, but not
> > > getting highlighting, then open up a new tab with the Admin's query
> > > interface. Same place on the left as the analyzer. Replace the "*:*" with
> > > your search term (assuming you already indexed your document) and if
> > > necessary you can put something in the FQ like "id:123456" to target a
> > > specific record.
> > >
> > > Did you get a hit? If no, then it's not highlighting that's the issue.
> > > If yes, then try dumping this in your address bar (using your URL/IP,
> > > search term, and core name of course. The fq= is an example):
> > > http://[URL/IP]/solr/[CORE-NAME]/select?fq=id:123456&q="[SEARCH-TERM]"
> > >
> > > That will dump Solr's output to your browser where you can see exactly
> > > what is getting hit.
> > >
> > > Hope that helps! Let me know how it goes. Good luck.
> > >
> > > -Teague
> > >
> > > -----Original Message-----
> > > From: Evert R. [mailto:evert.ramos@gmail.com]
> > > Sent: Wednesday, December 16, 2015 1:46 PM
> > > To: solr-user <solr-user@lucene.apache.org>
> > > Subject: Re: Solr Basic Configuration - Highlight - Begginer
> > >
> > > Hi Teague!
> > >
> > > I configured the solrconfig.xml and schema.xml exactly the way you did,
> > > only replacing the word 'documentText' with 'content', which the
> > > techproducts sample uses. I reindexed through:
> > >
> > >  curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=pdf1&commit=true' -F "Emmanuel=@/home/solr/dados/teste/Emmanuel.pdf"
> > >
> > > with the same result.... no highlight in the response, as below:
> > >
> > > "highlighting": { "pdf1": {} }
> > >
> > > =(
> > >
> > > Really... do not know what to do...
> > >
> > > Thanks for your time. If you have any more suggestions about where I
> > > could be missing something... please let me know.
> > >
> > >
> > > Best regards,
> > >
> > > *Evert*
> > >
> > > 2015-12-16 15:30 GMT-02:00 Teague James <teaguej@insystechinc.com>:
> > >
> > >> Hi Evert,
> > >>
> > >> I recently needed help with phrase highlighting and was pointed to the
> > >> FastVectorHighlighter which worked out great. I just made a change to
> > >> the configuration to add generateWordParts="0" and
> > >> generateNumberParts="0" so that searches for things like "1a" would
> > >> get highlighted correctly. You may or may not need that feature. You
> > >> can always remove them or change the value to "1" to switch them on
> > >> explicitly. Anyway, hope this helps!
> > >>
> > >> solrconfig.xml (partial snip)
> > >> <requestHandler name="/select" class="solr.SearchHandler">
> > >>                 <lst name="defaults">
> > >>                         <str name="wt">xml</str>
> > >>                         <str name="echoParams">explicit</str>
> > >>                         <int name="rows">10</int>
> > >>                         <str name="df">documentText</str>
> > >>                         <str name="hl">on</str>
> > >>                         <str name="hl.fl">text</str>
> > >>                         <str name="hl.useFastVectorHighlighter">true</str>
> > >>                         <str name="hl.snippets">100</str>
> > >>                         <str name="hl.tag.pre"><![CDATA[<b>]]></str>
> > >>                         <str name="hl.tag.post"><![CDATA[</b>]]></str>
> > >>                 </lst>
> > >> </requestHandler>
> > >>
> > >> schema.xml (partial snip)
> > >>    <field name="id" type="string" indexed="true" stored="true"
> > >> required="true" multiValued="false" />
> > >>    <field name="documentText" type="text_general" indexed="true"
> > >> multiValued="true" termVectors="true" termOffsets="true"
> > >> termPositions="true" />
> > >>
> > >> <fieldType name="text_general" class="solr.TextField"
> > >> positionIncrementGap="100">
> > >>         <analyzer type="index">
> > >>                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>                 <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >> words="stopwords.txt" />
> > >>                 <filter class="solr.WordDelimiterFilterFactory"
> > >> catenateAll="1" preserveOriginal="1" generateNumberParts="0"
> > >> generateWordParts="0" />
> > >>                 <filter class="solr.SynonymFilterFactory"
> > >> synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
> > >>                 <filter class="solr.LowerCaseFilterFactory"/>
> > >>                 <filter class="solr.PorterStemFilterFactory"/>
> > >>                 <filter class="solr.ApostropheFilterFactory"/>
> > >>         </analyzer>
> > >>         <analyzer type="query">
> > >>                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>                 <filter class="solr.WordDelimiterFilterFactory"
> > >> catenateAll="1" preserveOriginal="1" generateWordParts="0" />
> > >>                 <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >> words="stopwords.txt" />
> > >>                 <filter class="solr.LowerCaseFilterFactory"/>
> > >>                 <filter class="solr.ApostropheFilterFactory"/>
> > >>         </analyzer>
> > >> </fieldType>
> > >>
> > >> -Teague
> > >>
> > >> From: Evert R. [mailto:evert.ramos@gmail.com]
> > >> Sent: Tuesday, December 15, 2015 6:25 AM
> > >> To: solr-user@lucene.apache.org
> > >> Subject: Solr Basic Configuration - Highlight - Begginer
> > >>
> > >> Hi there!
> > >>
> > >> It's my first installation, not sure if this is the right channel...
> > >>
> > >> Here are my steps:
> > >>
> > >> 1. Set up a basic install of solr 5.4.0
> > >>
> > >> 2. Create a new core through command line (bin/solr create -c test)
> > >>
> > >> 3. Post 2 files: 1 .docx and 2 .pdf (bin/post -c test /docs/test/)
> > >>
> > >> 4. Query over the browser and it brings the correct search, but it
> > >> does not show the part of the text I am querying, the highlight.
> > >>
> > >>   I have already flagged the 'hl' option. But still it does not work...
> > >>
> > >> Example: I am looking for the word 'peace' in my pdf file (book). I
> > >> have 4 matches for this word; it shows me the book name (pdf file) but
> > >> does not show which part of the text contains the word peace.
> > >>
> > >>
> > >> I am probably missing some configuration in schema.xml, which is
> > >> missing from my folder.... /solr/server/solr/test/conf/
> > >>
> > >> Or even the solrconfig.xml...
> > >>
> > >> I have read a bunch of things about highlight check these files,
> > >> copied the standard schema.xml to my core/conf folder, but still it
> > >> does not bring the highlight.
> > >>
> > >>
> > >> Attached a copy of my solrconfig.xml file.
> > >>
> > >>
> > >> I am very sorry for this, probably, dumb and too basic question...
> > >> First time I see solr in live.
> > >>
> > >>
> > >> Any help will be appreciated.
> > >>
> > >>
> > >>
> > >> Best regards,
> > >>
> > >>
> > >> Evert Ramos
> > >>
> > >> mailto:evert.ramos@gmail.com
> > >>
> > >>
> > >>
> > >
> >
>



-- 
Kind regards,

-Teague James
*Senior Web Applications Developer*
Insystech Inc.
teaguej@insystechinc.com
(703) 508-0008 (Cell)
