lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Odp.: solr issue with pdf forms
Date Fri, 24 Apr 2015 18:15:00 GMT
Steve:

Right, it's not exactly obvious. Bring up the admin UI, something like
http://localhost:8983/solr. From there you have to select a core in
the 'core selector' drop-down on the left side. If you're using
SolrCloud, this will have a rather strange name, but it should be easy
to identify what collection it belongs to.

At that point you'll see a bunch of new options, among them "schema
browser". From there, select your field from the drop-down that
appears; a "load term info" button should then show up.

NOTE: you can get the same information from the TermsComponent, see:
https://cwiki.apache.org/confluence/display/solr/The+Terms+Component.
This is a little more flexible because you can, among other things,
specify the place to start. In your case you might specify
terms.prefix=mein, which will show you the terms that are actually
being _searched_ as opposed to the stored values. The stored values
are what you see in the browser when you search for docs, and they
can be misleading, as you're (probably) finding.
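
As a rough illustration of the TermsComponent request described above,
here is a minimal Python sketch that only builds the request URL (it
does not contact Solr). The core name "collection1" and the field name
"content" are placeholders, not taken from this thread, and the sketch
assumes a /terms request handler is configured in solrconfig.xml, as
in the stock example config.

```python
from urllib.parse import urlencode


def terms_url(base_url, core, field, prefix, limit=20):
    """Build a TermsComponent request URL; no network access happens here."""
    params = {
        "terms.fl": field,       # field whose indexed terms to list
        "terms.prefix": prefix,  # only return terms starting with this prefix
        "terms.limit": limit,    # maximum number of terms to return
        "wt": "json",            # response format
    }
    return f"{base_url}/{core}/terms?{urlencode(params)}"


if __name__ == "__main__":
    # "collection1" and "content" are hypothetical names; use your own.
    print(terms_url("http://localhost:8983/solr", "collection1", "content", "mein"))
```

Pasting the printed URL into a browser (or fetching it with curl)
should list the indexed terms that start with "mein", which can then
be compared with the stored text shown in search results.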

Best,
Erick

On Fri, Apr 24, 2015 at 1:58 AM,  <Steve.Scholl@t-systems.com> wrote:
> Hey Erick,
>
> thanks a lot for your answer. I went to the admin schema browser, but
> what should I see there? Sorry, I'm not familiar with the admin schema
> browser. :-(
>
> Best
> Steve
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Thursday, 23 April 2015 18:00
> To: solr-user@lucene.apache.org
> Subject: Re: Odp.: solr issue with pdf forms
>
> When you say "they're not indexed correctly", what's your evidence?
> You cannot rely on the display in the browser; that's the raw input
> just as it was sent to Solr, _not_ the actual tokens in the index.
> What do you see when you go to the admin schema browser page and load
> the actual tokens?
>
> Or use the TermsComponent
> (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
> to see the actual terms in the index as opposed to the stored data you
> see in the browser when you look at search results.
>
> If the actual terms don't seem right _in the index_ we need to see
> your analysis chain, i.e. your fieldType definition.
>
> I'm 90% sure you're seeing the stored data and your terms are indexed
> just fine, but I've certainly been wrong before, more times than I
> want to remember.....
>
> Best,
> Erick
>
> On Thu, Apr 23, 2015 at 1:18 AM,  <Steve.Scholl@t-systems.com> wrote:
>> Hey Erick,
>>
>> thanks for your answer. They are not indexed correctly. Also, through
>> the Solr admin interface I see these typical question marks within a
>> rhombus where a blank space should be.
>> I now figured out the following (not sure if it is relevant at all):
>> - PDF documents created with "Acrobat PDFMaker 10.0 for Word" are
>> indexed correctly, no issues
>> - PDF documents (with editable form fields) created with "Adobe
>> InDesign CS5 (7.0.1)"  are indexed with the blank space issue
>>
>> Best
>> Steve
>>
>> -----Original Message-----
>> From: Erick Erickson [mailto:erickerickson@gmail.com]
>> Sent: Wednesday, 22 April 2015 17:11
>> To: solr-user@lucene.apache.org
>> Subject: Re: Odp.: solr issue with pdf forms
>>
>> Are they not _indexed_ correctly or not being displayed correctly?
>> Take a look at admin UI>>schema browser>> your field and press the
>> "load terms" button. That'll show you what is _in_ the index as
>> opposed to what the raw data looked like.
>>
>> When you return the field in a Solr search, you get a verbatim,
>> un-analyzed copy of your original input. My guess is that your
>> browser isn't using a compatible character encoding for display.
>>
>> Best,
>> Erick
>>
>> On Wed, Apr 22, 2015 at 7:08 AM,  <Steve.Scholl@t-systems.com> wrote:
>>> Thanks for your answer. Maybe my English is not good enough, what
>>> are you trying to say? Sorry I didn't get the point.
>>> :-(
>>>
>>>
>>> -----Original Message-----
>>> From: LAFK [mailto:tomasz.borek@gmail.com]
>>> Sent: Wednesday, 22 April 2015 14:01
>>> To: solr-user@lucene.apache.org; solr-user@lucene.apache.org
>>> Subject: Odp.: solr issue with pdf forms
>>>
>>> Off the top of my head, I'd look into how writable PDFs are created and encoded.
>>>
>>> @LAFK_PL
>>>   Original message
>>> From: Steve.Scholl@t-systems.com
>>> Sent: Wednesday, 22 April 2015 12:41
>>> To: solr-user@lucene.apache.org
>>> Reply-To: solr-user@lucene.apache.org
>>> Subject: solr issue with pdf forms
>>>
>>> Hi guys,
>>>
>>> hopefully you can help me with my issue. We are using a solr setup
>>> and have the following issue:
>>> - usual pdf files are indexed just fine
>>> - pdf files with writable form-fields look like this:
>>> Ich bestätige mit meiner Unterschrift, dass alle Angaben korrekt und
>>> v ollständig sind
>>>
>>> Somehow the blank space character is not indexed correctly.
>>>
>>> Is this a known issue? Does anybody have an idea?
>>>
>>> Thanks a lot
>>> Best
>>> Steve
