lucene-solr-user mailing list archives

From <Steve.Sch...@t-systems.com>
Subject AW: Odp.: solr issue with pdf forms
Date Tue, 28 Apr 2015 20:04:30 GMT
Thanks a lot for being patient with me. Unfortunately there is no button "load term info".
:-(
Could you maybe help me use the TermsComponent instead? I read that it is configured by default.

Thanks a lot
Best
Steve

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Monday, 27 April 2015 17:23
To: solr-user@lucene.apache.org
Subject: Re: Odp.: solr issue with pdf forms

We're still not quite there. There should be a "load term info" button on that page. Clicking
that button will show you the terms in your index (as opposed to the raw stored input which
is what you get when you look at results in the browser). My bet is that you'll see perfectly
normal tokens in the index that will NOT have the wonky characters you see in the display.

If that's the case, then you have a browser issue, Solr is working perfectly fine. On the
other hand, if the individual terms are weird, then you have something more fundamental going
on.
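
One way to judge whether individual terms are "weird" is to look at their code points rather than at how a browser renders them. The following is only a rough sketch: the function name is made up for illustration, and it assumes the question mark in a rhombus is U+FFFD (the Unicode replacement character), which is how that glyph is normally produced. It also flags control, format, private-use and unassigned characters.

```python
import unicodedata

# Unicode general categories that rarely belong inside an indexed token:
# Cc = control, Cf = format, Co = private use, Cn = unassigned.
SUSPICIOUS_CATEGORIES = {"Cc", "Cf", "Co", "Cn"}


def suspicious_chars(term):
    """Return the code points in `term` that look like encoding damage."""
    found = []
    for ch in term:
        if ch == "\ufffd" or unicodedata.category(ch) in SUSPICIOUS_CATEGORIES:
            found.append(f"U+{ord(ch):04X}")
    return found


if __name__ == "__main__":
    # Made-up sample terms, not taken from a real index.
    for term in ["vollständig", "v\ufffdollständig", "unter\u00adschrift"]:
        print(repr(term), suspicious_chars(term))
```

If every term comes back with an empty list, the index is probably fine and the problem is on the display side.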

Which is why I mentioned the TermsComponent. That will return indexed tokens, and allows you
a bit more flexibility than the admin page in terms of what tokens you see, but it's essentially
the same information.
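
For what it's worth, a TermsComponent request could be put together roughly as below. This is a sketch under assumptions, not a tested recipe for your installation: it assumes a `/terms` request handler is registered in solrconfig.xml, `mycore` is a placeholder core name (a single-core setup may have no core segment in the URL at all), and the helper names are invented. `terms.fl`, `terms.prefix` and `terms.limit` are the TermsComponent parameters. The JSON layout of the reply differs between Solr versions, so the parser accepts both the nested-object form and the flat-list form.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_terms_url(base_url, field, prefix="", limit=20):
    """Build a TermsComponent URL for `field`, optionally starting at `prefix`."""
    params = {"terms": "true", "terms.fl": field, "terms.limit": limit, "wt": "json"}
    if prefix:
        params["terms.prefix"] = prefix
    return f"{base_url.rstrip('/')}/terms?{urlencode(params)}"


def parse_terms(response, field):
    """Return [(term, count), ...] for `field` from a decoded JSON response."""
    terms = response.get("terms", {})
    if isinstance(terms, list):
        # Flat form: [field1, [...], field2, [...]]
        terms = dict(zip(terms[0::2], terms[1::2]))
    values = terms.get(field, [])
    if isinstance(values, dict):
        return list(values.items())
    # Flat form: [term1, count1, term2, count2, ...]
    return list(zip(values[0::2], values[1::2]))


def fetch_terms(base_url, field, prefix="", limit=20):
    """Query Solr and return the indexed terms (needs a running Solr instance)."""
    with urlopen(build_terms_url(base_url, field, prefix, limit)) as resp:
        return parse_terms(json.load(resp), field)


if __name__ == "__main__":
    # Placeholder base URL; adjust host, port and core name to your setup.
    print(build_terms_url("http://localhost:8983/solr/mycore", "content", prefix="mein"))
```

Pasting the printed URL into a browser should work just as well as calling `fetch_terms`; the point is only to see the indexed tokens for the "content" field.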

Best,
Erick

On Sun, Apr 26, 2015 at 11:18 PM,  <Steve.Scholl@t-systems.com> wrote:
> Erick,
>
> thanks a lot for helping me here. In my case it is the "content" field which is not
> displayed correctly. So I went to the schema browser like you pointed out. Here is the
> information I found:
> Field: content
> Field Type: text
> Properties:  Indexed, Tokenized, Stored, TermVector Stored
> Schema:  Indexed, Tokenized, Stored, TermVector Stored
> Index:  Indexed, Tokenized, Stored, TermVector Stored
> Copied Into:  spell teaser
> Position Increment Gap:  100
>
> Index Analyzer:  org.apache.solr.analysis.TokenizerChain Details
> Tokenizer Class:  org.apache.solr.analysis.WhitespaceTokenizerFactory
> Filters:
>   org.apache.solr.analysis.WordDelimiterFilterFactory
>     args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
>     catenateWords: 1 luceneMatchVersion: LUCENE_36 generateWordParts: 1
>     catenateAll: 0 catenateNumbers: 1 }
>   org.apache.solr.analysis.LowerCaseFilterFactory
>     args:{luceneMatchVersion: LUCENE_36 }
>   org.apache.solr.analysis.SynonymFilterFactory
>     args:{synonyms: german/synonyms.txt expand: true ignoreCase: true
>     luceneMatchVersion: LUCENE_36 }
>   org.apache.solr.analysis.DictionaryCompoundWordTokenFilterFactory
>     args:{maxSubwordSize: 15 onlyLongestMatch: false minSubwordSize: 4
>     minWordSize: 5 dictionary: german/german-common-nouns.txt
>     luceneMatchVersion: LUCENE_36 }
>   org.apache.solr.analysis.StopFilterFactory
>     args:{words: german/stopwords.txt ignoreCase: true
>     enablePositionIncrements: true luceneMatchVersion: LUCENE_36 }
>   org.apache.solr.analysis.GermanNormalizationFilterFactory
>     args:{luceneMatchVersion: LUCENE_36 }
>   org.apache.solr.analysis.SnowballPorterFilterFactory
>     args:{protected: german/protwords.txt language: German2
>     luceneMatchVersion: LUCENE_36 }
>   org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
>     args:{luceneMatchVersion: LUCENE_36 }
>
> Query Analyzer:  org.apache.solr.analysis.TokenizerChain Details
> Tokenizer Class:  org.apache.solr.analysis.WhitespaceTokenizerFactory
> Filters:
>   org.apache.solr.analysis.WordDelimiterFilterFactory
>     args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
>     catenateWords: 0 luceneMatchVersion: LUCENE_36 generateWordParts: 1
>     catenateAll: 0 catenateNumbers: 0 }
>   org.apache.solr.analysis.LowerCaseFilterFactory
>     args:{luceneMatchVersion: LUCENE_36 }
>   org.apache.solr.analysis.StopFilterFactory
>     args:{words: german/stopwords.txt ignoreCase: true
>     enablePositionIncrements: true luceneMatchVersion: LUCENE_36 }
>   org.apache.solr.analysis.GermanNormalizationFilterFactory
>     args:{luceneMatchVersion: LUCENE_36 }
>   org.apache.solr.analysis.SnowballPorterFilterFactory
>     args:{protected: german/protwords.txt language: German2
>     luceneMatchVersion: LUCENE_36 }
>   org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
>     args:{luceneMatchVersion: LUCENE_36 }
>
> Distinct:  160403
>
> Does this somehow help to figure out the issue?
> Thanks
> Best
> Steve
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Friday, 24 April 2015 20:15
> To: solr-user@lucene.apache.org
> Subject: Re: Odp.: solr issue with pdf forms
>
> Steve:
>
> Right, it's not exactly obvious. Bring up the admin UI, something like
> http://localhost:8983/solr. From there you have to select a core in the 'core selector'
> drop-down on the left side. If you're using SolrCloud, this will have a rather strange
> name, but it should be easy to identify what collection it belongs to.
>
> At that point you'll see a bunch of new options, among them "schema browser". From there,
> select your field from the drop-down that will appear, then a button should pop up "load
> term info".
>
> NOTE: you can get the same information from the TermsComponent, see:
> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component.
> This is a little more flexible because you can, among other things, specify the place
> to start. In your case you might specify terms.prefix=mein which will show you the terms
> that are actually being _searched_ as opposed to being stored. This latter is what you
> see in the browser when you search for docs and is sometimes misleading as you're
> (probably) seeing.
>
> Best,
> Erick
>
> On Fri, Apr 24, 2015 at 1:58 AM,  <Steve.Scholl@t-systems.com> wrote:
>> Hey Erick,
>>
>> thanks a lot for your answer. I went to the admin schema browser, but
>> what should I see there? Sorry, I'm not familiar with the admin schema
>> browser. :-(
>>
>> Best
>> Steve
>>
>>
>> -----Original Message-----
>> From: Erick Erickson [mailto:erickerickson@gmail.com]
>> Sent: Thursday, 23 April 2015 18:00
>> To: solr-user@lucene.apache.org
>> Subject: Re: Odp.: solr issue with pdf forms
>>
>> When you say "they're not indexed correctly", what's your evidence?
>> You cannot rely on the display in the browser; that's the raw input just as it
>> was sent to Solr, _not_ the actual tokens in the index. What do you see when you
>> go to the admin schema browser page and load the actual tokens?
>>
>> Or use the TermsComponent
>> (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
>> ) to see the actual terms in the index as opposed to the stored data 
>> you see in the browser when you look at search results.
>>
>> If the actual terms don't seem right _in the index_, we need to see your analysis
>> chain, i.e. your fieldType definition.
>>
>> I'm 90% sure you're seeing the stored data and your terms are indexed just fine,
>> but I've certainly been wrong before, more times than I want to remember.
>>
>> Best,
>> Erick
>>
>> On Thu, Apr 23, 2015 at 1:18 AM,  <Steve.Scholl@t-systems.com> wrote:
>>> Hey Erick,
>>>
>>> thanks for your answer. They are not indexed correctly. Also through the Solr
>>> admin interface I see these typical question marks within a rhombus where a blank
>>> space should be.
>>> I now figured out the following (not sure if it is relevant at all):
>>> - PDF documents created with "Acrobat PDFMaker 10.0 for Word" are 
>>> indexed correctly, no issues
>>> - PDF documents (with editable form fields) created with "Adobe 
>>> InDesign CS5 (7.0.1)"  are indexed with the blank space issue
>>>
>>> Best
>>> Steve
>>>
>>> -----Original Message-----
>>> From: Erick Erickson [mailto:erickerickson@gmail.com]
>>> Sent: Wednesday, 22 April 2015 17:11
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Odp.: solr issue with pdf forms
>>>
>>> Are they not _indexed_ correctly or not being displayed correctly?
>>> Take a look at admin UI > schema browser > your field and press the
>>> "load terms" button. That'll show you what is _in_ the index as opposed to what
>>> the raw data looked like.
>>>
>>> When you return the field in a Solr search, you get a verbatim, un-analyzed copy
>>> of your original input. My guess is that your browser isn't using a compatible
>>> character encoding for display.
>>>
>>> Best,
>>> Erick
>>>
>>> On Wed, Apr 22, 2015 at 7:08 AM,  <Steve.Scholl@t-systems.com> wrote:
>>>> Thanks for your answer. Maybe my English is not good enough; what are you
>>>> trying to say? Sorry, I didn't get the point.
>>>> :-(
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: LAFK [mailto:tomasz.borek@gmail.com]
>>>> Sent: Wednesday, 22 April 2015 14:01
>>>> To: solr-user@lucene.apache.org; solr-user@lucene.apache.org
>>>> Subject: Odp.: solr issue with pdf forms
>>>>
>>>> Off the top of my head, I'd look into how writable PDFs are created and encoded.
>>>>
>>>> @LAFK_PL
>>>>   Original message
>>>> From: Steve.Scholl@t-systems.com
>>>> Sent: Wednesday, 22 April 2015 12:41
>>>> To: solr-user@lucene.apache.org
>>>> Reply-To: solr-user@lucene.apache.org
>>>> Subject: solr issue with pdf forms
>>>>
>>>> Hi guys,
>>>>
>>>> hopefully you can help me with my issue. We are using a Solr setup and have
>>>> the following issue:
>>>> - usual pdf files are indexed just fine
>>>> - pdf files with writable form-fields look like this:
>>>> Ich bestätige mit meiner Unterschrift, dass alle Angaben korrekt 
>>>> und v ollständig sind
>>>>
>>>> Somehow the blank space character is not indexed correctly.
>>>>
>>>> Is this a known issue? Does anybody have an idea?
>>>>
>>>> Thanks a lot
>>>> Best
>>>> Steve