lucene-solr-user mailing list archives

From Rick Leir <rl...@leirtech.com>
Subject Re: recent utf8 problems
Date Tue, 07 Nov 2017 11:36:50 GMT
Dr Krell
Item 11): It is best to get the solrconfig.xml provided with the new version of Solr, and
change it to suit your needs. Do not try to work from the old version's solrconfig.xml.

I did not have time to read the other items. 

Look in solr.log, and compare the successful query with the unsuccessful one for clues, then
look at the config for /select again.
Cheers -- Rick

On November 7, 2017 12:43:00 AM EST, "Dr. Mario Michael Krell" <krell@uni-bremen.de> wrote:
>Hi,
>
>thank you for your time and trying to narrow down my problem.
>
>1) When looking for Tübingen in the title, I am expecting the 3092484
>results. That sounds like a reasonable result. Furthermore, when
>looking at some of the results, they are exactly what I am looking for.
>
>2) I am testing them against the same Solr server. This is a very
>simple test setup that reduces our problem to its core. Originally,
>we used a urllib.request.urlopen query to get the data in Python and
>then sent it to our webpage (http://search.mmcommons.org/) as a JSON
>object. I think I should explain my test more clearly. We use a
>web browser (Firefox or Chrome) to open the admin console of the search
>engine, which is at http://localhost:8983/solr/#/mmc_search3/query
>on my local device.
>This is the default behavior. In this web browser, I use the query
>“title:T%C3%BCbingen” in the field “q” with /select as the
>“Request-Handler (qt)”. This approach works like a charm (result with
>echoParams attached). Also, as asked by Rick, the request URL displayed
>in the upper left is just perfect:
>http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:T%C3%BCbingen&wt=python
>The problems start to occur when I click on this URL:
>{
>  'responseHeader':{
>    'status':0,
>    'QTime':0,
>    'params':{
>      'q':u'title:T\u00fcbingen',
>      'echoParams':'all',
>      'wt':'python'}},
>  'response':{'numFound':0,'start':0,'docs':[]
>  }}
>So it seems that Solr (or a library it uses) is changing the request
>internally. I just don’t have any idea why. But I would like to get the
>more than 3 million results. I could as well just enter the above URL
>into my browser, and the URL will be changed to
>http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:Tübingen&wt=python
>and I get the same result (no documents found). So this is the problem.
>However, when I copy and paste the URL, it still shows the
>percent-encoded UTF-8 form. I think the “ü” in the URL is just a nicer
>rendering by the browser.
>
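The difference between the two echoed q values above (the literal text
T%C3%BCbingen versus the decoded T\u00fcbingen) can be reproduced with
Python's standard urllib.parse. This is only a sketch, and it assumes that
the admin console percent-encodes whatever is typed into the q field before
sending it; that behavior is inferred from the echoParams output in this
thread, not confirmed.

```python
from urllib.parse import quote, unquote

# The literal text typed into the admin console's q field.
typed = "title:T%C3%BCbingen"

# Assumption: the admin console percent-encodes the field contents before
# sending them, so each "%" becomes "%25" on the wire and the server
# decodes the request back to exactly the typed text.
sent_by_form = quote(typed, safe=":")
received_from_form = unquote(sent_by_form)

# Opening the displayed URL directly sends the typed text as-is, so the
# server decodes "%C3%BC" into the single character U+00FC.
received_from_link = unquote(typed)

print(sent_by_form)
print(received_from_form)
print(received_from_link)
```

If that assumption holds, the console query searches for the literal
characters "%C3%BC", while the clicked URL searches for "ü". The stored title
in the attached result is shown as 'T%C3%BCbingen', which would be consistent
with only the first form matching, but that remains an inference.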
>The confusion about the different Solr versions comes from the fact that
>I am continuously trying to improve my search index and make it more
>efficient. Hence I reindexed it several times, always with the latest
>version. The last reindexing was done with Solr 7.0.1, so the index is
>in the Lucene 7.0.1 format. However, I also performed the test with other
>versions, without any success.
>
>3) As Rick said: "With the Yahoo Flickr Creative Commons 100 Million
>(YFCC100m) dataset, a great novel dataset was introduced to the
>computer vision and multimedia research community." — cool
>
>My objective is to make it more usable, especially by providing
>different search modalities. The dataset consists of 99 million images
>and 800k videos, but I am only working on the Flickr metadata as well as
>generated metadata, and I try to add more and more metadata. The next big
>challenge is similarity search.
>
>4)
>http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:Tübingen&wt=python
>is displayed, but the actual URL is
>http://localhost:8983/solr/mmc_search3/select?echoParams=all&q=title:T%C3%BCbingen&wt=python
>
>5) I am searching for Tübingen. It is u-umlaut (LATIN SMALL LETTER U
>WITH DIAERESIS) as Rick said.
>
>6) I am just clicking on it in the standard Solr admin interface. I
>could as well copy it into my web browser and open it. The result would
>be the same.
>
>7) As you can see in the result, the document seems to be indexed
>correctly, doesn’t it? If we can’t figure anything out, I will try to
>reindex again, but this will take a while because of the large amount of
>data and my limited compute power.
>
>8) Thanks for the hint with echoparams. The result is displayed above.
>
>9) As shown in the attached search result, there are actually results
>correctly indexed.
>
>10) The above example is now with Python.
>
>11) @Rick: Shall I change the /select handler? I do not quite
>understand the problem with it. But maybe as an explanation, my
>original config was probably based on solr4.x. I basically just updated
>the Lucene version and I had to replace/remove some parts because they
>were not supported anymore.
>
>12) For playing the “what changed previous to it being broken” game, I
>am wondering if Solr (6.5 or 7.0.1) has any dependencies other
>than Java. However, playing this game is quite difficult, because the
>human mind is not that good at it. We only tested once in a while whether
>requests with special symbols work, and we mainly tested only in the
>GUI without actually clicking on the resulting link that is displayed.
>Later we tested with the webpage once, and it was working. To figure
>out why it is not working anymore, we reduced the factors as much as
>possible and eventually arrived at the aforementioned test.
>{
>  'responseHeader':{
>    'status':0,
>    'QTime':131,
>    'params':{
>      'q':'title:T%C3%BCbingen',
>      'echoParams':'all',
>      'wt':'python',
>      '_':'1510024595963'}},
>  'response':{'numFound':3092484,'start':0,'docs':[
>      {
>        'photoid':'6182384834',
>        'hash':'7b201435fc5126accbfee6453b7fb181',
>        'userid':'48992104@N00',
>        'datetaken':'2011-09-04T13:19:16Z',
>        'dateuploaded':'2011-09-25T11:54:41Z',
>        'capturedevice':'NIKON COOLPIX S2500',
>        'title':'T%C3%BCbingen',
>        'longitude':9.055888,
>        'latitude':48.520157,
>        'accuracy':16,
>        'licensename':'Attribution-NonCommercial-ShareAlike License',
>        'marker':0,
>        'year':2011,
>        'yearmonth':201109,
>        'month':9,
>        'a_autotags':['city',
>          'nature',
>          'outdoor',
>          'cityscape',
>          'valley',
>          'landscape',
>          'architecture',
>          'canyon'],
>        'p_town':'\'Tuebingen\'',
>        'p_state':'\'Baden-Wurttemberg\'',
>        'p_country':'\'Germany\'',
>        'p_places':['\'Neckargasse\'',
>          '\'Tuebingen\'',
>          '\'Tubingen\'',
>          '\'Baden-Wurttemberg\'',
>          '\'72070\'',
>          '\'Germany\'',
>          '\'Europe%2FBerlin\''],
>        'usertags':['not_provided'],
>        'facet_usertags':['not_provided'],
>        'description':'not_provided',
>        'a_architecture':656,
>        'a_canyon':504,
>        'a_city':656,
>        'a_cityscape':575,
>        'a_landscape':542,
>        'a_nature':542,
>        'a_outdoor':924,
>        'a_valley':504,
>        '_version_':1581268421041979393},
>
>
>> On Nov 6, 2017, at 16:03, Chris Hostetter <hossman_lucene@fucit.org> wrote:
>> 
>> 
>> : We recently discovered issues with Solr when converting UTF-8 codes
>> : in the search. One or two months ago everything was still working.
>> : 
>> : - What might have caused it is a Java update (Java 8 Update 151). 
>> : - We are using firefox as well as chrome for displaying results.
>> : - We tested it with Solr 6.5, Solr 7.0.0, 7.0.1, and 7.1.
>> 
>> Just to be clear: in the 2 examples you provide below...
>> 
>> 1) which situation do you consider "correct" ? 
>>     ("match lots of docs" or "match no docs")
>> 2) are you testing those against the same live solr server?
>> 
>> I ask Q #2 because you mentioned "One or two months ago everything was
>> still working" ... but it's not clear what part of the "results"
>> were different one or two months ago.
>> 
>> other things that are unclear/confusing about your question...
>> 
>> : We created a search engine based on the yfcc100m and in the normal 
>> : browser (http://localhost:8983/solr/#/mmc_search3/query), we can
>> : search for “title:T%C3%BCbingen” in the query field and get more
>> : than 3 million results:
>> 
>> 3) what is "yfcc100m" ?
>> 4) what is the actual URL you see in your browser?
>> 5) what is the underlying byte sequence / character sequence you are 
>> trying to search for?
>> 
>> ie: can you please explicitly name the UNICODE codepoints you are 
>> intending to search for?
>> 
>> : However, when we use the respective web-address, 
>> : http://localhost:8983/solr/mmc_search3/select?q=title:T%C3%BCbingen&wt=json
>> 
>> 6) define "use the respective web-address" ?
>>    (how are you using it? what http client is hitting that url?)
>> 
>> 
>> Some general advice about debugging possible charset related issues:
>> 
>> * the problem may be related to how the query is executed -- or it may
>> have been related to how the data was originally indexed, if at that time
>> the wrong byte sequences were sent.
>> 
>> * you can use things like "echoParams=all" in a query to see exactly what
>> unicode characters solr is receiving in the q param
>> * assuming the field you are searching is stored=true, you can also send
>> requests to search for one of the documents you expect by id, and verify
>> what unicode characters were indexed.
>> * in both types of requests, you can use "wt=python" to help see the 
>> underlying bytes being returned for each character (the python response
>> writer escapes all characters outside of the ascii range)
>> 
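The echoParams check described above could be scripted roughly as follows.
This is a sketch under assumptions: the host and core name are copied from
the URLs quoted in this thread, wt=json is used in place of wt=python so that
the response can be parsed with the json module, and the helper names
(build_debug_url, describe_codepoints, echoed_query) are made up for
illustration.

```python
import json
import unicodedata
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: host and core name as they appear in the URLs in this thread.
SOLR_SELECT = "http://localhost:8983/solr/mmc_search3/select"


def build_debug_url(query, base=SOLR_SELECT):
    """Return a select URL that asks Solr to echo all request parameters."""
    params = {"q": query, "echoParams": "all", "wt": "json", "rows": 0}
    # urlencode percent-encodes the value exactly once, as UTF-8.
    return base + "?" + urlencode(params)


def describe_codepoints(text):
    """List (code point, Unicode name) for every non-ASCII character."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
            for ch in text if ord(ch) > 127]


def echoed_query(query):
    """Send the query and return the q parameter as Solr reports receiving it."""
    with urlopen(build_debug_url(query)) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["responseHeader"]["params"]["q"]


if __name__ == "__main__":
    # Needs no running server: shows the URL and the characters being sent.
    print(build_debug_url("title:Tübingen"))
    print(describe_codepoints("title:Tübingen"))
```

With a Solr instance running, describe_codepoints(echoed_query(...)) would
show which non-ASCII characters, if any, actually reached the server.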
>> 
>> 
>> -Hoss
>> http://www.lucidworks.com/

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 