lucene-solr-user mailing list archives

From Jack Park <jackp...@topicquests.org>
Subject Re: Interesting issue with "special characters" in a string field value
Date Sun, 24 Feb 2013 05:29:11 GMT
Ok. I have revisited this issue as deeply as possible using simplistic
unit tests, tossing out indexes, and starting fresh.

A typical Solr document might have a label, e.g. the string inside the
quotes: "Node Type".  From what I've been able to read, that would be
queried as a phrase query, which means including the quotes around the
text.

When I use the admin query panel with this query:
label:"Node Type"
the document is returned. Here is a fragment of it:

  <doc>
    <str name="locator">NodeType</str>
    <arr name="label">
      <str>Node Type</str>
    </arr>

In my code using SolrJ, I print the "escaped" query string just as it
comes in, then what the SolrQuery looks like after it has been set up
to go online, and then what came back:

Solr3Client.runQuery- label:"Node Type" 0 10
Solr3Client.runQuery-1 q=label%3A%22Node+Type%22&start=0&rows=10
ZZZZ {numFound=1,start=0,docs=[SolrDocument{locator=NodeType,
smallIcon=cogwheel.png, subOf=ClassType, details=The TopicQuests
typology node type., isPrivate=false, creatorId=SystemUser, label=Node
Type, largeIcon=cogwheel.png, lastEditDate=Sat Feb 23 20:43:22 PST
2013, createdDate=Sat Feb 23 20:43:22 PST 2013,
_version_=1427826019119661056}]}

What that says is that SolrQuery encoded the space as a + inside the
query string, and that it found 1 document, which appears in docs.
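
The + in the second printline looks like ordinary URL form-encoding of
the space, not a change to the query itself. A minimal sketch using only
the JDK's URLEncoder reproduces the same string; the class name
QueryEncodingDemo is hypothetical, and it is an assumption that
SolrQuery encodes its parameters the same way when printed:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class, not part of SolrJ.
public class QueryEncodingDemo {

    // Form-encode a raw query string the way an HTTP client would.
    static String encode(String rawQuery) {
        try {
            return URLEncoder.encode(rawQuery, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "label:\"Node Type\"";
        // ':' becomes %3A, '"' becomes %22, and the space becomes '+'
        System.out.println("q=" + encode(raw));
    }
}
```

On the server side the + is decoded back to a space before the query
parser sees it, so the phrase query arrives as label:"Node Type".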

In the larger picture, I have returned to using XMLResponseParser on
the theory that I will now be able to take advantage of partialUpdates
on multi-valued fields (List<String>), but I haven't tested that yet. I
am not yet escaping such things as "<" or ">", only those characters
that the Solr documentation lists as reserved.
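
As a rough illustration of escaping reserved query characters, here is
a sketch that backslash-escapes them by hand. The class name
QueryEscaper and the exact character list are assumptions based on the
query parser syntax documentation; SolrJ also ships
ClientUtils.escapeQueryChars, which is the safer choice in real code:

```java
// Hypothetical helper, shown only to illustrate the idea.
public class QueryEscaper {

    // Assumed list of characters the Lucene/Solr query parser treats
    // as syntax. Note that '<' and '>' are not in this list.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    // Prefix every special character (and whitespace) with a backslash.
    static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("label:" + escape("Node Type"));
    }
}
```

Escaping the whitespace this way is an alternative to wrapping the
whole value in quotes; the two should not be combined on the same
value.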

So, the current update is this: learning about phrase queries and
judiciously escaping reserved characters seem to be helping. Next up
are two issues: more robust testing of escaped characters, and
discovering the best approach to dealing with characters that must be
escaped to get past XML, e.g. '<', '>', and others.
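
For the XML side, a minimal sketch of entity-escaping a field value
before it is placed in hand-built XML might look like the following.
The class name XmlEscaper is hypothetical, and this is only relevant if
the XML is assembled manually; as far as I know, a client library that
serializes documents itself does this escaping on its own:

```java
// Hypothetical helper for hand-built XML; not part of SolrJ.
public class XmlEscaper {

    // Replace the five characters that XML reserves with entities.
    static String escapeXml(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (char c : value.toCharArray()) {
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeXml("<Something to use as a source node>"));
    }
}
```

The escaped form is the same one that shows up in the raw wt=xml
response quoted further down in this thread.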

Many thanks
Jack


On Fri, Feb 22, 2013 at 2:44 PM, Jack Park <jackpark@topicquests.org> wrote:
> Michael,
> I don't think you misunderstood. I will soon give a full response here, but
> am on the road at the moment.
>
> Many thanks
> Jack
>
>
> On Friday, February 22, 2013, Michael Della Bitta
> <michael.della.bitta@appinions.com> wrote:
>> My mistake, I misunderstood the problem.
>>
>> Michael Della Bitta
>>
>> ------------------------------------------------
>> Appinions
>> 18 East 41st Street, 2nd Floor
>> New York, NY 10017-6271
>>
>> www.appinions.com
>>
>> Where Influence Isn’t a Game
>>
>>
>> On Fri, Feb 22, 2013 at 3:55 PM, Chris Hostetter
>> <hossman_lucene@fucit.org> wrote:
>>>
>>> : If you're submitting documents as XML, you're always going to have to
>>> : escape meaningful XML characters going in. If you ask for them back as
>>> : XML, you should be prepared to unescape special XML characters as
>>>
>>> that still wouldn't explain the discrepancy he's claiming to see between
>>> the json & xml responses (the json containing an empty string)
>>>
>>> Jack: please elaborate with specifics about your solr version, field,
>>> field type, how you indexed your doc, and what the request urls & raw
>>> responses that you get are (ie: don't trust the XML you see in your
>>> browser, it may be unescaping escaped sequences in element text to be
>>> "helpful" .. use something like curl)
>>>
>>> For example...
>>>
>>> ----BEGIN GOOD EXAMPLE OF SPECIFICS---
>>>
>>> I'm using Solr 4.x with the 4.x example schema which has the following
>>> field...
>>>
>>>    <field name="cat" type="string" indexed="true" stored="true"
>>> multiValued="true"/>
>>>    <fieldType name="string" class="solr.StrField" sortMissingLast="true"
>>> />
>>>
>>> I indexed a doc like this...
>>>
>>> $ curl "http://localhost:8983/solr/update?commit=true" -H
>>> 'Content-type:application/json' -d '[{"id":"hoss", "cat":"<Something to use
>>> as a source node>" } ]'
>>>
>>> And this is what i get from the following requests...
>>>
>>> $ curl
>>> "http://localhost:8983/solr/select?q=id:hoss&wt=xml&indent=true&omitHeader=true"
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <response>
>>>
>>> <result name="response" numFound="1" start="0">
>>>   <doc>
>>>     <str name="id">hoss</str>
>>>     <arr name="cat">
>>>       <str>&lt;Something to use as a source node&gt;</str>
>>>     </arr>
>>>     <long name="_version_">1427705631375097856</long></doc>
>>> </result>
>>> </response>
>>>
>>> $ curl
>>> "http://localhost:8983/solr/select?q=id:hoss&wt=json&indent=true&omitHeader=true"
>>> {
>>>   "response":{"numFound":1,"start":0,"docs":[
>>>       {
>>>         "id":"hoss",
>>>         "cat":["<Something to use as a source node>"],
>>>         "_version_":1427705631375097856}]
>>>   }}
>>>
>>> $ curl
>>> "http://localhost:8983/solr/select?q=cat:%22<Something+to+use+as+a+source+node>%22&wt=json&indent=true&omitHeader=true"
>>> {
>>>   "response":{"numFound":1,"start":0,"docs":[
>>>       {
>>>         "id":"hoss",
>>>         "cat":["<Something to use as a source node>"],
>>>         "_version_":1427705631375097856}]
>>>   }}
>>>
>>> ----END GOOD EXAMPLE OF SPECIFICS---
>>>
>>> : > Even more curious, if I use this query at the console:
>>> : >
>>> : > details:<Something to use as a source node>
>>> : >
>>> : > I get nothing back.
>>>
>>> note in my last example above the importance of using quotes (or the
>>> {!term} qparser) to query string fields that contain special characters
>>> like whitespace -- whitespace is syntactically meaningful to the lucene
>>> query parser, it separates clauses of a boolean query.
>>>
>>>
>>> -Hoss
>>
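
The two options Hoss mentions for querying a string field, quoting the
value or using the {!term} qparser, can be sketched as plain string
builders. The class and method names here are hypothetical, and the
resulting string would still need to be passed to the client as the q
parameter:

```java
// Hypothetical helpers that build query strings for string fields.
public class StringFieldQueries {

    // Quoted form: escape embedded backslashes and quotes, then wrap
    // the value in double quotes so whitespace is not read as syntax.
    static String phraseQuery(String field, String value) {
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        return field + ":\"" + escaped + "\"";
    }

    // {!term} form: the value after the closing brace is taken as-is,
    // so no query-syntax escaping is applied to it.
    static String termQuery(String field, String value) {
        return "{!term f=" + field + "}" + value;
    }

    public static void main(String[] args) {
        String value = "<Something to use as a source node>";
        System.out.println(phraseQuery("cat", value));
        System.out.println(termQuery("cat", value));
    }
}
```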
