lucene-solr-user mailing list archives

From Michael Della Bitta <michael.della.bi...@appinions.com>
Subject Re: Interesting issue with "special characters" in a string field value
Date Sun, 24 Feb 2013 21:16:12 GMT
Hello Jack,

I'm not sure if this is an option for you, but if you submit and
retrieve your documents using only SolrJ, you won't have to worry
about escaping them for encoding into a particular document format.
SolrJ would handle that for you.

Michael Della Bitta

------------------------------------------------
Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Sun, Feb 24, 2013 at 12:29 AM, Jack Park <jackpark@topicquests.org> wrote:
> Ok. I have revisited this issue as deeply as possible using simplistic
> unit tests, tossing out indexes, and starting fresh.
>
> A typical Solr document might have a label, e.g. the string inside the
> quotes: "Node Type".  That would be queried, according to what I've
> been able to read, as a Phrase Query, which means, include the quotes
> around the text.
>
> When I use the admin query panel with this query:
> label:"Node Type"
> A fragment of the full document is returned. It is this:
>
>   <doc>
>     <str name="locator">NodeType</str>
>     <arr name="label">
>       <str>Node Type</str>
>     </arr>
>
> In my code using SolrJ, I have printlines just as the "escaped" query
> string comes in, and one which shows what the SolrQuery looks like
> after setting it up to go online. I then show what came back:
>
> Solr3Client.runQuery- label:"Node Type" 0 10
> Solr3Client.runQuery-1 q=label%3A%22Node+Type%22&start=0&rows=10
> ZZZZ {numFound=1,start=0,docs=[SolrDocument{locator=NodeType,
> smallIcon=cogwheel.png, subOf=ClassType, details=The TopicQuests
> typology node type., isPrivate=false, creatorId=SystemUser, label=Node
> Type, largeIcon=cogwheel.png, lastEditDate=Sat Feb 23 20:43:22 PST
> 2013, createdDate=Sat Feb 23 20:43:22 PST 2013,
> _version_=1427826019119661056}]}
>
> What that says is that SolrQuery inserted a + inside the query string,
> and that it found 1 document, but did not return it.
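
The + in the logged q= parameter above is how a space is written in URL form encoding. Solr decodes it back to a space before the query is parsed, so the phrase query itself is unchanged. A minimal sketch of the same encoding, written in Python with only the standard library rather than SolrJ (the variable names are illustrative, not taken from the thread):

```python
from urllib.parse import quote_plus, unquote_plus

# The phrase query exactly as typed into the admin panel.
raw_query = 'label:"Node Type"'

# Form-encode it the way it appears in the logged request parameters.
encoded = quote_plus(raw_query)

# Decoding restores the original query, space included.
decoded = unquote_plus(encoded)

print(encoded)               # label%3A%22Node+Type%22
print(decoded == raw_query)  # True
```

The encoded value matches the q= string in the log line, which suggests the + is a transport detail and not something to escape or strip by hand.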
>
> In the largest picture, I have returned to using XMLResponseParser on
> the theory that I will now be able to take advantage of partialUpdates
> on multi-valued fields (List<String>) but haven't tested that yet. I
> am not yet escaping such things as "<" or ">" but just escaping those
> things mentioned in the Solr documents which are reserved characters.
>
> So, the current update is this: learning about phrase queries, and
> judicious escaping of reserved characters seems to be helping. Next up
> entails two issues: more robust testing of escaped characters, and
> trying to discover what is the best approach to dealing with
> characters that must be escaped to get past XML, e.g. '<', '>', and
> others.
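
For the XML side of the question, the usual approach is to escape values once when building the XML document and let an XML parser unescape them when reading the response. A minimal sketch in Python's standard library, assuming a hypothetical field named "details" and reusing the sample value quoted later in this thread:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

# Sample value containing characters that are meaningful in XML.
value = "<Something to use as a source node>"

# Escape &, < and > before placing the value in element text.
escaped = escape(value)

# Hypothetical field element, as it might appear in an XML update or response.
element_xml = '<field name="details">' + escaped + "</field>"

# A real XML parser undoes the escaping when the text is read back.
parsed_text = ET.fromstring(element_xml).text

print(escaped)               # &lt;Something to use as a source node&gt;
print(parsed_text == value)  # True
print(unescape(escaped) == value)
```

Escaping by hand is only needed when the XML is assembled as a string. A client library or XML writer that builds the document for you would normally take care of it.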
>
> Many thanks
> Jack
>
>
> On Fri, Feb 22, 2013 at 2:44 PM, Jack Park <jackpark@topicquests.org> wrote:
>> Michael,
>> I don't think you misunderstood. I will soon give a full response here, but
>> am on the road at the moment.
>>
>> Many thanks
>> Jack
>>
>>
>> On Friday, February 22, 2013, Michael Della Bitta
>> <michael.della.bitta@appinions.com> wrote:
>>> My mistake, I misunderstood the problem.
>>>
>>> Michael Della Bitta
>>>
>>>
>>> On Fri, Feb 22, 2013 at 3:55 PM, Chris Hostetter
>>> <hossman_lucene@fucit.org> wrote:
>>>>
>>>> : If you're submitting documents as XML, you're always going to have to
>>>> : escape meaningful XML characters going in. If you ask for them back as
>>>> : XML, you should be prepared to unescape special XML characters as
>>>>
>>>> that still wouldn't explain the discrepancy he's claiming to see between
>>>> the json & xml responses (the json containing an empty string)
>>>>
>>>> Jack: please elaborate with specifics about your solr version, field,
>>>> field type, how you indexed your doc, and what the request urls & raw
>>>> responses that you get are (ie: don't trust the XML you see in your
>>>> browser, it may be unescaping escaped sequences in element text to be
>>>> "helpful" .. use something like curl)
>>>>
>>>> For example...
>>>>
>>>> ----BEGIN GOOD EXAMPLE OF SPECIFICS---
>>>>
>>>> I'm using Solr 4.x with the 4.x example schema which has the following
>>>> field...
>>>>
>>>>    <field name="cat" type="string" indexed="true" stored="true"
>>>>           multiValued="true"/>
>>>>    <fieldType name="string" class="solr.StrField"
>>>>               sortMissingLast="true" />
>>>>
>>>> I indexed a doc like this...
>>>>
>>>> $ curl "http://localhost:8983/solr/update?commit=true" \
>>>>     -H 'Content-type:application/json' \
>>>>     -d '[{"id":"hoss", "cat":"<Something to use as a source node>" } ]'
>>>>
>>>> And this is what i get from the following requests...
>>>>
>>>> $ curl \
>>>>   "http://localhost:8983/solr/select?q=id:hoss&wt=xml&indent=true&omitHeader=true"
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <response>
>>>>
>>>> <result name="response" numFound="1" start="0">
>>>>   <doc>
>>>>     <str name="id">hoss</str>
>>>>     <arr name="cat">
>>>>       <str>&lt;Something to use as a source node&gt;</str>
>>>>     </arr>
>>>>     <long name="_version_">1427705631375097856</long></doc>
>>>> </result>
>>>> </response>
>>>>
>>>> $ curl \
>>>>   "http://localhost:8983/solr/select?q=id:hoss&wt=json&indent=true&omitHeader=true"
>>>> {
>>>>   "response":{"numFound":1,"start":0,"docs":[
>>>>       {
>>>>         "id":"hoss",
>>>>         "cat":["<Something to use as a source node>"],
>>>>         "_version_":1427705631375097856}]
>>>>   }}
>>>>
>>>> $ curl \
>>>>   "http://localhost:8983/solr/select?q=cat:%22<Something+to+use+as+a+source+node>%22&wt=json&indent=true&omitHeader=true"
>>>> {
>>>>   "response":{"numFound":1,"start":0,"docs":[
>>>>       {
>>>>         "id":"hoss",
>>>>         "cat":["<Something to use as a source node>"],
>>>>         "_version_":1427705631375097856}]
>>>>   }}
>>>>
>>>> ----END GOOD EXAMPLE OF SPECIFICS---
>>>>
>>>> : > Even more curious, if I use this query at the console:
>>>> : >
>>>> : > details:<Something to use as a source node>
>>>> : >
>>>> : > I get nothing back.
>>>>
>>>> note in my last example above the importance of using quotes (or the
>>>> {!term} qparser) to query string fields that contain special characters
>>>> like whitespace -- whitespace is syntactically meaningful to the lucene
>>>> query parser, it separates clauses of a boolean query.
>>>>
>>>>
>>>> -Hoss
>>>
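
Hoss's closing note about quoting string-field values (or using the {!term} qparser) can be sketched as three small helpers. This is an illustration in Python, not SolrJ code: the function names are hypothetical, and the set of escaped characters is an assumption loosely modeled on what SolrJ's ClientUtils.escapeQueryChars escapes, so check it against your Solr version.

```python
# Assumed set of query-parser special characters (whitespace handled separately).
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&;/')


def escape_query_chars(text):
    """Backslash-escape special characters and whitespace in a raw value."""
    return "".join(
        "\\" + ch if ch in SPECIAL_CHARS or ch.isspace() else ch for ch in text
    )


def phrase_query(field, value):
    """Wrap the value in quotes; only backslash and quote need escaping inside."""
    inner = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{inner}"'


def term_query(field, value):
    """Use the {!term} qparser, which takes the value verbatim."""
    return f"{{!term f={field}}}{value}"


value = "<Something to use as a source node>"
print(phrase_query("cat", value))
print(term_query("cat", value))
print(escape_query_chars("Node Type"))
```

Any of these strings still has to be URL-encoded when it is placed in a request URL by hand. A client library normally does that step for you.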
