From: Michael Della Bitta
Date: Sun, 24 Feb 2013 16:16:12 -0500
Subject: Re: Interesting issue with "special characters" in a string field value
To: solr-user@lucene.apache.org

Hello Jack,

I'm not sure if this is an option for you, but if you submit and
retrieve your documents using only SolrJ, you won't have to worry
about escaping them for encoding into a particular document format.
SolrJ would handle that for you.

Michael Della Bitta

------------------------------------------------
Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game

On Sun, Feb 24, 2013 at 12:29 AM, Jack Park wrote:
> Ok. I have revisited this issue as deeply as possible using simplistic
> unit tests, tossing out indexes, and starting fresh.
>
> A typical Solr document might have a label, e.g. the string inside the
> quotes: "Node Type". That would be queried, according to what I've
> been able to read, as a Phrase Query, which means, include the quotes
> around the text.
>
> When I use the admin query panel with this query:
> label:"Node Type"
> A fragment of the full document is returned. It is this:
>
>   NodeType
>
>   Node Type
>
> In my code using SolrJ, I have printlines just as the "escaped" query
> string comes in, and one which shows what the SolrQuery looks like
> after setting it up to go online.
> I then show what came back:
>
> Solr3Client.runQuery- label:"Node Type" 0 10
> Solr3Client.runQuery-1 q=label%3A%22Node+Type%22&start=0&rows=10
> ZZZZ {numFound=1,start=0,docs=[SolrDocument{locator=NodeType,
> smallIcon=cogwheel.png, subOf=ClassType, details=The TopicQuests
> typology node type., isPrivate=false, creatorId=SystemUser, label=Node
> Type, largeIcon=cogwheel.png, lastEditDate=Sat Feb 23 20:43:22 PST
> 2013, createdDate=Sat Feb 23 20:43:22 PST 2013,
> _version_=1427826019119661056}]}
>
> What that says is that SolrQuery inserted a + inside the query string,
> and that it found 1 document, but did not return it.
>
> In the largest picture, I have returned to using XMLResponseParser on
> the theory that I will now be able to take advantage of partialUpdates
> on multi-valued fields (List) but haven't tested that yet. I
> am not yet escaping such things as "<" or ">" but just escaping those
> things mentioned in the Solr documents which are reserved characters.
>
> So, the current update is this: learning about phrase queries, and
> judicious escaping of reserved characters seems to be helping. Next up
> entails two issues: more robust testing of escaped characters, and
> trying to discover what is the best approach to dealing with
> characters that must be escaped to get past XML, e.g. '<', '>', and
> others.
>
> Many thanks
> Jack
>
>
> On Fri, Feb 22, 2013 at 2:44 PM, Jack Park wrote:
>> Michael,
>> I don't think you misunderstood. I will soon give a full response here, but
>> am on the road at the moment.
>>
>> Many thanks
>> Jack
>>
>>
>> On Friday, February 22, 2013, Michael Della Bitta
>> wrote:
>>> My mistake, I misunderstood the problem.
>>>
>>> Michael Della Bitta
>>>
>>>
>>> On Fri, Feb 22, 2013 at 3:55 PM, Chris Hostetter
>>> wrote:
>>>>
>>>> : If you're submitting documents as XML, you're always going to have to
>>>> : escape meaningful XML characters going in. If you ask for them back as
>>>> : XML, you should be prepared to unescape special XML characters as
>>>>
>>>> that still wouldn't explain the discrepancy he's claiming to see between
>>>> the json & xml responses (the json containing an empty string)
>>>>
>>>> Jack: please elaborate with specifics about your solr version, field,
>>>> field type, how you indexed your doc, and what the request urls & raw
>>>> responses that you get are (ie: don't trust the XML you see in your
>>>> browser, it may be unescaping escaped sequences in element text to be
>>>> "helpful" .. use something like curl)
>>>>
>>>> For example...
>>>>
>>>> ----BEGIN GOOD EXAMPLE OF SPECIFICS---
>>>>
>>>> I'm using Solr 4.x with the 4.x example schema which has the following
>>>> field...
>>>>
>>>>    multiValued="true"/>
>>>>    />
>>>>
>>>> I indexed a doc like this...
>>>>
>>>> $ curl "http://localhost:8983/solr/update?commit=true" -H
>>>> 'Content-type:application/json' -d '[{"id":"hoss", "cat":"<Something to use
>>>> as a source node>" } ]'
>>>>
>>>> And this is what i get from the following requests...
>>>>
>>>> $ curl
>>>> "http://localhost:8983/solr/select?q=id:hoss&wt=xml&indent=true&omitHeader=true"
>>>>
>>>>     hoss
>>>>
>>>>     <Something to use as a source node>
>>>>
>>>>     1427705631375097856
>>>>
>>>> $ curl
>>>> "http://localhost:8983/solr/select?q=id:hoss&wt=json&indent=true&omitHeader=true"
>>>> {
>>>>   "response":{"numFound":1,"start":0,"docs":[
>>>>       {
>>>>         "id":"hoss",
>>>>         "cat":[""],
>>>>         "_version_":1427705631375097856}]
>>>>   }}
>>>>
>>>> $ curl
>>>> "http://localhost:8983/solr/select?q=cat:%22%22&wt=json&indent=true&omitHeader=true"
>>>> {
>>>>   "response":{"numFound":1,"start":0,"docs":[
>>>>       {
>>>>         "id":"hoss",
>>>>         "cat":[""],
>>>>         "_version_":1427705631375097856}]
>>>>   }}
>>>>
>>>> ----END GOOD EXAMPLE OF SPECIFICS---
>>>>
>>>> : > Even more curious, if I use this query at the console:
>>>> : >
>>>> : > details:
>>>> : >
>>>> : > I get nothing back.
>>>>
>>>> note in my last example above the importance of using quotes (or the
>>>> {!term} qparser) to query string fields that contain special characters
>>>> like whitespace -- whitespace is syntactically meaningful to the lucene
>>>> query parser, it separates clauses of a boolean query.
>>>>
>>>>
>>>> -Hoss
>>>
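
For reference, here is a minimal sketch of two encoding steps discussed in
this thread. It is written in Python with only the standard library, which
is an assumption for illustration (nobody in the thread used Python), and it
does not talk to Solr at all. It shows that the "+" in the printed query
string is what standard form URL-encoding produces for the space in the
phrase query, and that "<" and ">" in a field value round-trip cleanly
through XML escaping. All variable names are hypothetical.

```python
from urllib.parse import quote_plus, unquote_plus
from xml.sax.saxutils import escape, unescape

# The phrase query as typed into the admin query panel.
phrase_query = 'label:"Node Type"'

# Form URL-encoding: ':' -> %3A, '"' -> %22, space -> '+'.
encoded = quote_plus(phrase_query)
print(encoded)

# Decoding on the receiving side restores the original query text,
# so the '+' is only a transport encoding of the space.
decoded = unquote_plus(encoded)
print(decoded)

# A string field value containing XML-significant characters.
raw_value = "<Something to use as a source node>"

# Escaping for embedding in an XML document, and unescaping on the way back.
xml_escaped = escape(raw_value)
xml_unescaped = unescape(xml_escaped)
print(xml_escaped)
print(xml_unescaped)
```

The encoded form matches the q= parameter printed by the client code quoted
above, which suggests the "+" there is the encoded space and not a change
to the query itself.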