lucene-solr-user mailing list archives

From "Tegelberg, Allan" <allan.tegelb...@boeing.com>
Subject solr parsed query dropping special chars
Date Thu, 24 Jan 2013 23:29:26 GMT
This post has non-ascii chars in it and might not look correct by the time you see it.
Issue: The indexed docs have embedded HTML numeric entities for these chars:
∠ ψ Σ • ≤ ≠ • ≥ μ ω φ θ ¢ β √ Ω ° ± Δ #
When I search for these characters in the admin query, I can only find the Greek letters.
Debug shows the parsed query contains only the Greek chars (omega, delta, sigma, ...)
but none of the others (degree, angle, cent, bullet, less_equal, ...).
 <lst name="debug">
  <str name="rawquerystring">∠  ψ Σ • ≤ ≠ • ≥ μ ω φ θ ¢ β √ Ω ° ± Δ #</str>
  <str name="querystring">∠  ψ Σ • ≤ ≠ • ≥ μ ω φ θ ¢ β √ Ω ° ± Δ #</str>
  <str name="parsedquery">text:ψ text:σ text:μ text:ω text:φ text:θ text:β text:ω text:δ</str>
  <str name="parsedquery_toString">text:ψ text:σ text:μ text:ω text:φ text:θ text:β text:ω text:δ</str>
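
One way to see what separates the nine surviving terms from the dropped ones is to look at their Unicode general categories. The sketch below is only an offline check in Python, not a diagnosis of this Solr setup: the schema for the `text` field is not shown in this post, so the idea that its analyzer keeps letters and discards symbols and punctuation is an assumption. The code points are the ones from the query in this post; `split_by_letter` is a hypothetical helper name.

```python
import unicodedata

# Code points of the query characters, in query order (from the numeric
# entities and the URL-encoded query in this post).
QUERY_CODE_POINTS = [8736, 968, 931, 8226, 8804, 8800, 8226, 8805, 956, 969,
                     966, 952, 162, 946, 8730, 937, 176, 177, 916, 35]


def split_by_letter(code_points):
    """Partition characters into letters (category L*) and everything else."""
    letters, others = [], []
    for cp in code_points:
        ch = chr(cp)
        if unicodedata.category(ch).startswith("L"):
            letters.append(ch)
        else:
            others.append(ch)
    return letters, others


letters, others = split_by_letter(QUERY_CODE_POINTS)
for ch in letters + others:
    # Print code point, general category and official character name.
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
```

Under this check the letters, once lowercased, are the same nine characters in the same order as the `parsedquery` terms, while everything that was dropped falls in a symbol or punctuation category. That is consistent with, but does not prove, a letter-oriented tokenizer plus a lowercase filter on the field.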

How can I configure Solr to find all of the entities?

Doc example

<doc>
<field name="id">symbols</field>
<field name="text">angle (&#8736;)</field>
<field name="text">upper_omega (&#937;)</field>
<field name="text">degree (&#176;)</field>
<field name="text">lower_psi (&#968;)</field>
<field name="text">plus_minus (&#177;)</field>
<field name="text">upper_sigma (&#931;)</field>
<field name="text">upper_delta (&#916;)</field>
<field name="text">mid_dot (&#183;)</field>
<field name="text">less_equal (&#8804;)</field>
<field name="text">pound (#)</field>
<field name="text">lower_omega (&#969;)</field>
<field name="text">lower_mu (&#956;)</field>
<field name="text">greater_equal (&#8805;)</field>
<field name="text">medium_bullet (&#8226;)</field>
<field name="text">not_equal (&#8800;)</field>
<field name="text">greater_than (&#62;)</field>
<field name="text">lower_phi (&#966;)</field>
<field name="text">lower_theta (&#952;)</field>
<field name="text">cent (&#162;)</field>
<field name="text">lower_beta (&#946;)</field>
<field name="text">square_root (&#8730;)</field>
</doc>
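
For reference, the numeric entities above are plain decimal code-point references, so they can be decoded outside Solr to confirm which characters the field values should contain. This is a small Python sketch using the standard `html` module on a few of the values from the doc example; `decode_fields` is a hypothetical helper, and the decoding Solr's own XML loader performs is not shown here.

```python
import html

# A few field values copied from the doc example above.
FIELD_VALUES = [
    "angle (&#8736;)",
    "degree (&#176;)",
    "lower_psi (&#968;)",
    "cent (&#162;)",
]


def decode_fields(values):
    """Replace numeric character references with the characters they name."""
    return [html.unescape(v) for v in values]


for raw, decoded in zip(FIELD_VALUES, decode_fields(FIELD_VALUES)):
    print(f"{raw!r} -> {decoded!r}")
```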

Query   ∠ ψ Σ • ≤ ≠ • ≥ μ ω φ θ ¢ β √ Ω ° ± Δ #

URL w/highlighting (the URL codes match the numeric entities)
http://xxx:8983/solr/select?indent=on&version=2.2
&q=%E2%88%A0+%CF%88+%CE%A3+%E2%80%A2+%E2%89%A4+%E2%89%A0+%E2%80%A2+%E2%89%A5+%CE%BC+%CF%89+%CF%86+%CE%B8+%C2%A2+%CE%B2+%E2%88%9A+%CE%A9+%C2%B0+%C2%B1+%CE%94+%23
&fq=&start=0&rows=10&fl=text&qt=standard&wt=standard&debugQuery=on&explainOther=
&hl=on&hl.snippets=20&hl.fragsize=5000&hl.fl=text
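
The claim that the URL codes match the numeric entities can be checked by percent-decoding the `q` parameter as UTF-8 and printing each term's code point in entity form. This is a minimal Python sketch using only the standard library; `decode_query` is a hypothetical helper, and it assumes every term in the query is a single character, as it is here.

```python
from urllib.parse import unquote_plus

# The q parameter copied from the request URL above.
Q = ("%E2%88%A0+%CF%88+%CE%A3+%E2%80%A2+%E2%89%A4+%E2%89%A0+%E2%80%A2+%E2%89%A5"
     "+%CE%BC+%CF%89+%CF%86+%CE%B8+%C2%A2+%CE%B2+%E2%88%9A+%CE%A9+%C2%B0+%C2%B1"
     "+%CE%94+%23")


def decode_query(q):
    """Return the code point of each whitespace-separated one-character term."""
    # unquote_plus turns '+' into spaces and decodes %XX bytes as UTF-8.
    return [ord(term) for term in unquote_plus(q).split()]


code_points = decode_query(Q)
for cp in code_points:
    # Show each term as a decimal entity and as a U+ code point.
    print(f"&#{cp}; U+{cp:04X}")
```

Each decoded value corresponds to an entity in the doc example, or to the literal `#`. The query contains the bullet (&#8226;) twice and does not include mid_dot (&#183;) or greater_than (&#62;).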

The response dumps the document and shows me that the chars exist in the document:
<str>angle (∠)</str>
<str>upper_omega (Ω)</str>
<str>degree (°)</str>
<str>lower_psi (ψ)</str>
<str>plus_minus (±)</str>
<str>upper_sigma (Σ)</str>
<str>upper_delta (Δ)</str>
<str>mid_dot (·)</str>
<str>less_equal (≤)</str>
<str>pound (#)</str>
<str>lower_omega (ω)</str>
<str>percent (%)</str>
<str>lower_mu (μ)</str>
<str>apostrophe (')</str>
<str>greater_equal (≥)</str>
<str>less_than (<)</str>
<str>medium_bullet (•)</str>
<str>not_equal (≠)</str>
<str>greater_than (>)</str>
<str>lower_phi (φ)</str>
<str>and (&)</str>
<str>lower_theta (θ)</str>
<str>qoute (")</str>
<str>cent (¢)</str>
<str>lower_beta (β)</str>
<str>square_root (√)</str>

But the highlighting only shows the Greek letters. If I search only for degree, I do not get any hits.
<lst name="highlighting">
   <lst name="symbols">
      <arr name="text">
         <str>upper_omega (<em>Ω</em>)</str>
         <str>lower_psi (<em>ψ</em>)</str>
         <str>upper_sigma (<em>Σ</em>)</str>
         <str>upper_delta (<em>Δ</em>)</str>
         <str>lower_omega (<em>ω</em>)</str>
         <str>lower_mu (<em>μ</em>)</str>
         <str>lower_phi (<em>φ</em>)</str>
         <str>lower_theta (<em>θ</em>)</str>
         <str>lower_beta (<em>β</em>)</str>
      </arr>
   </lst>


Any help highly appreciated
Allan Tegelberg