lucene-solr-user mailing list archives

From Steve Rowe <sar...@gmail.com>
Subject Re: Arabic words search in solr
Date Thu, 09 Mar 2017 19:14:29 GMT
Hi Mohan,

Your examples refer to documents I don’t have in my 9-document set, so I recast the problem
to a query/doc combo I have from earlier in this thread, and I was able to restrict hits to
only documents that contained all terms from the query.

If I use the query “name_ar:(شرطة ازكي)” I get 3 hits (I’ve left out some details):

-----
{ "responseHeader": { ... "params": { "q":"name_ar:(شرطة ازكي)", ... } },
  "response": { "numFound":3, "start":0,
    "docs": [
      { "id":"6", "name_ar":["شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية - - مركز شرطة إزكي"], ... },
      { "id":"3", "name_ar":["شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية - - مركز شرطة إبراء"], ... },
      { "id":"8", "name_ar":["وزارة الصحة - المديرية العامة للخدمات الصحية  محافظة الداخلية -  - مستشفى إزكي (البدالة)  - الطوارئ"], ... }]}
-----

If I add “q.op=AND” to the request, only one of these documents matches - note that I’ve
also checked the “debugQuery” option on the Admin UI:

-----
{ "responseHeader": { ...
  "params": { "q":"name_ar:(شرطة ازكي)", "q.op":"AND", "debugQuery":"true", ... } },
  "response": { "numFound":1, "start":0,
    "docs": [
      { "id":"6", "name_ar":["شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية - - مركز شرطة إزكي"], ... }]},
  "debug": {
    "rawquerystring": "name_ar:(شرطة ازكي)",
    "querystring": "name_ar:(شرطة ازكي)",
    "parsedquery": "+name_ar:شرط +name_ar:ازك",
    "parsedquery_toString": "+name_ar:شرط +name_ar:ازك",
-----

Note the “parsedquery” above - each term is prefixed with ‘+’, which marks it as required, and
each term is qualified with the field name.  This is how the “name_ar:(شرطة ازكي)” query is
interpreted when the “q.op=AND” request param is used.
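
As a rough sketch, the request above could be built in Python like this. The host, port and
core name ("mycore") are placeholders that do not come from this thread, and "wt=json" is
only there on the assumption that JSON output is wanted; "q", "q.op" and "debugQuery" are the
request params shown in the output above.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical endpoint: host, port and core name are placeholders.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_select_url(query, require_all_terms=True, debug=False):
    """Return a select URL for `query`, optionally adding q.op=AND and debugQuery."""
    params = {"q": query, "wt": "json"}  # wt=json is an assumption, see lead-in
    if require_all_terms:
        params["q.op"] = "AND"  # every term in the query becomes required
    if debug:
        params["debugQuery"] = "true"  # asks Solr to include the "debug" section
    # urlencode percent-encodes the Arabic text as UTF-8 by default.
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_select_url("name_ar:(شرطة ازكي)", debug=True)
print(url)

# Decode the query string again to confirm the Arabic text survives the round trip.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0])
```

Passing require_all_terms=False leaves "q.op" out, so the query parser falls back to its
default operator.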

The equivalent query using ‘+’ signs is: “name_ar:(+شرطة +ازكي)”.  This *looks*
strange because of how the Unicode bidirectional algorithm works.  This W3C writeup uses Arabic
to drive its discussion of display of strings that contain both RTL and LTR character runs,
and I found it quite helpful here: <https://www.w3.org/International/articles/inline-bidi-markup/uba-basics>.
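
One way to convince yourself that the stored order is fine, whatever the display looks like,
is to inspect the string in logical order.  The sketch below is only an illustration: it
rebuilds the ‘+’ query from its two terms and prints each character with its Unicode
bidirectional class, using Python's standard unicodedata module.  The variable and function
names are made up for this example.

```python
import unicodedata

terms = ["شرطة", "ازكي"]
# Build the '+' form in logical (storage) order: each '+' comes right before its term.
query = "name_ar:(" + " ".join("+" + t for t in terms) + ")"


def bidi_classes(text):
    """Return (character, bidi class) pairs in logical order."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# Arabic letters report class 'AL' (strong right-to-left), Latin letters 'L',
# and '+' reports 'ES', a weak class whose placement depends on its neighbours.
for ch, cls in bidi_classes(query):
    print(repr(ch), cls)

plus_pos = query.index("+")
term_pos = query.index(terms[0])
print("first '+' is stored before the first term:", plus_pos < term_pos)
```

The query parser sees the characters in this stored order, so the ‘+’ applies to the term
that follows it in the string, regardless of which side it is drawn on.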

Here’s the output from the “name_ar:(+شرطة +ازكي)” query:

-----
{ "responseHeader": { ... "params": { "q":"name_ar:(+شرطة +ازكي)", "debugQuery":"true", ... } },
  "response": { "numFound":1, "start":0,
    "docs": [
      { "id":"6", "name_ar":["شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية - - مركز شرطة إزكي"], ... }]},
  "debug": {
    "rawquerystring": "name_ar:(+شرطة +ازكي)",
    "querystring": "name_ar:(+شرطة +ازكي)",
    "parsedquery": "+name_ar:شرط +name_ar:ازك",
    "parsedquery_toString": "+name_ar:شرط +name_ar:ازك",
-----

The above is the same result (and has the same “parsedquery”) as the query “name_ar:(شرطة ازكي)”
with request param “q.op=AND”.
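
If you want to check this kind of equivalence without reading the output by eye, a small
comparison along these lines may help.  It is a sketch only: the two dictionaries are
hand-abbreviated copies of the responses shown in this message (just the fields being
compared), and the helper name is invented for the example.

```python
# Abbreviated copies of the two responses above; only the compared fields are kept.
resp_q_op_and = {
    "response": {"numFound": 1},
    "debug": {"parsedquery": "+name_ar:شرط +name_ar:ازك"},
}
resp_plus_signs = {
    "response": {"numFound": 1},
    "debug": {"parsedquery": "+name_ar:شرط +name_ar:ازك"},
}


def same_interpretation(a, b):
    """True if two debugQuery responses share a parsedquery and a hit count."""
    return (
        a["debug"]["parsedquery"] == b["debug"]["parsedquery"]
        and a["response"]["numFound"] == b["response"]["numFound"]
    )


print(same_interpretation(resp_q_op_and, resp_plus_signs))
```

With real responses you would load the JSON returned by Solr into these dictionaries instead
of typing them in.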

I won’t show it here, but I get the same 1-hit result for this query when I use AND instead
of ‘+’: “name_ar:(شرطة AND ازكي)” - note that the terms only *appear* to be
in reverse order because of how the Unicode bidirectional algorithm works.

> On Mar 9, 2017, at 2:30 AM, mohanmca01 <mohanmca01@gmail.com> wrote:
> 
> I saw your products in lucidworks website. Do you have any solr arabic
> support customized product?

Lucidworks doesn’t have a specifically Arabic-focused product, but we have helped people
enable Arabic search in the past.  Click on the “Contact Us” link on the website if you’d
like to talk to us about getting involved.

--
Steve
www.lucidworks.com

