Hello

This is a bit of a shot in the dark, but while I work out how to investigate further, I thought I would see whether we might be lucky enough to reach someone who has experienced an issue similar to the one we are facing right now.

 

First, a little bit of background.
We use Lucene.Net 3.0.3 to index JSON documents. Each JSON field is translated into a field name matching the path you would use to access that field on the document, so { obj: { fieldName: “42kittens” } } is translated into “obj.fieldName” = “42kittens”, and so on. Depending on the JSON data type, each field is indexed differently, but for now we can focus on “text fields”, as that is where our issue is at the moment.
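To make the field naming concrete, here is a minimal Python sketch of the flattening idea. It is not our actual indexing code: the function name `flatten` is made up for illustration, and the handling of arrays (each element reuses the parent's field name) is an assumption chosen to match the "obj.fieldName" example.

```python
import json


def flatten(value, prefix=""):
    """Yield (field_name, value) pairs for every leaf of a parsed JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            # Nested objects extend the dotted path: obj -> obj.fieldName
            name = f"{prefix}.{key}" if prefix else key
            yield from flatten(child, name)
    elif isinstance(value, list):
        # Assumption: array elements reuse the parent's field name,
        # so one field can occur several times on the same document.
        for child in value:
            yield from flatten(child, prefix)
    else:
        yield prefix, value


doc = json.loads('{"obj": {"fieldName": "42kittens", "tags": ["a", "b"]}}')
fields = list(flatten(doc))
print(fields)
```

Each resulting pair would then become one indexed field on the document.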

 

We use a StandardAnalyzer with an empty stop set, and the query parser is a slightly modified version of the MultiFieldQueryParser that allows “*” in range queries and uses a dynamic set of fields depending on what has been indexed. (We automatically keep track of all possible fields in the system.)

 

We currently have about 500,000 documents in our index. Each document has anywhere from ~10 fields to thousands of fields (a field may be represented multiple times because of arrays), which results in an index of about 4 GB.

 

All in all, everything seemed to work just fine. However, yesterday we discovered that we had some issues using wildcards.

 

We have some documents that represent ports all over the world. Each of these has what is called a locode, which is always 5 characters long, e.g. DKAAR, VIFRD, ITPVT etc. The first 2 letters represent the country, so DKAAR is in Denmark, VI is the U.S. Virgin Islands, and IT is Italy. You can find more here: http://locode.info (it might not be an exhaustive list).

Now if I search for “locode: MA*” I get:

 

-      MA888

-      MA6KN

 

However if I search for “locode: MAAGA” I get:

 

-      MAAGA

 

But MAAGA should have been included in the results of the search above, as MA* clearly should match MAAGA.

 

If I search for “locode: (MA* OR MAAGA)” I get:

 

-      MA888

-      MA6KN

-      MAAGA


If I search for “locode: MAA*” I get:

-      MAAHU

-      MAAZE

-      MAANZ

-      MAASI

-      MAAGA

 

All of these should have been part of the first result, right?
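To spell out the behaviour I expect, here is a small Python sketch (plain string matching, not Lucene code). The helper name `prefix_matches` is made up for illustration, and the case-insensitive comparison is an assumption meant to mirror the lowercasing that StandardAnalyzer applies at index time. The locodes are just the ones mentioned in this mail.

```python
def prefix_matches(terms, pattern):
    """Return the terms matched by a trailing-wildcard pattern such as 'MA*'."""
    if not pattern.endswith("*") or "*" in pattern[:-1]:
        raise ValueError("this sketch only supports a single trailing '*'")
    # Assumption: compare case-insensitively, as if both the indexed terms
    # and the query prefix had been lowercased.
    prefix = pattern[:-1].lower()
    return sorted(t for t in terms if t.lower().startswith(prefix))


LOCODES = ["MA888", "MA6KN", "MAAGA", "MAAHU", "MAAZE",
           "MAANZ", "MAASI", "DKAAR", "VIFRD", "ITPVT"]

ma = prefix_matches(LOCODES, "MA*")
maa = prefix_matches(LOCODES, "MAA*")

# Every hit for the longer prefix must also be a hit for the shorter one.
assert set(maa) <= set(ma)
print(ma)
print(maa)
```

In other words, I would expect the result of “MA*” to be a superset of the result of “MAA*”, which is not what I am seeing.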

 

So I am thinking that there is something I am missing here…

Kind regards

Jens Melgaard
System Architect

Søren Frichs Vej 39, 8000 Aarhus C
Denmark

Mobile:
+45 4196 5119
Jens.Melgaard@systematic.com
www.systematic.com
