lucenenet-user mailing list archives

From Jens Melgaard <Jens.Melga...@Systematic.com>
Subject Problems with Wildcard searches.
Date Thu, 21 Dec 2017 10:16:03 GMT
Hello

This is a bit of a shot in the dark, but while I work out how to investigate further, I
thought I would ask whether anyone here has experienced an issue similar to the one we are
facing right now.

First, a little background.
We use Lucene.Net 3.0.3 to index JSON documents. Each JSON field is translated into a field name
that matches the path you would use to access it on the document, so { obj: { fieldName: "42kittens" } }
becomes "obj.fieldName" = "42kittens", and so on. Each field is indexed differently depending on
its JSON data type, but for now we can focus on "text fields", as that is where our issue
currently lies.
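
For readers who want to picture the field naming, here is a minimal Python sketch of the flattening described above. It is not our actual indexing code (that is C# on Lucene.Net); the function name `flatten` and the choice to let array elements repeat the parent's field name are assumptions made for illustration.

```python
import json


def flatten(value, prefix=""):
    """Yield (field_name, leaf_value) pairs for a parsed JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            # Build the dotted path, e.g. "obj" + "fieldName" -> "obj.fieldName".
            name = f"{prefix}.{key}" if prefix else key
            yield from flatten(child, name)
    elif isinstance(value, list):
        # Assumption: array elements reuse the parent's name, so the
        # same field name is emitted once per element.
        for child in value:
            yield from flatten(child, prefix)
    else:
        yield prefix, value


doc = json.loads('{"obj": {"fieldName": "42kittens", "tags": ["a", "b"]}}')
fields = list(flatten(doc))
# fields == [("obj.fieldName", "42kittens"), ("obj.tags", "a"), ("obj.tags", "b")]
```

Each pair would then become one indexed field on the Lucene document.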

We use a StandardAnalyzer with an empty stop set, and the query parser is a slightly modified
version of MultiFieldQueryParser that allows "*" in range queries and uses a dynamic field set
depending on what has been indexed. (We automatically keep track of all possible fields in
the system.)

We currently have about 500,000 documents in our index. Each document has anywhere from ~10 fields
to thousands of fields (a field may be represented multiple times because of arrays). This
results in an index of about 4 GB.

All in all, everything seemed to work just fine; however, yesterday we discovered that we had
some issues using wildcards.

We have some documents that represent ports all over the world. Each of these has what is called
a locode, which is always 5 characters, e.g. DKAAR, VIFRD, ITPVT. The first 2 letters
represent the country, so DK is Denmark, VI is the U.S. Virgin Islands, and IT is Italy. You
can find more here: http://locode.info (it might not be an exhaustive list).

Now if I search for "locode: MA*" I get:

- MA888
- MA6KN

However, if I search for "locode: MAAGA" I get:

- MAAGA

But that should have been included in the search above, as MA* clearly should match MAAGA.

If I search for "locode: (MA* OR MAAGA)" I get:

- MA888
- MA6KN
- MAAGA

If I then search for "locode: MAA*" I get:

- MAAHU
- MAAZE
- MAANZ
- MAASI
- MAAGA

All of these should be part of the first result, right?
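
To spell out the expectation, here is a small Python sketch of plain prefix matching over the locodes mentioned above. It only illustrates what I expect a trailing wildcard to do; it does not model how Lucene.Net actually evaluates the query, and the helper name `prefix_matches` is made up for this example.

```python
def prefix_matches(terms, pattern):
    """Return the terms matched by a trailing-wildcard pattern such as 'MA*'."""
    if not pattern.endswith("*"):
        raise ValueError("only trailing-wildcard patterns are handled here")
    prefix = pattern[:-1]
    return {term for term in terms if term.startswith(prefix)}


# Locodes taken from the results and examples in this message.
locodes = {"MA888", "MA6KN", "MAAGA", "MAAHU", "MAAZE", "MAANZ", "MAASI", "DKAAR"}

ma = prefix_matches(locodes, "MA*")
maa = prefix_matches(locodes, "MAA*")

# Every hit for the longer prefix should also be a hit for the shorter one,
# so this set should be empty.
missing = maa - ma
```

In our index the equivalent of `missing` contains all five MAA... locodes, which is the behaviour I cannot explain.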

So I am thinking that there is something I am missing here...
Kind regards

Jens Melgaard
System Architect

Søren Frichs Vej 39, 8000 Aarhus C
Denmark

Mobile: +45 4196 5119
Jens.Melgaard@systematic.com
www.systematic.com