lucenenet-user mailing list archives

From Jens Melgaard <Jens.Melga...@Systematic.com>
Subject RE: Problems with Wildcard searches.
Date Sat, 30 Dec 2017 12:21:47 GMT
Cheers

That might be the right solution for us. For the time being we have adjusted the system to
run under en-GB/en by setting it in the web.config file (we can't set it to Invariant that
way; en-GB is nearly the same AFAIK, and regardless, it appears to work with that culture as
well). It's running in virtual environments that only really run that solution (it does
run Kudu for deployment, but that's behind the scenes, so that's not a big issue).

All in all, since we have an international audience, running under da-DK is odd anyway.

Again, thanks for the help. It is very much appreciated; we would never have solved it this
quickly without it!


Med venlig hilsen / Kind regards

Jens Melgaard
System Architect

Systematic A/S
Søren Frichs Vej 39
8000 Aarhus C
Denmark

Mobile: +45 4196 5119
Jens.Melgaard@systematic.com

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: 22 December 2017 17:33
To: user@lucenenet.apache.org
Subject: RE: Problems with Wildcard searches.

Jens,

Setting CultureInfo.DefaultThreadCurrentCulture applies to all threads (which is probably
not what you want especially if you are using ASP.NET).

There is a way that is less invasive. Since I can assume you are on .NET Framework (because
that is all that Lucene.NET 3.0.3 supports):

System.Threading.Thread.CurrentThread.CurrentCulture = CultureInfo.InvariantCulture;

This only applies to the current thread. You can store the current culture in a variable before
this operation and then restore it after the operation is complete. There is a standalone
CultureContext class here (https://github.com/apache/lucenenet/blob/a3a12967b250e8e7e5f623f0ba7572ec64f479ac/src/Lucene.Net/Support/CultureContext.cs)
that wraps this operation up so you can use a using block to ensure the culture is properly
restored.

// Your application code...

using (var invariantContext = new CultureContext(CultureInfo.InvariantCulture))
{
    // Lucene.NET query...
    
    // Optional block to temporarily restore the original culture
    using (var originalContext = new CultureContext(invariantContext.OriginalCulture))
    {
        // Your application code...
    }
    
    // Lucene.NET query...
}

// Your application code...
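
If pulling in the CultureContext class is not an option, the store-and-restore approach described above can also be written by hand. The following is a minimal sketch, not code from Lucene.NET; the CultureScope class and RunInvariant method are hypothetical names chosen for illustration.

```csharp
using System;
using System.Globalization;
using System.Threading;

public static class CultureScope
{
    // Hypothetical helper: runs 'operation' with the current thread's culture
    // set to InvariantCulture, then restores the previous culture even if
    // the operation throws.
    public static T RunInvariant<T>(Func<T> operation)
    {
        CultureInfo original = Thread.CurrentThread.CurrentCulture;
        Thread.CurrentThread.CurrentCulture = CultureInfo.InvariantCulture;
        try
        {
            // e.g. parse and execute the Lucene.NET query here
            return operation();
        }
        finally
        {
            Thread.CurrentThread.CurrentCulture = original;
        }
    }
}
```

A call such as CultureScope.RunInvariant(() => searcher.Search(query, 10)) would then run only the search under the invariant culture; searcher and query stand for whatever objects the application already has.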

Hope this helps.

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Jens Melgaard [mailto:Jens.Melgaard@Systematic.com]
Sent: Friday, December 22, 2017 4:07 PM
To: user@lucenenet.apache.org
Subject: RE: Problems with Wildcard searches.

Hi Shad

Cheers for the input!...

With that input I think I am indeed able to reproduce the issue by forcing the culture to
be DA-DK for an application like so:

CultureInfo.DefaultThreadCurrentCulture = CultureInfo.GetCultureInfo("DA-DK");

And then indexing: locode=

- MASFI
- MA888
- MA6KN
- MASUR
- MAANO
- MAAHR
- DKAAR
- DKKBH

Search: "locode: MA*"; 2 hits:
- MA888
- MA6KN

Search: "locode: MAA*"; 2 hits: 
- MAANO
- MAAHR

Etc... So that really seems to be the issue. With that knowledge I can rationalize why
MA* does not yield the locodes that start with MAA, as AA in Danish is the old spelling of Å and
would be ordered after Z.

I can't quite rationalize the MAS-- though, but that would just be for curiosity anyways.
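
One possible explanation for the MAS-- case, offered as an assumption that has not been checked against the 3.0.3 source here: if the prefix enumeration walks the terms in ordinal order, tests each one with a culture-sensitive StartsWith, and stops at the first term that fails, then a failing MAA term would end the scan before MASFI and MASUR are ever reached. The sketch below simulates that behaviour. PrefixEnumSketch and its members are hypothetical names, and whether "MAAHR" fails the da-DK prefix test depends on the collation data of the runtime, so no output is claimed for that case.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class PrefixEnumSketch
{
    // Hypothetical stand-in for a prefix term enumeration: terms are visited
    // in ordinal order, starting at the first term >= prefix, and the scan
    // ends at the first term that fails the supplied prefix test.
    public static List<string> Enumerate(
        IEnumerable<string> terms, string prefix, Func<string, string, bool> startsWith)
    {
        var hits = new List<string>();
        var candidates = terms
            .OrderBy(t => t, StringComparer.Ordinal)
            .SkipWhile(t => string.CompareOrdinal(t, prefix) < 0);

        foreach (string term in candidates)
        {
            if (!startsWith(term, prefix))
            {
                break; // the first miss ends the enumeration
            }
            hits.Add(term);
        }
        return hits;
    }

    // Culture-insensitive prefix test.
    public static bool OrdinalStartsWith(string term, string prefix)
    {
        return term.StartsWith(prefix, StringComparison.Ordinal);
    }

    // Culture-sensitive prefix test for the given culture.
    public static Func<string, string, bool> CultureStartsWith(CultureInfo culture)
    {
        return (term, prefix) => term.StartsWith(prefix, false, culture);
    }
}
```

With the ordinal test, a scan for MA over the eight locodes above returns all six MA terms. Passing PrefixEnumSketch.CultureStartsWith(CultureInfo.GetCultureInfo("da-DK")) instead lets you check whether your runtime treats AA as a single letter, in which case the scan would stop at the first MAA term.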

Anyway, besides changing the OS culture, setting CultureInfo.DefaultThreadCurrentCulture,
or making a modified version of Lucene 3.0.3 where we explicitly set the culture in all places,
are there any solutions that are less invasive?

From your description and my own prior knowledge of Lucene.NET, my guess is no, but I wanted
to make sure.

Anyways, thanks again!...

Med venlig hilsen / Kind regards

Jens Melgaard
System Architect

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: 21 December 2017 22:12
To: user@lucenenet.apache.org
Subject: RE: Problems with Wildcard searches.

Hi Jens,

This reminds me a little of some of the bugs I tracked down before in Lucene.NET 4.8.0.

One of the issues was due to the fact that the Java counterparts of SortedSet<string>/SortedDictionary<string,
TValue> compare strings in a culture-insensitive way, so wherever string was used as the key, the results
were sorted in the wrong order in .NET, whose default string comparer is culture-sensitive. So, all of the
SortedSet<string> and SortedDictionary<string, TValue> instances were updated to use a StringComparer.Ordinal
comparer to ensure the results are in the same order as in Java. Sometimes the result is dependent upon the
items being in the proper sequence, and if not, the results are cut short.
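
As a minimal sketch of that difference (the SortedSetSketch class is a hypothetical name, not part of Lucene.NET), the two methods below build a SortedSet<string> with and without an explicit ordinal comparer. The default ordering follows the current culture, so its result can vary between machines; only the ordinal ordering is fixed.

```csharp
using System;
using System.Collections.Generic;

public static class SortedSetSketch
{
    // Ordinal ordering: compares UTF-16 code units, independent of culture.
    public static List<string> OrdinalOrder(IEnumerable<string> items)
    {
        var set = new SortedSet<string>(items, StringComparer.Ordinal);
        return new List<string>(set);
    }

    // Default ordering: uses the default string comparer, which is
    // sensitive to the current thread's culture.
    public static List<string> DefaultOrder(IEnumerable<string> items)
    {
        var set = new SortedSet<string>(items);
        return new List<string>(set);
    }
}
```

Comparing the output of both methods for the same terms under the culture in question is a quick way to see whether the ordering differs on a given machine.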

You might want to try doing the search in the invariant culture to see if you get better results.
Might not be the issue, but it is a pretty quick theory to test.

Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Jens Melgaard [mailto:Jens.Melgaard@Systematic.com]
Sent: Friday, December 22, 2017 4:00 AM
To: user@lucenenet.apache.org
Subject: RE: Problems with Wildcard searches.

Hello Anders

At this point, I don't have any code to share (unfortunately). That is because the current
code is far too voluminous to share, so that would make no sense.
Currently my own investigation has me at a dead end, because when testing
against an older database (with about 60% of the data) I haven't been able to reproduce the
issue.

That obviously makes it a bit difficult to boil down the code to something that is meaningful
to share at the time... So next step will be to get a fresh backup of the data so I can try
to see if I can reproduce it on that and then slowly trim down the code from there to a minimal
example...

So for now, as I said, I was mainly hoping for a bit of luck in posting here, knowing that
it was quite a shot in the dark until I have something more concrete. Thanks for your response
so far, though.


Med venlig hilsen / Kind regards

Jens Melgaard
System Architect

-----Original Message-----
From: Anders Lybecker [mailto:anders@lybecker.com]
Sent: 21 December 2017 12:40
To: user@lucenenet.apache.org
Subject: Re: Problems with Wildcard searches.

Hi Jens,

You are right. Something is wrong here.

Can you share some code? This seems odd.

Regards,
Anders Lybecker (a fellow Dane :-))

On Thu, Dec 21, 2017 at 11:16 AM, Jens Melgaard <Jens.Melgaard@systematic.com> wrote:

> Hello
>
> This is a bit of a shot in the dark, but while I work out how to
> investigate further, I thought I would see whether we might be lucky
> enough to reach someone who has experienced an issue similar to the one
> we are facing right now.
>
> First, a little bit of background.
> We use Lucene.Net 3.0.3 to index JSON documents. Each JSON field is
> translated into a field name matching how you would access that field on
> the document, so { obj: { fieldName: “42kittens” } } would be translated
> into “obj.fieldName” = “42kittens” etc. Depending on the JSON data type,
> each field is indexed differently, but right now we can focus on
> “text fields” as that is where our issue is at the moment.
>
> We use a StandardAnalyzer with an empty stop set, and the query parser
> is a slightly modified version of the MultiFieldQueryParser that allows
> “*” in range queries and has a dynamic field set depending on what has
> been indexed. (We automatically keep track of all possible fields in
> the system.)
>
> We currently have about 500,000 documents in our index. Each document
> ranges from ~10 fields to thousands of fields (each field may be
> represented multiple times because of arrays). This results in an index
> of about 4 GB.
>
> All in all everything seemed to work just fine; however, yesterday we
> discovered that we had some issues using wildcards.
>
> We have some documents which represent ports all over the world.
> These have what is called a locode. A locode is always 5 characters,
> e.g. DKAAR, VIFRD, ITPVT etc… The first 2 letters represent the
> country, so DKAAR is in Denmark, VI is the U.S. Virgin Islands, IT is Italy.
> You can find more here: http://locode.info (it might not be an exhaustive list)…
>
> Now if I search for “locode: MA*” I get:
>
> - MA888
> - MA6KN
>
> However if I search for “locode: MAAGA” I get:
>
> - MAAGA
>
> But that should have been included in the search above it as MA*
> clearly should match MAAGA.
>
> If I search for “locode: (MA* OR MAAGA)” I get:
>
> - MA888
> - MA6KN
> - MAAGA
>
> Now if I search for “locode: MAA*” I now get:
>
> - MAAHU
> - MAAZE
> - MAANZ
> - MAASI
> - MAAGA
>
> Which all should be part of the first result right?...
>
> So I am thinking that there is something I am missing here…
>
> Med venlig hilsen / Kind regards
>
> Jens Melgaard
> System Architect