lucenenet-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shad Storhaug <s...@shadstorhaug.com>
Subject Lucene.Net.Analysis.Common Work
Date Wed, 24 Aug 2016 09:57:13 GMT
Itamar,

Just updating you regarding the work on Analysis on the mailing list as per your request.

To the rest, if anyone wants to help close out Analysis.Common, please coordinate with Itamar.

Scope of Remaining 43 Failing Tests

17 of the failing tests are in the Analysis.Synonym namespace (out of 51). I am not sure exactly
what is happening here, but I am convinced that this issue is local to the namespace (or the
tests) and it can be overcome with a little diligence. I plan to have another look at this
today. Anyone who beats me to a solution gets an "atta-boy" (or "atta-girl") from me.

14 of the failing tests are in the Analysis.Th namespace (out of 16), and is subsequently
causing the Analysis.Core.TestFactories.Test() method to fail (15 failing tests total). That
is, Thai support is completely broken. Using the Clone() method of the BreakIterator subsequently
results in null reference exceptions when SetText() is called. It is unclear why cloning is
even necessary, though, and it appears this issue can be overcome by creating the iterators
using the static methods instead of cloning. However the Thai words are still not breaking
correctly, and I am not sure what is happening there. Looks like the documentation needs to
be consulted on ICU to figure out how to use it properly. I haven't tried all of the available
locales (only US English), but I do find it odd that there is no Locale for Thai in the package
though since that appears to be the only language it supports (well, perhaps Lao, Khmer and
Burmese will need this, but we don't have support for them anyway). I came up with another
thought today that I didn't try as well - perhaps the culture of the current thread needs
to be changed to Thai in order to make it work. Worth another look.

5 of the failing tests are in the Analysis.Compound namespace (out of 17). I had to take a
slightly different approach than the Java implementation because it uses a SAX XML parser
of which there is no direct equivalent in .NET, so made do with the XmlReader. It appears
to be loading the data correctly, but it is difficult to verify this without cross-checking
it against its Java counterpart. So, it is unknown whether the bugs are in the XML parsing,
the data structure, the code that uses the data structure, some dependency of the code, or
all of the above.

2 of the failing tests are in the HtmlStripCharFilterTest class. For some reason, there are
null characters being added to the random tests which are not parsing correctly. The data
being printed out in the log is not helpful to determine what the input is that is causing
the failures. I attempted to pass null characters in using a couple of different forms and
the parser seems to handle them fine. My gut feeling tells me that this is a problem with
the random strings that are being generated for the test, not a problem with the parser.

1 of the failing tests is in the Analysis.Tr namespace (Turkish). The cause of this failure
appears to be in the core. The Character.CodePointAt() method is returning character 304 rather
than the expected LATIN_CAPITAL_LETTER_I + non-spacing mark character. I didn't look into
it further than that.

1 of the failing tests, Analysis.Core.TestBugInSomething.TestWrapping() appears to be an issue
with porting the test, and I will address that today.

1 of the failing tests, Analysis.Pattern.TestCaptureGroupTokenFilter.TestRandomString() is
(I suspect) another issue with the random string generation.

1 of the failing tests, Analysis.Util.TestCharArrayMap.TestToString() is failing because we
are not passing an object reference of CharArrayMap into the CharArraySet when it is instantiated
in the CharArrayMap.KeySet() method. Without the direct object reference, calling ToString()
on the CharArrayMap.Keys property returns an incorrect result because the CharArraySet and
CharArrayMap are not referencing the same array. As far as I can tell, this is not causing
an issue other than this one test.

It's a tricky issue because CharArrayMap is generic and CharArraySet is not. There are at
least 3 ways we could solve this:


1.       Wrap the array (and other shared dependencies) into an internal reference type so
both classes are referring to the same instance.

2.       Create a non-generic ICharArrayMap interface that references the few generic members
as type object, and use the interface rather than the concrete type as the type to pass to
CharArraySet.

3.       Make CharArraySet generic so its type remains in sync with the CharArrayMap.

I took a stab at #3 on this branch: https://github.com/NightOwl888/lucenenet/tree/analysis-chararray-generic,
and it appears like it can compile that way, but it caused some other problems and I didn't
get a chance to see if they could all be resolved.

It also occurred to me that rather than making an abstract CharArraySet and generic CharArraySet<V>,
it might be worth considering making an ICharArraySet interface to pass the concrete type
around without knowing its type, and making the CharArraySet a superclass of CharArraySet<string>,
since more than 95% of the time we are using it with a string anyway (how many words do you
know that are not strings?). It might be confusing, though - most people would expect an untyped
CharArraySet class to be type object rather than string.

Not Yet Ported

There are just a few (known) remaining pieces missing out of Analysis.Common.


1.       Collation - Looks like we need to port over the Java Collator class. I took a stab
at it in [this WIP commit](https://github.com/NightOwl888/lucenenet/commit/e6a037b2bb971b1015b2fc96eca663267df08273).
AFAIK, ChristopherHaws is porting it (https://github.com/apache/lucenenet/pull/181#issuecomment-241481735),
check with him if you want to give it a try as there doesn't seem to be much to it but will
require some finesse to make it work with .NET.

2.       Analysis.Core.UpperCaseFilterFactory - Not sure how I missed it, but I will get on
this today.

3.       Analysis.Miscellaneous.PatternAnalyzer - Obsolete and superseded by the Analysis.Pattern
namespace. Do we really need this?

4.       Analysis.Core.TestAllAnalyzersHaveFactories

5.       Analysis.Core.TestRandomChains

6.       Analysis.Miscellaneous.PatternAnalyzerTest

7.       Analysis.Miscellaneous.TestSingleTokenFilter

8.       Analysis.Util.TestAnalysisSPILoader


Testing Issues

As for putting all of the Hunspell language dictionary (in compatible versions) into the repository
to enable automated testing, I am not sure how well that would be received. The binaries are
together an additional 108 MB, which means a lot of extra stuff to download if cloning Lucene.Net.
In addition, the tests take quite a while to finish - at least 20-30 minutes. I think the
Java dev team made the right choice by making this a manual task.

In any case, I have created a repo here: https://github.com/NightOwl888/lucenenet-hunspell
that you can clone/copy/rehost to share these known working dictionary files with users of
Lucene.Net. That is not to say that other versions of dictionaries won't work, but there can
be issues when .NET doesn't recognize the encoding string or there are tags that the Hunspell
parser doesn't recognize.

It also occurred to me that there are probably other hidden issues that won't rear their ugly
head unless we change the culture of the current thread while testing. We should modify the
test framework to test another culture on demand or perhaps randomly to weed out any localization
issues while parsing strings.


Merging to Master/Prerelease

The only real concern here might be that the solution ultimately used to fix (or not fix)
CharArraySet might end up changing the public API of CharArraySet. Other than that, I don't
see any real issues that are holding up the release here. Thoughts?


Shad Storhaug/NightOwl888


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message