lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-2950) Modules under top-level modules/ directory should be included in lucene's build targets, e.g. 'package-tgz', 'package-tgz-src', and 'javadocs'
Date Tue, 08 Mar 2011 20:49:59 GMT


Robert Muir commented on LUCENE-2950:

bq. How would this work? E.g. many contribs depend on the common-analyzers module. Removing
this dependency would almost certainly make the contribs non-functional.

The dependency is mostly bogus. Here are the contribs in question:
* ant
* demo
* lucli
* misc
* spellchecker
* swing
* wordnet

For example the ant IndexTask only depends on this so it can make this hashmap:
    // Maps short analyzer names to fully qualified class names.
    static {
      analyzerLookup.put("simple", SimpleAnalyzer.class.getName());
      analyzerLookup.put("standard", StandardAnalyzer.class.getName());
      analyzerLookup.put("stop", StopAnalyzer.class.getName());
      analyzerLookup.put("whitespace", WhitespaceAnalyzer.class.getName());
    }

I think we could remove this: it already has reflection code to build the analyzer, so if
you supply "Xyz", why not just look for XyzAnalyzer as a fallback?
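A minimal sketch of such a fallback, using plain Java reflection. The class name AnalyzerNameResolver and its parameters are hypothetical and are not taken from IndexTask; the default package and suffix are passed in because the right values depend on where the analyzers live.

```java
// Hypothetical helper, not the actual IndexTask code.
final class AnalyzerNameResolver {
    private AnalyzerNameResolver() {}

    /**
     * Tries the supplied name as a fully qualified class name first.
     * If that fails, falls back to defaultPackage + "." + Capitalized(name) + suffix,
     * e.g. "xyz" with suffix "Analyzer" becomes "<defaultPackage>.XyzAnalyzer".
     */
    static Class<?> resolve(String name, String defaultPackage, String suffix)
            throws ClassNotFoundException {
        if (name == null || name.isEmpty()) {
            throw new IllegalArgumentException("name must not be empty");
        }
        try {
            return Class.forName(name);
        } catch (ClassNotFoundException e) {
            String capitalized =
                Character.toUpperCase(name.charAt(0)) + name.substring(1);
            return Class.forName(defaultPackage + "." + capitalized + suffix);
        }
    }
}
```

With something like this, the hardcoded lookup table (and the compile-time dependency it drags in) would not be needed; a caller would pass the analyzer package and the suffix "Analyzer".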

The lucli code has 'StandardAnalyzer' as a default: I think it's best to not have a default
analyzer at all. I would have fixed this already, but this contrib module has no tests! This
makes it hard to want to get in there and clean up.

The misc code mostly supplies an Analyzer inside embedded tools that don't actually analyze
anything. We could add a pkg-private NullAnalyzer that throws UOE on its tokenStream().
Since these tools shouldn't be analyzing anything, that seems reasonable to do?
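A rough sketch of what such a NullAnalyzer could look like. To keep it compilable on its own, the Analyzer class below is a stand-in with a simplified signature, not Lucene's real class; the message text is also just an illustration.

```java
import java.io.Reader;

// Stand-in for Lucene's Analyzer so this sketch compiles by itself.
// The real class has a different return type and a larger API.
abstract class Analyzer {
    abstract Object tokenStream(String fieldName, Reader reader);
}

// Package-private on purpose: only the embedded tools would use it.
final class NullAnalyzer extends Analyzer {
    @Override
    Object tokenStream(String fieldName, Reader reader) {
        // Fail loudly if a tool that should never analyze text tries to.
        throw new UnsupportedOperationException(
            "NullAnalyzer cannot analyze field '" + fieldName + "'");
    }
}
```

The point of throwing instead of returning an empty stream is that an accidental analysis call shows up as an error rather than as silently missing tokens.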

The spellchecker code has a hardcoded WhitespaceAnalyzer... why is this? Seems like the whole
spellchecking n-gramming is wrong anyway. Spellchecker uses a special form of n-gramming that
depends upon the word length. Currently it does this in Java code and indexes with WhitespaceAnalyzer
(creating a lot of garbage in the process, e.g. lots of Field objects), but it seems this
could all be cleaned up so that the spellchecker uses its own SpellCheckNgramAnalyzer, for
better performance to boot.
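To illustrate what "n-gramming that depends upon the word length" means, here is a small standalone sketch. The class name and the length thresholds are assumptions chosen for illustration, and this is plain string code, not the proposed SpellCheckNgramAnalyzer.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: gram sizes grow with the length of the word.
final class LengthDependentNGrams {
    private LengthDependentNGrams() {}

    // Assumed thresholds: short words get smaller grams.
    static int minGram(int wordLength) {
        if (wordLength > 5) return 3;
        if (wordLength == 5) return 2;
        return 1;
    }

    static int maxGram(int wordLength) {
        if (wordLength > 5) return 4;
        if (wordLength == 5) return 3;
        return 2;
    }

    // Emits every gram of each size between minGram and maxGram, inclusive.
    static List<String> grams(String word) {
        List<String> out = new ArrayList<>();
        int len = word.length();
        for (int n = minGram(len); n <= maxGram(len); n++) {
            for (int i = 0; i + n <= len; i++) {
                out.add(word.substring(i, i + n));
            }
        }
        return out;
    }
}
```

Under these assumed thresholds, "cat" yields grams of size 1 and 2, while "lucene" yields grams of size 3 and 4. An analyzer doing this inside the token stream could avoid creating a separate Field object per gram.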

The swing code defaults to a WhitespaceAnalyzer... in my opinion, again, it's best to not have
a default analyzer and make the user somehow specify one.

The wordnet code uses StandardAnalyzer for indexing the wordnet database. It also includes
a very limited SynonymTokenFilter. In my opinion, now that we merged the SynonymFilter
from Solr that supports multi-word synonyms etc (which this wordnet module DOES NOT!), we
should nuke this whole thing.

Instead, we should make the synonym-loading process more flexible, so that one can produce
the SynonymMap from various formats (such as the existing Solr format, a relational database,
wordnet's format, or OpenOffice thesaurus format, among others). We could have parsers for
these various formats. This would allow us to have a much more powerful synonym capability,
that works nicely regardless of format. We could then look at other improvements, such as
allowing SynonymFilter to use a more RAM-conscious data structure for its synonym mappings
(e.g. FST), and everyone would see the benefits.
So hopefully this entire contrib could be deprecated.
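
One way the "parser per format" idea could be shaped, as a sketch only. The SynonymParser interface and SimpleSolrSynonymParser class are hypothetical names, the result is a plain Map rather than a real SynonymMap, and the parser handles only a simplified subset of the Solr synonyms syntax (comments, "a, b" groups, and "a => b" mappings).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical interface: one implementation per input format.
interface SynonymParser {
    Map<String, Set<String>> parse(Reader in) throws IOException;
}

// Simplified subset of the Solr synonyms file syntax.
final class SimpleSolrSynonymParser implements SynonymParser {
    @Override
    public Map<String, Set<String>> parse(Reader in) throws IOException {
        Map<String, Set<String>> map = new LinkedHashMap<>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue; // skip comments
            int arrow = line.indexOf("=>");
            if (arrow >= 0) {
                // Explicit mapping: each left-hand term maps to all right-hand terms.
                String[] targets = split(line.substring(arrow + 2));
                for (String source : split(line.substring(0, arrow))) {
                    add(map, source, targets);
                }
            } else {
                // Equivalence group: every term maps to the whole group.
                String[] group = split(line);
                for (String term : group) add(map, term, group);
            }
        }
        return map;
    }

    private static String[] split(String s) {
        return s.trim().split("\\s*,\\s*");
    }

    private static void add(Map<String, Set<String>> map, String key, String[] values) {
        if (key.isEmpty()) return;
        Set<String> set = map.get(key);
        if (set == null) {
            set = new LinkedHashSet<>();
            map.put(key, set);
        }
        for (String v : values) {
            if (!v.isEmpty()) set.add(v);
        }
    }
}
```

A wordnet or OpenOffice thesaurus parser would then be another implementation of the same interface, and the filter itself would not need to know which format the synonyms came from.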

> Modules under top-level modules/ directory should be included in lucene's build targets,
> e.g. 'package-tgz', 'package-tgz-src', and 'javadocs'
> ----------------------------------------------------------------------------------------------------------------------------------------------
>                 Key: LUCENE-2950
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 4.0
>            Reporter: Steven Rowe
>            Priority: Blocker
>             Fix For: 4.0
> Lucene's top level {{modules/}} directory is not included in the binary or source release
> distribution Ant targets {{package-tgz}} and {{package-tgz-src}}, or in {{javadocs}}, in {{lucene/build.xml}}.
> (However, these targets do include Lucene contribs.)
> This issue is visible via the nightly Jenkins (formerly Hudson) job named "Lucene-trunk",
> which publishes binary and source artifacts, using {{package-tgz}} and {{package-tgz-src}},
> as well as javadocs using the {{javadocs}} target, all run from the top-level {{lucene/}}

This message is automatically generated by JIRA.