lucene-pylucene-dev mailing list archives

From Andi Vajda <va...@apache.org>
Subject Re: SynonymAnalyzer(s) in PyLucene34
Date Thu, 27 Oct 2011 13:09:35 GMT

Hi Thomas (and Mike, for a question),

On Thu, 27 Oct 2011, Thomas Koch wrote:

> while I was playing with the SynonymAnalyzer stuff (pylucene-3.4 samples) I
> discovered that the wordnet example is broken due to an outdated wordnet
> database: The SynonymAnalyzerTest works fine, but the SynonymAnalyzerViewer
> fails with:
> ...lucene.JavaError: org.apache.lucene.index.IndexFormatTooOldException:
> Format version is not supported in file 'segments': 44132 (needs to be
> between -1 and -11). This version of Lucene only supports indexes created
> with release 3.0 and later.
>
> The WordNetSynonymEngine uses an index contained in the indexes.tgz file
> which is looked up in indexes\wordnet - this file (dated 2004) seems to be
> an old lucene index format. I managed to find the files required to build
> the index for lucene-3.4, adjusted the WordNetSynonymEngine to work with
> lucene 3.4 and all seems to be working again. I've created an archive with
> the relevant changes and uploaded it to the pylucene-extras project - just
> in case anyone is interested:
> http://code.google.com/a/apache-extras.org/p/pylucene-extra/downloads/list
>
> BTW, who is maintaining/updating the samples that are included in the
> distribution?

Most PyLucene samples are ports of the samples from the first edition of the
Lucene in Action book to PyLucene. I ported them all and I'm the maintainer.
If you find bugs, patches are of course welcome.

It looks like the wordnet index file comes directly from the downloadable
Lucene in Action samples. Given that it is now seven years old, something
like that was bound to happen.
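
One way to spot such an index before Lucene rejects it is to look at the
number at the start of its segments file. The sketch below is plain Python,
not PyLucene. It assumes the file begins with a 4-byte big-endian signed
integer holding the format version, which is consistent with how the error
message reports 44132, and it takes the -1 to -11 range from that message
rather than from Lucene's source. The function and constant names are made
up for illustration.

```python
import struct

# Range quoted in the error message above (an assumption, not read from
# Lucene's source): supported versions lie between -1 and -11.
NEWEST_FORMAT = -11
OLDEST_FORMAT = -1


def read_format_version(path):
    """Return the first four bytes of a file as a big-endian signed int."""
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        raise ValueError("file too short to hold a format version: %r" % path)
    return struct.unpack(">i", header)[0]


def is_supported(version):
    """True if version falls inside the range named by the error message."""
    return NEWEST_FORMAT <= version <= OLDEST_FORMAT
```

A positive value such as 44132 falls outside that range, which is what one
would expect from a file written by a much older release.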

It looks like the second edition of the book has samples that were written 
for Lucene 3.0.2:
   http://www.manning.com/hatcher3/
   http://www.manning.com/hatcher3/LIAsourcecode.zip

So, I downloaded the new version of the samples, hoping to find a new
version of the wordnet index. But first, following the instructions in the
README, running 'ant test' in the lia2e directory fails with:
     Testcase: testWriteLock(lia.indexing.LockTest):	Caused an ERROR
     [junit] Unknown format version: -11
during the CreateTestIndex step. Mike, what could that be? (I'm running
Lucene 3.4.0.)

Ignoring this failure, 'ant SynonymAnalyzerViewer' runs fine. The new
version doesn't seem to use the wordnet index anymore. Yet the code
that would use it is still there, commented out, so I'm wondering what the
intent was.

But since you did the work, Thomas, I followed your instructions and rebuilt 
the wordnet index used by this sample in the earlier version and refreshed 
the indexes.tar.gz archive with the new wordnet one built from Lucene 3.4.0.
The other two indexes in there, t9 and distributed, most likely suffer from 
the same problem but I didn't check.

SynonymAnalyzerViewer.py is now run as part of the 'make test' suite.

This is checked into rev 1189735 of branch_3x.

Many thanks!

Andi..

> It should be noted that the SynonymAnalyzer examples are based on the lia
> book and implement their own Synonym support while there is currently
> already support for SynonymAnalyzer in java-lucene-3.4:  package
> org.apache.lucene.analysis.synonym;  (in contrib)
>
> see CHANGELOG
> LUCENE-3233, LUCENE-3375: Added SynonymFilter for applying multi-word
> synonyms during indexing or querying (with parsers for wordnet and solr
> formats). Removed contrib/wordnet.
>
> It's already included in the PyLucene core: lucene.SynonymFilter - however I
> couldn't find any samples / tests for this new feature - will have to play
> with this one as well... Let me know if anyone has experience with the
> new lucene.SynonymFilter and its possible advantages over the Python-based
> implementation (in
> pylucene-3.4\samples\LuceneInAction\lia\analysis\synonym).
>
>
> regards
> Thomas
> --
> OrbiTeam Software GmbH & Co. KG
> Endenicher Allee 35
> 53121 Bonn - Germany
> http://www.orbiteam.de
>
>
>
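
For anyone comparing the two approaches: the idea behind the book's synonym
sample, as far as it can be summarized here, is that each synonym is added
to the token stream at the same position as the original term, i.e. with a
position increment of 0. The sketch below shows only that idea in plain
Python. It does not use the PyLucene API, and the function name and the
synonym table are hypothetical.

```python
def inject_synonyms(tokens, synonyms):
    """Yield (term, position_increment) pairs.

    Each original token advances the position by 1. Its synonyms follow
    with increment 0, so they share the position of the original term.
    """
    for term in tokens:
        yield (term, 1)
        for synonym in synonyms.get(term, ()):
            yield (synonym, 0)


# Hypothetical synonym table, standing in for a wordnet lookup.
SYNONYMS = {"quick": ["fast", "speedy"], "jumps": ["leaps", "hops"]}

result = list(inject_synonyms("the quick fox jumps".split(), SYNONYMS))
```

Because the synonyms sit at the same position as the original word, a
phrase query for "fast fox" can match text that only contains "quick fox".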
