lucene-dev mailing list archives

From "Che Dong" <ched...@hotmail.com>
Subject Fw: [contrib]: XMLIndexer/StringFilter
Date Sun, 22 Sep 2002 20:09:29 GMT
Hi All:
I have packaged some source code that I wrote earlier to extend the Lucene project. I hope it
can be added to the sandbox, and that it leads to more discussion with others who are also
interested in the following issues:

customized search result sorting,
search result filtering,
a common XML indexing source format,
Asian language (Chinese, Korean, Japanese) word segmentation analyzer support,
etc.

File list: 
sample.xml: sample index source
lucene.dtd: XML document type definition (DTD) for the Lucene index source

org/apache/lucene
analysis/
        /cjk/CJKTokenizer.java: Java-based tokenizer (CJK bigram)
        /standard/StandardTokenizer.java: JavaCC-based tokenizer (CJK unigram, one character per token)
search/
      /IndexSearcher.java: supports sorting by docID (descending) in addition to sorting by score.
      /StringFilter.java: index filter using exact string match or prefix match
demo/
    /XMLIndexer.java: indexes an XML source that maps onto a Lucene index (not tested).
   
Regards

Che, Dong

README attached:

Lucene extension package

Author: Che, Dong <chedong@bigfoot.com>
$Header: /home/cvsroot/lucene_ext/README,v 1.1.1.1 2002/09/22 19:36:08 chedong Exp $

Introduction
============
This package contains source code that extends the Lucene project for several purposes:
customized search result sorting, search result filtering, a common XML indexing source
format, Asian language (Chinese, Korean, Japanese) word segmentation analyzer support, and so on.

File list: 

sample.xml: sample index source
lucene.dtd: XML document type definition (DTD) for the Lucene index source

org/apache/lucene
analysis/
        /cjk/CJKTokenizer.java: Java-based tokenizer (CJK bigram)
        /standard/StandardTokenizer.java: JavaCC-based tokenizer (CJK unigram, one character per token)
search/
      /IndexSearcher.java: supports sorting by docID (descending) in addition to sorting by score.
      /StringFilter.java: index filter using exact string match or prefix match
demo/
    /XMLIndexer.java: indexes an XML source that maps onto a Lucene index (not tested).
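
The exact-match and prefix-match behaviour described for StringFilter can be illustrated
without Lucene. The sketch below is not the attached StringFilter.java; the class name,
method name and the in-memory list of field values are assumptions made for illustration.
It marks matching documents in a java.util.BitSet, one bit per docID.

```java
import java.util.BitSet;
import java.util.List;

// Illustrative only: a stand-alone model of string/prefix filtering.
// StringFilterSketch and matches() are hypothetical names, not Lucene API.
class StringFilterSketch {

    // fieldValues.get(docId) is the value of the filtered field for that document.
    // When prefix is true a document passes if its value starts with pattern,
    // otherwise it passes only if the value equals pattern exactly.
    static BitSet matches(List<String> fieldValues, String pattern, boolean prefix) {
        BitSet bits = new BitSet(fieldValues.size());
        for (int docId = 0; docId < fieldValues.size(); docId++) {
            String value = fieldValues.get(docId);
            boolean hit = prefix ? value.startsWith(pattern) : value.equals(pattern);
            if (hit) {
                bits.set(docId); // this document survives the filter
            }
        }
        return bits;
    }
}
```

With the values "news/a", "blog/b", "news/c", a prefix match on "news/" sets bits 0 and 2,
while an exact match on "blog/b" sets only bit 1.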
   

INSTALL
=======
Required jars: lucene-version.jar and xerces.jar (only needed by XMLIndexer).
Please make sure these two jar files are included in your CLASSPATH.

Check that the JavaCC-related configuration in build.xml fits your environment.
Build:
ant
ant javadocs

TODO
====
1 Bigram-based word segmentation in StandardTokenizer.jj:
I am still not familiar with JavaCC. I am trying to use getNextToken() in StandardTokenizer.next() to
implement overlapping matches: C1C2C3C4 ==> C1C2 C2C3 C3C4
or even C1C2 C2C3 C3C4 C4 / C1 C1C2 C2C3 C3C4
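
The first overlapping form above (C1C2C3C4 ==> C1C2 C2C3 C3C4) can be sketched in plain
Java, independent of JavaCC. This is only an illustration of the splitting rule, not the
code in CJKTokenizer.java or StandardTokenizer.jj. The class and method names are
assumptions, and the sketch assumes each character is a single Java char (no surrogate pairs).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: overlapping bigram split of one run of CJK characters.
// OverlapBigrams and bigrams() are hypothetical names.
class OverlapBigrams {

    // Returns every pair of adjacent characters in the run.
    // A run of one character is returned as a single one-character token.
    static List<String> bigrams(String run) {
        List<String> tokens = new ArrayList<>();
        if (run.length() == 1) {
            tokens.add(run);
            return tokens;
        }
        for (int i = 0; i + 1 < run.length(); i++) {
            tokens.add(run.substring(i, i + 2)); // characters i and i+1
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Four CJK characters written as escapes to avoid source-encoding issues.
        System.out.println(bigrams("\u4e2d\u534e\u4eba\u6c11"));
    }
}
```

For the four-character input "ABCD", bigrams returns the three tokens AB, BC and CD. The
variants with a trailing C4 or a leading C1 would add one extra single-character token
at the end or the start of the list.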

2 More complex Lucene index source binding:
indexType: DateIndex etc. Make one lucene.dtd (or schema) the common Lucene indexing source
format:
source WORD       PDF     HTML    DB       other
         \          |       |      |         /
                       xml(lucene.dtd) 
                            |
                   XMLIndexer.build(XML InputSource)
                            |
                     Lucene INDEX
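
The middle step of the diagram, reading the common XML format before building the index,
might look roughly like the sketch below. The layout used here
(<document><field name="...">text</field></document>) is an assumption made for
illustration only; it is not taken from lucene.dtd, and readFields is a hypothetical
helper, not XMLIndexer.build. It uses the JDK's own DOM parser and stops at a name/value
map, which an indexer could then turn into Lucene fields.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: reads a hypothetical <field name="..."> layout into a map.
class XmlSourceSketch {

    // Parses the XML text and returns field name -> field text, in document order.
    static Map<String, String> readFields(String xml) {
        try {
            InputSource source = new InputSource(new StringReader(xml));
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(source);
            NodeList nodes = dom.getElementsByTagName("field"); // assumed element name
            Map<String, String> fields = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element field = (Element) nodes.item(i);
                fields.put(field.getAttribute("name"), field.getTextContent().trim());
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse index source", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<document><field name=\"title\">Lucene</field>"
                + "<field name=\"body\">search engine</field></document>";
        System.out.println(readFields(xml));
    }
}
```

Converters for WORD, PDF, HTML or DB sources would only need to emit this one XML format,
so the indexing side stays the same for every source type.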

3 IndexSearcher:
The lower-level search() API still cannot return results in docID order.
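
The ordering in question, docID descending instead of score descending, can be shown on
a plain list of hits. This is a stand-alone sketch, not the modified IndexSearcher.java;
the Hit class and the method name are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: reorder collected hits by docID, highest first.
class DocIdOrderSketch {

    // Hypothetical hit record: a document number and its relevance score.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Returns the docIDs of the hits sorted in descending docID order.
    // The scores are ignored for ordering and the input list is left unchanged.
    static List<Integer> docIdsDescending(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingInt((Hit h) -> h.doc).reversed());
        List<Integer> ids = new ArrayList<>();
        for (Hit h : sorted) {
            ids.add(h.doc);
        }
        return ids;
    }
}
```

If docIDs are assigned in insertion order, descending docID order lists the most recently
added documents first.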

4 Test suite for the above package.
