lucene-dev mailing list archives

From Doug Cutting <cutt...@lucene.com>
Subject Re: fixed url and How to contribute code to lucene sandbox?
Date Wed, 11 Sep 2002 22:09:15 GMT
Che Dong wrote:
> 1. Custom sorting besides the default score sorting: make docID an alias for the field you
> want the output sorted by. This is solved by sorting the data before indexing (for example,
> by the field PostDate), so that docID becomes an alias for the sort field. A HitCollector
> can then sort by docID, 1/docID, or even a more complex strategy (docID * score)...
> http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=115469
> IndexOrderSearcher: sort data before indexing and use 1/docID instead of score

That's an interesting approach.  I don't recall ever seeing this message 
when it was originally posted.  Sorry.
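
For concreteness, here is a rough sketch of the 1/docID re-scoring described above, done with a
plain HitCollector rather than a separate Searcher.  The class name IndexOrderCollector and the
PostDate field are only illustrative; this is not the actual IndexOrderSearcher code.

   import java.util.ArrayList;
   import java.util.Collections;
   import java.util.List;

   import org.apache.lucene.search.HitCollector;

   // Gather every matching document and rank by the substitute score
   // 1/(docID+1) instead of relevance, so lower document numbers come
   // first.  If documents were added to the index already sorted by
   // PostDate, that gives date-ordered results.
   public class IndexOrderCollector extends HitCollector {
     private List docs = new ArrayList();     // Integer doc numbers

     public void collect(int doc, float score) {
       docs.add(new Integer(doc));            // the relevance score is ignored
     }

     // Ranking by 1/(docID+1) is the same as sorting docIDs ascending.
     public List getRankedDocs() {
       Collections.sort(docs);
       return docs;
     }
   }

Feeding it to something like searcher.search(query, new IndexOrderCollector()) would then return
the hits in index order without introducing a new Searcher class.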

I had imagined instead adding this functionality to Hits.java.  Having 
a different Searcher implementation makes it possible for folks to use 
MultiSearcher to combine results from an IndexSearcher and an 
IndexOrderSearcher, which would not make sense.  If the functionality 
instead resides in Hits.java, then it could not be misused in this way.

So the way I was going to do it was to add something to Hits.java like:
   public static final int ORDER_BY_SCORE = 1;
   public static final int ORDER_BY_DOC_NUM = 2;
   public void setHitOrdering(int order);

If ORDER_BY_SCORE is specified then Hits would work as it does now.  This 
would be the default.  But when ORDER_BY_DOC_NUM is specified then 
Hits.java would use a HitCollector to implement this ordering.
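
From the caller's side the proposed API might be used roughly like this.  Note that
setHitOrdering() and ORDER_BY_DOC_NUM are only the proposal above and do not exist in Hits yet,
and the DocOrderExample class and PostDate field are just for illustration:

   import java.io.IOException;

   import org.apache.lucene.search.Hits;
   import org.apache.lucene.search.Query;
   import org.apache.lucene.search.Searcher;

   public class DocOrderExample {
     public static void printInDocOrder(Searcher searcher, Query query)
         throws IOException {
       Hits hits = searcher.search(query);
       hits.setHitOrdering(Hits.ORDER_BY_DOC_NUM);   // proposed, not existing API
       for (int i = 0; i < hits.length(); i++) {
         // With documents indexed in PostDate order, the hits now come back
         // in date order rather than relevance order.
         System.out.println(hits.doc(i).get("PostDate"));
       }
     }
   }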

> 2. CJK support:
>     2.1 Single-character (unigram) based, no word segmentation, each character becomes one
>     token: modified from StandardTokenizer.java
>     http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=330905
>     CJKTokenizer for Asian languages (Chinese, Japanese, Korean) word segmentation
>     http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=450266
>     StandardTokenizer with single-character CJK support
> 
>     2.2 Bigram-based word segmentation: modified from SimpleTokenizer into CJKTokenizer.java
>     http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg01220.html

I think it would be great to have some support for Asian languages built 
into Lucene.  Which of these approaches do you think is best?  I like 
the idea of a StandardTokenizer or SimpleTokenizer that automatically 
provides this via bigrams.  What do others think?
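
To make the bigram option concrete, here is a tiny standalone sketch of the idea, independent of
the actual CJKTokenizer code: a run of CJK characters C1 C2 C3 becomes the overlapping tokens
C1C2 and C2C3.  The class name and character ranges below are only for illustration.

   import java.util.ArrayList;
   import java.util.List;

   public class BigramSketch {

     // Rough CJK test covering the main ideograph, kana and hangul ranges;
     // a real tokenizer would be more careful about the boundaries.
     static boolean isCJK(char c) {
       return (c >= 0x4E00 && c <= 0x9FFF)     // CJK unified ideographs
           || (c >= 0x3040 && c <= 0x30FF)     // hiragana and katakana
           || (c >= 0xAC00 && c <= 0xD7A3);    // hangul syllables
     }

     // Split each CJK run into overlapping two-character tokens; an isolated
     // CJK character becomes a one-character token.  Non-CJK text is simply
     // dropped here for brevity; a real tokenizer would pass it through.
     static List bigrams(String text) {
       List tokens = new ArrayList();
       int runStart = -1;
       for (int i = 0; i <= text.length(); i++) {
         boolean cjk = i < text.length() && isCJK(text.charAt(i));
         if (cjk && runStart < 0) {
           runStart = i;                        // a CJK run begins
         } else if (!cjk && runStart >= 0) {    // a CJK run just ended
           if (i - runStart == 1) {
             tokens.add(text.substring(runStart, i));
           } else {
             for (int j = runStart; j + 2 <= i; j++) {
               tokens.add(text.substring(j, j + 2));
             }
           }
           runStart = -1;
         }
       }
       return tokens;
     }

     public static void main(String[] args) {
       // "\u4e2d\u6587\u5168\u6587" (four CJK characters) yields three bigrams.
       System.out.println(bigrams("Lucene\u4e2d\u6587\u5168\u6587"));
     }
   }

The actual patches linked above hook this kind of logic into the existing tokenizers
(StandardTokenizer / SimpleTokenizer) rather than using a standalone class like this.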

Doug




