lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 23650] - docs out of order
Date Thu, 02 Mar 2006 23:16:41 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=23650>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=23650





------- Additional Comments From apachez@home.se  2006-03-03 00:16 -------
This also happens in Plucene, the Perl port of Lucene (the latest version at 
the time of writing is 1.24, from CPAN).

I have tracked it down: whether the error occurs depends on which characters 
the tokenizer accepts.

I use WhitespaceAnalyzer because I want to cover Swedish characters and other 
characters outside a-z, which is all that SimpleAnalyzer accepts. The default 
token pattern of WhitespaceTokenizer (which WhitespaceAnalyzer uses) is:

sub token_re { qr/\S+/ }

With the default pattern above, indexing fails with an error similar to:

Docs out of order (44 < 53) 
at /usr/local/share/perl/5.8.4/Plucene/Index/SegmentMerger.pm line 149.

But when I change the token_re function to:

sub token_re { qr/[a-z\d]+/ }

which only allows a-z and 0-9, indexing runs without any problems whatsoever 
(at least I don't get the above error message).

But when I add Swedish characters to the pattern, such as:

sub token_re { qr/[a-zåäö\d]+/ }

the error of "Docs out of order" returns...
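
To make the difference between the patterns concrete, here is a minimal 
sketch of the tokens each one yields for a short Swedish string. It does not 
use Plucene and does not reproduce the "Docs out of order" error; it only 
shows which tokens would reach the index. The tokens() helper and the sample 
text are hypothetical, and the third pattern (å, ä and ö added to the 
character class) is an assumption about what the Swedish variant looks like.

```perl
use strict;
use warnings;
use utf8;                               # source contains UTF-8 literals
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper, not part of Plucene: return every match of $re in $text.
sub tokens {
    my ($re, $text) = @_;
    return $text =~ /($re)/g;           # list context: one capture per match
}

my $text = "smörgås 42 räka";           # made-up sample input

my @patterns = (
    [ 'whitespace default' => qr/\S+/           ],
    [ 'a-z and digits'     => qr/[a-z\d]+/      ],
    [ 'with Swedish chars' => qr/[a-zåäö\d]+/   ],  # assumed variant
);

for my $p (@patterns) {
    my ($name, $re) = @$p;
    my @t = tokens($re, $text);
    printf "%-20s %d tokens: %s\n", $name, scalar(@t), join('|', @t);
}
```

With the ASCII-only pattern, words are split at every å, ä and ö, so the 
index sees a different (and larger) set of terms than with the other two 
patterns. That may be relevant when comparing runs that fail with runs that 
do not.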

Kind Regards
Apachez

