lucene-java-user mailing list archives

From Ilya Zavorin
Subject RE: can't find common words -- using Lucene 3.4.0
Date Wed, 28 Mar 2012 14:46:36 GMT

I had to pull different pieces of the code below from different places in my system, but here
is what I do:

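		Analyzer anIndx = new StandardAnalyzer(Version.LUCENE_34);
		IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_34, anIndx);
		if (create) {
			// ... (body of this branch is not included in the message)
		}
		// The directory line was garbled; FSDirectory.open is an assumed reconstruction.
		Directory dir = FSDirectory.open(new File(fPath));
		IndexWriter writer = new IndexWriter(dir, iwc);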

Anything suspicious here?


Ilya Zavorin

-----Original Message-----
From: Steven A Rowe
Sent: Monday, March 26, 2012 1:48 PM
Subject: RE: can't find common words -- using Lucene 3.4.0 

On 3/26/2012 at 12:21 PM, Ilya Zavorin wrote:
> I am not seeing anything suspicious. Here's what I see in the HEX:
> "n.e" from "pain.electricity": 6E-2E-0D-0A-0D-0A-65 (n-.-CR-LF-CR-LF-e)
> "e.H" from "sentence.He": 65-2E-0D-0A-48

I agree, standard DOS/Windows line endings.
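
For anyone who wants to check such a dump without reading the bytes by hand, here is a minimal sketch that decodes the dash-separated hex shown above into readable names. The class name HexDumpDemo and the method describe are made up for this illustration, and it assumes every byte is a single ASCII character.

```java
import java.util.ArrayList;
import java.util.List;

class HexDumpDemo {
    // Converts a dash-separated hex dump such as "6E-2E-0D-0A" into
    // character names, spelling out carriage return and line feed.
    static String describe(String hexDump) {
        List<String> names = new ArrayList<String>();
        for (String hex : hexDump.split("-")) {
            int code = Integer.parseInt(hex, 16);
            if (code == 0x0D) {
                names.add("CR");
            } else if (code == 0x0A) {
                names.add("LF");
            } else {
                // Assumption: anything else is a printable ASCII character.
                names.add(String.valueOf((char) code));
            }
        }
        return String.join("-", names);
    }

    public static void main(String[] args) {
        System.out.println(describe("6E-2E-0D-0A-0D-0A-65")); // n-.-CR-LF-CR-LF-e
        System.out.println(describe("65-2E-0D-0A-48"));       // e-.-CR-LF-H
    }
}
```

A CR immediately followed by an LF (0D-0A) is the DOS/Windows line ending; the first dump contains two in a row, i.e. a blank line between the sentences.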

> I am pretty sure I am using the std analyzer

Interesting.  I'm quite sure something other than StandardAnalyzer is involved here, since
StandardAnalyzer (more specifically, StandardTokenizer) always breaks tokens on whitespace and
excludes punctuation at the end of tokens.  In case you're interested, the "standard" to which
StandardTokenizer (v3.1 - v3.5) conforms is the Word Boundaries rules from Unicode 6.0.0
Standard Annex #29, aka UAX#29.
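
As a rough illustration of word-boundary segmentation on the strings from the hex dump, here is a sketch using the JDK's java.text.BreakIterator. This is a stand-in, not Lucene's StandardTokenizer: the JDK rules are similar in spirit to UAX#29 but may differ in detail, and the class name WordBoundaryDemo and its helper methods are made up for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBoundaryDemo {
    // Splits text at the JDK's word boundaries and keeps only the segments
    // that contain at least one letter or digit, which drops punctuation
    // and line-ending segments.
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<String>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (hasLetterOrDigit(segment)) {
                out.add(segment);
            }
        }
        return out;
    }

    static boolean hasLetterOrDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetterOrDigit(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The same character sequences as in the hex dump, with CR LF line endings.
        System.out.println(words("pain.\r\n\r\nelectricity"));
        System.out.println(words("sentence.\r\nHe"));
    }
}
```

With the line endings in place, the period is followed by whitespace rather than a letter, so the words on either side should come out as separate segments. To see what Lucene itself produces, the analyzer's own token stream would have to be inspected instead.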

Can you share the code where you construct your analyzer and IndexWriterConfig?

> Here's how I add a doc to the index (oc is String containing the whole document):
> doc.add(new Field("contents", 
> 		oc, 
> 		Field.Store.YES,
> 		Field.Index.ANALYZED, 
> 		Field.TermVector.WITH_POSITIONS_OFFSETS));
> Can this affect the indexing?

The way you add the Field looks fine.

