lucene-java-user mailing list archives

From Ziqi Zhang <ziqi.zh...@sheffield.ac.uk>
Subject lucene deliberately removes \r (windows carriage char)
Date Sat, 03 Oct 2015 15:01:19 GMT
Hi

I am trying to pinpoint a mismatch in the offsets produced by the
Lucene indexing process: when I use those offsets to take substrings of
the original document content, the results do not line up.

I debugged as far as I could, but I lost track of what Lucene is doing
at line 298 of DefaultIndexingChain (Lucene 5.3.0):

for (IndexableField field : docState.doc) {
  fieldCount = processField(field, fieldGen, fieldCount);
}

At this point I can see that the content field I am interested in (one
of the IndexableField instances) has already had every "\r" removed from
the Windows "\r\n" line endings in its content. However, I am unable to
trace how these IndexableField instances are created, or how the raw
content is passed to them.

I am certain that my program passed in strings containing many "\r\n"
sequences.
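
To illustrate the symptom, here is a minimal standalone sketch. It uses
only the plain JDK and makes no Lucene calls, and it assumes the offsets
were computed against a copy of the text with every "\r" removed, which
is what I appear to be seeing. The class and method names are
hypothetical and exist only for this illustration.

```java
public class CarriageReturnOffsetDemo {

    // Maps an offset in the "\r"-stripped text to the index of the same
    // character in the original text. Returns original.length() if the
    // offset lies past the last non-"\r" character.
    static int toOriginalOffset(String original, int strippedOffset) {
        int seen = 0; // number of non-'\r' characters passed so far
        for (int i = 0; i < original.length(); i++) {
            if (original.charAt(i) == '\r') {
                continue; // this character does not exist in the stripped text
            }
            if (seen == strippedOffset) {
                return i;
            }
            seen++;
        }
        return original.length();
    }

    // Takes [start, end) offsets measured on the stripped text and returns
    // the matching substring of the original text. The end is mapped via
    // its last included character so a "\r" just before a "\n" is not
    // pulled into the result.
    static String substringByStrippedOffsets(String original, int start, int end) {
        int origStart = toOriginalOffset(original, start);
        int origEnd = end > start
                ? toOriginalOffset(original, end - 1) + 1
                : origStart;
        return original.substring(origStart, origEnd);
    }

    public static void main(String[] args) {
        String original = "first line\r\nsecond line\r\nthird";
        String stripped = original.replace("\r", "");

        // Offsets of "second" as they would be measured on the stripped text.
        int start = stripped.indexOf("second");
        int end = start + "second".length();

        // Using those offsets directly on the original is off by one
        // character for every "\r" that precedes the token.
        String naive = original.substring(start, end);
        String corrected = substringByStrippedOffsets(original, start, end);

        System.out.println("naive matches token:     " + naive.equals("second"));
        System.out.println("corrected matches token: " + corrected.equals("second"));
    }
}
```

If the offsets really are relative to the stripped text, a mapping like
this would let me substring from the original content as a workaround,
but it does not explain where the "\r" characters are being removed.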

So the question is: is this removal of "\r" deliberate?

Thanks


