lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: lucene deliberately removes \r (windows carriage char)
Date Sat, 03 Oct 2015 16:37:44 GMT
Are you using MappingCharFilter?

It unfortunately has known bugs which require controversial API
changes to fix: https://issues.apache.org/jira/browse/LUCENE-6595

Mike McCandless

http://blog.mikemccandless.com

On Sat, Oct 3, 2015 at 6:02 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> Hi,
>
> Lucene does not remove the \r\n while indexing or storing fields. The Analyzer just splits,
> e.g., at whitespace (depending on the Analyzer). So if your original data has \r\n, then the
> offsets will reflect that (it counts as 2 chars).
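
To make the "counts 2 chars" point concrete, here is a minimal plain-Java sketch. It does not use Lucene at all: `OffsetDemo` and `whitespaceOffsets` are hypothetical names, and the loop only imitates a whitespace-splitting tokenizer that records start and end offsets. It is an illustration of the arithmetic, not of what any particular Analyzer does internally.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetDemo {
    // Hypothetical helper (not a Lucene API): split on whitespace and
    // record each token as a [start, end) pair of char offsets.
    static List<int[]> whitespaceOffsets(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i <= text.length(); i++) {
            boolean ws = i == text.length() || Character.isWhitespace(text.charAt(i));
            if (!ws && start < 0) {
                start = i;                        // token begins
            } else if (ws && start >= 0) {
                spans.add(new int[] {start, i});  // token ends
                start = -1;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "foo\r\nbar";
        for (int[] s : whitespaceOffsets(text)) {
            System.out.println(text.substring(s[0], s[1]) + " [" + s[0] + "," + s[1] + ")");
        }
    }
}
```

With "\r\n" in the input, "bar" starts at offset 5; with a bare "\n" it would start at offset 4. Offsets computed on one form and applied to the other drift by one char per line.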
>
> Could it be that you read it using a BufferedReader per line and pass as Strings?
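
The BufferedReader suspicion is easy to check outside Lucene. The sketch below assumes the content is read line by line and re-joined with "\n"; whether the original poster's program does this is not known from the thread. `LineEndingDemo` and `readPerLine` are hypothetical names. `BufferedReader.readLine()` returns each line without its terminator, so any "\r\n" in the source is lost unless it is added back explicitly.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class LineEndingDemo {
    // Hypothetical helper: read text line by line and re-join with "\n",
    // which is a common way to lose "\r" without noticing.
    static String readPerLine(String raw) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(raw))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() has already dropped "\r\n", "\n" or "\r"
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "foo\r\nbar\r\n";
        String rebuilt = readPerLine(raw);
        System.out.println("raw length:     " + raw.length());
        System.out.println("rebuilt length: " + rebuilt.length());
        System.out.println("'bar' in raw at:     " + raw.indexOf("bar"));
        System.out.println("'bar' in rebuilt at: " + rebuilt.indexOf("bar"));
    }
}
```

If the string handed to the field was built this way, the offsets Lucene reports are correct for that string but will not line up with the original file on disk.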
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>> -----Original Message-----
>> From: Ziqi Zhang [mailto:ziqi.zhang@sheffield.ac.uk]
>> Sent: Saturday, October 03, 2015 5:01 PM
>> To: java-user@lucene.apache.org
>> Subject: lucene deliberately removes \r (windows carriage char)
>>
>> Hi
>>
>> I am trying to pinpoint a mismatch between the offsets produced by the lucene
>> indexing process and the original document content: when I use the offsets
>> to substring from the original content, the results do not line up.
>>
>> I debugged as far as I could, but I lost track of lucene at line 298
>> of DefaultIndexingChain (lucene 5.3.0):
>>
>> for (IndexableField field : docState.doc) {
>>   fieldCount = processField(field, fieldGen, fieldCount);
>> }
>>
>> At this point I can see that the content field (one of the
>> IndexableFields) I am interested in has already had every "\r" removed
>> from the "\r\n" (windows) newlines in the content. But I am unable to
>> trace how these IndexableFields are generated, and how the raw content is
>> passed to them.
>>
>> I am certain that my program passed strings with lots of "\r\n".
>>
>> So the question is: is this (i.e., removing \r) deliberate?
>>
>> Thanks
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
