lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: lucene deliberately removes \r (windows carriage char)
Date Sat, 03 Oct 2015 23:04:16 GMT
Hi,

I suspect Solr is causing this, so it is better to ask on their list. I am almost 100% sure
this has nothing to do with Lucene! The ReusableStringReader you see is there because Solr
sets the field contents as a String. If the String behind that reader no longer contains \r,
then the \r was removed on the Solr side before Lucene ever saw the text.
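
To illustrate why the reader itself is not the suspect, here is a minimal sketch using only the
JDK. It uses java.io.StringReader as a stand-in for Lucene's ReusableStringReader (that
substitution, and the names ReaderPassThrough, drain and countCr, are assumptions made for this
example). A reader over a String hands out the characters unchanged, so a missing \r must
already be missing from the String.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReaderPassThrough {

    // Drain a Reader into a String using char-based reads only
    // (no readLine(), which would swallow line terminators).
    static String drain(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Count carriage returns so the value can be compared at each stage.
    static long countCr(String s) {
        return s.chars().filter(c -> c == '\r').count();
    }

    public static void main(String[] args) throws IOException {
        String original = "ok\r\nhere is the text\r\n";
        String fromReader = drain(new StringReader(original));

        System.out.println(countCr(original));           // prints 2
        System.out.println(countCr(fromReader));         // prints 2
        System.out.println(original.equals(fromReader)); // prints true
    }
}
```

Logging something like countCr on the client before the document is sent, and again where the
field value arrives on the server, should narrow down where the \r is lost.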

Please ask on:
solr-user@lucene.apache.org

This could, for example, be caused by the network transfer in XML or JSON format, where
newlines may be normalized (an XML parser converts \r\n to \n when it reads a document).
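
The XML case can be reproduced with the JDK's own parser. This is only a sketch: the element
name "field" and the class and method names are made up for the example, and it says nothing
about which transport format your client actually uses. It just shows that literal \r\n in
element content comes back as \n after parsing.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XmlNewlineDemo {

    // Wrap the value in a single element, parse it, and return the text
    // content as the receiving side would see it.
    // Assumption: the value contains no XML special characters (<, >, &).
    static String roundTrip(String value) throws Exception {
        String xml = "<field name=\"content\">" + value + "</field>";
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String sent = "ok\r\nhere is the text\r\n";
        String received = roundTrip(sent);

        System.out.println(sent.length());                // prints 22
        System.out.println(received.length());            // prints 20
        System.out.println(received.indexOf('\r') == -1); // prints true
    }
}
```

Every \r\n that is shortened to \n shifts all later offsets by one character, which would
explain offsets that no longer match the original document.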

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Ziqi Zhang [mailto:ziqi.zhang@sheffield.ac.uk]
> Sent: Saturday, October 03, 2015 7:39 PM
> To: java-user@lucene.apache.org
> Subject: Re: lucene deliberately removes \r (windows carriage char)
> 
> Well, this is very strange then. If I knew where exactly those "IndexableField"
> objects are constructed in the pipeline, I could possibly pin down the bug...
> 
> In any case, no, I did not use MappingCharFilter or a BufferedReader.
> The way I pass content to be analysed is straightforward:
>  >>>
> SolrInputDocument solrDoc = new SolrInputDocument();
> solrDoc.addField("content", "ok\r\nhere is the text\r\n"); ......
> 
> 
> The schema for the field "content" to be analysed begins with taking the text
> content in the field for tokenization:
>  >>>
> <analyzer type="index">
>                  <tokenizer
> class="org.apache.lucene.analysis.opennlp.OpenNLPTokenizerFactory"
>                              sentenceModel="en-sent.bin"
>                              tokenizerModel="en-token.bin"/> .............
> 
> 
> Here OpenNLPTokenizerFactory creates an OpenNLPTokenizer, which is
> identical to the code provided at
> https://issues.apache.org/jira/browse/LUCENE-2899
> except that I adapted it to Lucene 5.3
> 
> And by looking at the source code of OpenNLPTokenizer, I can see it is using
> the "input" variable (type Reader) of the superclass Tokenizer to get the text
> content to be analyzed. At runtime through debugging I see that "input" is
> instantiated as a "ReusableStringReader", and you can see the string value
> has become "ok\nhere is the text\n"
> 
> Any other thoughts please?
> 
> 
> On 03/10/2015 17:37, Michael McCandless wrote:
> > Are you using MappingCharFilter?
> >
> > It unfortunately has known bugs which require controversial API
> > changes to fix: https://issues.apache.org/jira/browse/LUCENE-6595
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Sat, Oct 3, 2015 at 6:02 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> >> Hi,
> >>
> >> Lucene does not remove the \r\n while indexing or storing fields. The
> >> Analyzer just splits, e.g., at whitespace (depending on the Analyzer). So if
> >> your original data has \r\n, then the offsets would reflect that (it counts
> >> as 2 chars).
> >>
> >> Could it be that you read it using a BufferedReader per line and pass the
> >> lines as Strings?
> >>
> >> Uwe
> >>
> >> -----
> >> Uwe Schindler
> >> H.-H.-Meier-Allee 63, D-28213 Bremen
> >> http://www.thetaphi.de
> >> eMail: uwe@thetaphi.de
> >>
> >>
> >>> -----Original Message-----
> >>> From: Ziqi Zhang [mailto:ziqi.zhang@sheffield.ac.uk]
> >>> Sent: Saturday, October 03, 2015 5:01 PM
> >>> To: java-user@lucene.apache.org
> >>> Subject: lucene deliberately removes \r (windows carriage char)
> >>>
> >>> Hi
> >>>
> >>> I am trying to pin-point a mismatch between the offsets produced by
> >>> lucene indexing process when I use the offsets to substring from the
> >>> original document content.
> >>>
> >>> I tried to debug as far as I could, but I lost track of lucene when I
> >>> reached line 298 of DefaultIndexingChain (lucene 5.3.0):
> >>>
> >>> for (IndexableField field : docState.doc) {
> >>>   fieldCount = processField(field, fieldGen, fieldCount);
> >>> }
> >>>
> >>> Basically, at this point I can see that the content field (one of the
> >>> IndexableFields) I am interested in has already had every "\r" removed
> >>> from its "\r\n" (Windows) newline sequences. But I am unable to trace
> >>> how these IndexableFields are generated, and how the raw content is
> >>> passed to them.
> >>>
> >>> I can be certain that my program did pass strings with lots of "\r\n"
> >>>
> >>> So the question is: is this (i.e., removing \r) deliberate?
> >>>
> >>> Thanks
> >>>
> >>>
> >>>
> >>> --------------------------------------------------------------------
> >>> - To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> 
> 
> --
> Ziqi Zhang
> Research Associate
> Department of Computer Science
> University of Sheffield
> 
> 


