lucene-solr-dev mailing list archives

From Grant Ingersoll
Subject Re: Motivation for white space after entities in HTMLStripReader
Date Sat, 22 Nov 2008 22:06:43 GMT
Sure, a patch would be fine.

On Nov 22, 2008, at 4:31 AM, Dawid Weiss wrote:

> Thanks Grant. You mean this issue: 
> , I see now. This is a problem for me only, I guess, because I use  
> HTMLStripReader independently of the Lucene architecture. This class  
> is public, would it make sense if I provided a patch that would  
> switch the whitespace emitting functionality on and off, depending  
> on a particular person's use case?
> Dawid
> Grant Ingersoll wrote:
>> It is an attempt at making things work properly with the  
>> highlighter (such that offsets are correct).  I believe it works  
>> most of the time, but there still might be a few issues; check JIRA.
>> -Grant
>> On Nov 21, 2008, at 5:29 PM, Dawid Weiss wrote:
>>> Hi folks. What's the motivation for adding exactly that number of  
>>> white spaces after an entity reference in HTMLStripReader? It  
>>> basically looks like this:
>>> "l&oacute;d"
>>> (UTF: lód, "ice" in Polish) is translated into:
>>> "ló       d"
>>> This happens both with numeric entities and named entities.  
>>> Needless to say, these added spaces in the character stream do no  
>>> good as they effectively split a single term "lód" into two  
>>> meaningless terms "l" and "d".
>>> I can fix this in the code easily, but it looks like it was  
>>> intentional, so before I write test cases and file a JIRA issue  
>>> I would like to understand what the original reasons might have  
>>> been (I really don't see anything this would be useful for).  
>>> Apologies if I'm being dim here.
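
The trade-off discussed in this thread can be illustrated with a small sketch. It is not the HTMLStripReader source: the class name EntityPadSketch, the strip method, the pad flag and the tiny entity table are all hypothetical, chosen only to show why padding keeps offsets aligned for a highlighter and why it also splits "lód" into two terms. The pad flag stands in for the on/off switch proposed above.

```java
import java.util.Map;

/** Illustrative only; not the Solr HTMLStripReader implementation. */
final class EntityPadSketch {
    // Tiny hypothetical entity table; a real stripper knows many more.
    private static final Map<String, Character> ENTITIES =
            Map.of("oacute", '\u00f3', "amp", '&', "lt", '<', "gt", '>');

    /**
     * Replaces known named entities with their character. When pad is true,
     * spaces are appended after the character so the output is exactly as
     * long as the input and later characters keep their original offsets.
     */
    static String strip(String in, boolean pad) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            int semi = (c == '&') ? in.indexOf(';', i + 1) : -1;
            Character decoded =
                    (semi > 0) ? ENTITIES.get(in.substring(i + 1, semi)) : null;
            if (decoded == null) {
                // Not a known entity: copy the character through unchanged.
                out.append(c);
                i++;
                continue;
            }
            out.append(decoded);
            if (pad) {
                // The entity spanned (semi - i + 1) chars and one was emitted,
                // so (semi - i) spaces restore the original length.
                for (int k = 0; k < semi - i; k++) {
                    out.append(' ');
                }
            }
            i = semi + 1;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String in = "l&oacute;d";
        String padded = strip(in, true);
        String compact = strip(in, false);
        // Padded output keeps 'd' at its original offset but breaks the word.
        System.out.println("[" + padded + "] d at " + padded.indexOf('d'));
        // Compact output keeps the word whole but shifts 'd' to the left.
        System.out.println("[" + compact + "] d at " + compact.indexOf('d'));
    }
}
```

With padding on, offsets computed on the stripped text can be applied directly to the original markup, which is what a highlighter needs. With padding off, a whitespace-based tokenizer sees a single term, which suits use outside the Lucene analysis chain.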
