cocoon-dev mailing list archives

From Vadim Gritsenko <va...@reverycodes.com>
Subject Re: cvs commit: cocoon-2.1 status.xml
Date Tue, 09 Mar 2004 12:43:28 GMT
Joerg Heinicke wrote:

> On 09.03.2004 02:39, Vadim Gritsenko wrote:
>
>>>>       public void characters(char[] ch, int start, int length) {
>>>>             if (ch.length > 0 && start >= 0 && length > 1) {
>>>>  -            String text = new String(ch, start, length);
>>>>               if (elementStack.size() > 0) {
>>>>                   IndexHelperField tos = (IndexHelperField) elementStack.peek();
>>>>  -                tos.appendText(text);
>>>>  +                tos.appendText(ch, start, length);
>>>>               }
>>>>  -            bodyText.append(text);
>>>>  +            bodyText.append(' ');
>>>>  +            bodyText.append(ch, start, length);
>>>>           }
>>>>       }
>>>>
>>>
>>> What will happen when "keyword" text is streamed as two characters 
>>> events, "key" and "word"? I think it will become "key word", and 
>>> indexing will break.
>>>
>>> IIUC, the idea was to add a space between tags, so that 
>>> <p>some</p><p>text</p> is not indexed as "sometext". If that's 
>>> correct, then a better fix would be to add the space only if a boolean 
>>> flag had_start_or_end_element_in_between_char_events is set.
>>
>>
>> Joerg?
>
>
> Your mail was neither ignored nor accidentally deleted - I just didn't 
> know what to write, but marked it as important in a nice red 
> color in Mozilla :)


:-)


> Yes, I see your objection - and I already asked for such objections in the bug 
> http://nagoya.apache.org/bugzilla/show_bug.cgi?id=25934 ;)
>
> So what are the practical use cases in which this might occur? Maybe it's 
> only a theoretical problem, depending on the "thing" the index is created 
> from? Which SAX stream does the LuceneIndexHandler operate on?


I remember there were already issues in other components with text being 
split across multiple characters events. So, think of this as 
preventive maintenance.


> I also don't understand what you mean by 
> "had_start_or_end_element_in_between_char_events". But I had a look at 
> endElement(). It gets the elements from a stack and already tests 
> for text:
>     if (text != null && text.length() > 0) {
> Would it make sense to add the space in endElement() if the element 
> contains text, i.e. if the above is true?


This was my first thought... But then multiple closing tags would cause 
multiple spaces... So, I thought, this should work:

startElement:
    flag = true;

endElement:
    flag = true;

characters:
    if (flag) {
        x.append(' ');
        flag = false;
    }

Does it solve the problem?
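
To make the idea concrete, here is a minimal, self-contained sketch of the flag approach. It is not the actual LuceneIndexHandler code: the class name BoundaryFlagSketch, the field elementBoundarySeen and the text() accessor are made up for illustration, and the argument-less startElement()/endElement() stand in for the real SAX callbacks. It only shows that one space is added per run of tags, and that text split across several characters events is not broken apart.

```java
// Hypothetical stand-in for the text-collecting part of a SAX handler.
// Names and signatures are assumptions for illustration only.
class BoundaryFlagSketch {
    private final StringBuilder bodyText = new StringBuilder();

    // True when a start or end tag was seen since the last characters event.
    private boolean elementBoundarySeen = false;

    void startElement() {
        elementBoundarySeen = true;
    }

    void endElement() {
        elementBoundarySeen = true;
    }

    void characters(char[] ch, int start, int length) {
        if (length <= 0) {
            return;
        }
        if (elementBoundarySeen) {
            // One space per run of tags, however many tags were in the run.
            bodyText.append(' ');
            elementBoundarySeen = false;
        }
        // No intermediate String is created for the text chunk.
        bodyText.append(ch, start, length);
    }

    String text() {
        return bodyText.toString();
    }

    public static void main(String[] args) {
        // <p>some</p><p>text</p>
        BoundaryFlagSketch tags = new BoundaryFlagSketch();
        tags.startElement();
        tags.characters("some".toCharArray(), 0, 4);
        tags.endElement();
        tags.startElement();
        tags.characters("text".toCharArray(), 0, 4);
        tags.endElement();
        System.out.println("[" + tags.text() + "]");

        // <p>keyword</p> delivered as two characters events
        BoundaryFlagSketch split = new BoundaryFlagSketch();
        split.startElement();
        split.characters("key".toCharArray(), 0, 3);
        split.characters("word".toCharArray(), 0, 4);
        split.endElement();
        System.out.println("[" + split.text() + "]");
    }
}
```

Note that, as in the pseudocode above, the very first characters event after the opening tag also gets a leading space. That should be harmless for indexing, but it could be skipped by checking whether the buffer is still empty.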

Vadim


