lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 19253] New: - HTML parser should treat <td> as a word break element
Date Wed, 23 Apr 2003 16:53:22 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=19253>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=19253

           Summary: HTML parser should treat <td> as a word break element
           Product: Lucene
           Version: 1.2
          Platform: All
               URL: http://bugs.eclipse.org/bugs/show_bug.cgi?id=36378
        OS/Version: All
            Status: NEW
          Severity: Minor
          Priority: Other
         Component: Examples
        AssignedTo: lucene-dev@jakarta.apache.org
        ReportedBy: konradk@ca.ibm.com


When parsing HTML such as " abc</td><dt>xyz ", the HTML parser skips over the
elements and concatenates the text around them without inserting white space,
in this case producing abcxyz.  A search of the resulting index for abc will
therefore find nothing.

At least for the tags <td>, <p>, <br>, <blockquote>, <dt>, <h1> - <h6>, <li>,
and <q>, the parser should separate the strings on both sides of the tag with a
space.  Using square brackets, "[" or "]", to separate the strings would also
work, as they are already used for the text in the ALT attribute of images.
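
To illustrate the requested behaviour, here is a minimal sketch. It is not the
Lucene demo HTML parser and makes no claim about how that parser is built; the
class name TagBreakSketch and the method extractText are hypothetical, and the
regular-expression approach is an assumption chosen only to keep the example
short.  It replaces the tags listed above with a space and drops every other
tag without adding one.

```java
import java.util.regex.Pattern;

public class TagBreakSketch {
    // Tags named in this report as ones that should act as word breaks.
    private static final Pattern BREAK_TAG = Pattern.compile(
        "</?(td|p|br|blockquote|dt|h[1-6]|li|q)\\b[^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Any remaining tag, e.g. inline elements such as <b> or <i>.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    /** Hypothetical helper: strips tags, inserting a space at word-break tags. */
    public static String extractText(String html) {
        // Word-break tags become a single space.
        String text = BREAK_TAG.matcher(html).replaceAll(" ");
        // All other tags are removed without adding white space.
        text = ANY_TAG.matcher(text).replaceAll("");
        // Collapse runs of white space and trim the ends.
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(extractText(" abc</td><dt>xyz ")); // prints "abc xyz"
        System.out.println(extractText("a<b>b</b>c"));        // prints "abc"
    }
}
```

With this behaviour the example from the report yields two tokens, abc and xyz,
while text split only by inline tags stays a single word.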

A workaround is to add spaces when authoring the HTML, but that is not always
possible when the documents are created by somebody else.

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

