Delivered-To: apmail-jakarta-lucene-dev-archive@apache.org
Received: (qmail 76080 invoked from network); 23 Apr 2003 16:51:55 -0000
Received: from exchange.sun.com (192.18.33.10)
  by daedalus.apache.org with SMTP; 23 Apr 2003 16:51:55 -0000
Received: (qmail 12611 invoked by uid 97); 23 Apr 2003 16:53:54 -0000
Delivered-To: qmlist-jakarta-archive-lucene-dev@nagoya.betaversion.org
Received: (qmail 12604 invoked from network); 23 Apr 2003 16:53:54 -0000
Received: from daedalus.apache.org (HELO apache.org) (208.185.179.12)
  by nagoya.betaversion.org with SMTP; 23 Apr 2003 16:53:54 -0000
Received: (qmail 73561 invoked by uid 500); 23 Apr 2003 16:51:24 -0000
Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
List-Id: "Lucene Developers List"
Reply-To: "Lucene Developers List"
Delivered-To: mailing list lucene-dev@jakarta.apache.org
Received: (qmail 73457 invoked from network); 23 Apr 2003 16:51:23 -0000
Received: from exchange.sun.com (192.18.33.10)
  by daedalus.apache.org with SMTP; 23 Apr 2003 16:51:23 -0000
Received: (qmail 12571 invoked by uid 50); 23 Apr 2003 16:53:22 -0000
Date: 23 Apr 2003 16:53:22 -0000
Message-ID: <20030423165322.12570.qmail@nagoya.betaversion.org>
From: bugzilla@apache.org
To: lucene-dev@jakarta.apache.org
Subject: DO NOT REPLY [Bug 19253] New: - HTML parser should treat as a word break element
X-Spam-Rating: daedalus.apache.org 1.6.2 0/1000/N

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG RELATED COMMENTS
THROUGH THE WEB INTERFACE. ANY REPLY MADE TO THIS MESSAGE WILL NOT BE
COLLECTED AND INSERTED IN THE BUG DATABASE.
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=19253

HTML parser should treat as a word break element

           Summary: HTML parser should treat as a word break element
           Product: Lucene
           Version: 1.2
          Platform: All
               URL: http://bugs.eclipse.org/bugs/show_bug.cgi?id=36378
        OS/Version: All
            Status: NEW
          Severity: Minor
          Priority: Other
         Component: Examples
        AssignedTo: lucene-dev@jakarta.apache.org
        ReportedBy: konradk@ca.ibm.com


When parsing HTML code in which "abc" and "xyz" are separated only by an
element, with no white space between them, the HTML parser skips over the
element and concatenates the text around it without inserting white space,
in that case producing "abcxyz". Searching the resulting index will then
fail to find "abc".

At least for tags [tag names not preserved in this archive], the parser
should separate the strings on both sides of the tag with a space. Using
square brackets, "[" or "]", to separate the strings would also work, as
they are already used for the text of the ALT attribute of images.

A workaround for this bug is to add spaces when authoring the HTML code,
but that is not always possible when the documents are created by somebody
else.

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
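
The behavior requested in the report can be sketched in Java as follows. This is not the code of the Lucene demo HTML parser; it is a minimal regex-based illustration of replacing certain tags with a space while dropping all other tags. The class name WordBreakTagStripper, the method extractText, and the contents of BREAK_TAGS are assumptions made for this example only, since the tag list in the report was not preserved.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordBreakTagStripper {

    // Hypothetical set of tags treated as word breaks. The list in the
    // report was not preserved, so these names are assumptions.
    private static final Set<String> BREAK_TAGS = new HashSet<>(
            Arrays.asList("br", "p", "hr", "div", "li", "td", "th", "tr"));

    // Matches an opening or closing tag and captures its name.
    private static final Pattern TAG =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^>]*>");

    /** Strips tags from html, emitting a space in place of each break tag. */
    static String extractText(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1).toLowerCase(Locale.ROOT);
            // Break tags become a single space; all other tags vanish.
            m.appendReplacement(out, BREAK_TAGS.contains(name) ? " " : "");
        }
        m.appendTail(out);
        // Collapse runs of white space produced by adjacent break tags.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(extractText("abc<br>xyz"));          // prints: abc xyz
        System.out.println(extractText("a<b>b</b>c<p>d</p>"));  // prints: abc d
    }
}
```

With this approach "abc" and "xyz" are indexed as separate tokens, while inline markup such as <b> inside a word still leaves the word intact. A real parser would also need to handle comments, scripts, and entities, which this sketch ignores.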