Message-ID: <11605088.1150407630147.JavaMail.jira@brutus>
Date: Thu, 15 Jun 2006 21:40:30 +0000 (GMT+00:00)
From: "Daniel Naber (JIRA)"
To: java-dev@lucene.apache.org
Reply-To: java-dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-590) Demo HTML parser gives incorrect summaries when title is repeated as a heading
In-Reply-To: <515860.1149694410298.JavaMail.jira@brutus>

     [ http://issues.apache.org/jira/browse/LUCENE-590?page=all ]

Daniel Naber updated LUCENE-590:
--------------------------------

    Description:
If you have an html document
where the title is repeated as a heading at the top of the document, the HTMLParser will return the title as the summary, ignoring everything else that was added to the summary. Instead, it should keep the rest of the summary and chop off the title part at the beginning (essentially the opposite). I don't see any benefit to repeating the title in the summary for any case.

In HTMLParser.jj's getSummary():

    String sum = summary.toString().trim();
    String tit = getTitle();
    if (sum.startsWith(tit) || sum.equals(""))
      return tit;
    else
      return sum;

change it to: (* denotes a line that has changed)

    String sum = summary.toString().trim();
    String tit = getTitle();
*   if (sum.startsWith(tit))             // don't repeat title in summary
*     return sum.substring(tit.length()).trim();
    else
      return sum;

        was:
If you have an html document where the title is repeated as a heading at the top of the document, the HTMLParser will return the title as the summary, ignoring everything else that was added to the summary. Instead, it should keep the rest of the summary and chop off the title part at the beginning (essentially the opposite). I don't see any benefit to repeating the title in the summary for any case.

In HTMLParser.jj's getSummary():

    String sum = summary.toString().trim();
    String tit = getTitle();
    if (sum.startsWith(tit) || sum.equals(""))
      return tit;
    else
      return sum;

change it to: (* denotes a line that has changed)

    String sum = summary.toString().trim();
    String tit = getTitle();
*   if (sum.startsWith(tit))             // don't repeat title in summary
*     return sum.substring(tit.length()).trim();
    else
      return sum;

       Priority: Minor  (was: Major)

decrease priority (affects demo only)

> Demo HTML parser gives incorrect summaries when title is repeated as a heading
> ------------------------------------------------------------------------------
>
>          Key: LUCENE-590
>          URL: http://issues.apache.org/jira/browse/LUCENE-590
>      Project: Lucene - Java
>         Type: Bug
>   Components: Examples
>     Versions: 2.0.0
>     Reporter: Curtis d'Entremont
>     Priority: Minor
>
> If you have an html document where the title is repeated as a heading at the top of the document, the HTMLParser will return the title as the summary, ignoring everything else that was added to the summary. Instead, it should keep the rest of the summary and chop off the title part at the beginning (essentially the opposite). I don't see any benefit to repeating the title in the summary for any case.
> In HTMLParser.jj's getSummary():
>     String sum = summary.toString().trim();
>     String tit = getTitle();
>     if (sum.startsWith(tit) || sum.equals(""))
>       return tit;
>     else
>       return sum;
> change it to: (* denotes a line that has changed)
>     String sum = summary.toString().trim();
>     String tit = getTitle();
> *   if (sum.startsWith(tit))             // don't repeat title in summary
> *     return sum.substring(tit.length()).trim();
>     else
>       return sum;

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
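
For anyone who wants to try the two getSummary() variants from the description side by side, here is a minimal standalone sketch. It is not the actual HTMLParser.jj code: the class name SummaryDemo, the method names originalSummary and proposedSummary, and the sample strings are all hypothetical, and the summary and title are passed in as plain parameters instead of being read from the parser (an assumption made only so the snippet runs on its own).

```java
public class SummaryDemo {
    // Logic quoted in the issue as the current behavior: the title is
    // returned whenever the summary starts with it (or is empty), so the
    // rest of the summary is lost.
    static String originalSummary(String summary, String title) {
        String sum = summary.trim();
        if (sum.startsWith(title) || sum.equals(""))
            return title;
        else
            return sum;
    }

    // Logic proposed in the issue: drop the leading title and keep the rest.
    static String proposedSummary(String summary, String title) {
        String sum = summary.trim();
        if (sum.startsWith(title))   // don't repeat title in summary
            return sum.substring(title.length()).trim();
        else
            return sum;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not taken from the issue.
        String title = "Release Notes";
        String text = "Release Notes This page lists the changes.";

        System.out.println(originalSummary(text, title)); // Release Notes
        System.out.println(proposedSummary(text, title)); // This page lists the changes.

        // With an empty summary the proposed code returns "" where the
        // original returned the title.
        System.out.println("[" + proposedSummary("", title) + "]"); // []
    }
}
```

Note that the proposed change also drops the sum.equals("") check, so an empty summary, or a summary consisting only of the title, now yields an empty string where the original code returned the title. Whether that is wanted for the demo is not stated in the issue.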