From: sergiu gordea
To: Lucene Users List
Date: Thu, 03 Feb 2005 11:17:46 +0100
Subject: Re: which HTML parser is better?

Karl Koch wrote:

>I apologise in advance if some of my writing here has been said before.
>The last three answers to my question have been suggesting pattern matching
>solutions and Swing. Pattern matching was introduced in Java 1.4 and Swing
>is something I cannot use since I work with Java 1.1 on a PDA.

I see. In this case you can read your HTML file line by line and write
something like this:

    // reader is a java.io.BufferedReader opened on the HTML file
    String line;
    int startPos, endPos;
    StringBuffer text = new StringBuffer();
    while ((line = reader.readLine()) != null) {
        startPos = line.indexOf('>');              // end of the first tag
        endPos = line.indexOf('<', startPos + 1);  // start of the next tag
        if (startPos >= 0 && endPos > startPos) {
            // copy the text between the two tags, without the '>' itself
            text.append(line.substring(startPos + 1, endPos));
        }
    }

This is just sample code: it picks up the text between the first ">" on a
line and the next "<", so it should work when each line of the HTML file
holds a single element, e.g. <b>text</b>. It can be a starting point for you.

Hope it helps,

Best,

Sergiu

>I am wondering if somebody knows a piece of simple source code with low
>requirements which runs under this tight specification.
>
>Thank you all,
>Karl
>
>>No one has yet mentioned using ParserDelegator and ParserCallback that
>>are part of HTMLEditorKit in Swing. I have been successfully using
>>these classes to parse out the text of an HTML file. You just need to
>>extend HTMLEditorKit.ParserCallback and override the various methods
>>that are called when different tags are encountered.
>>
>>On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
>>
>>>Three HTML parsers (Lucene web application
>>>demo, CyberNeko HTML Parser, JTidy) are mentioned in
>>>Lucene FAQ
>>>1.3.27. Which is the best? Can it filter tags that are
>>>auto-created by MS-word 'Save As HTML files' function?
>>
>>--
>>Bill Tschumy
>>Otherwise -- Austin, TX
>>http://www.otherwise.com
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
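
If the HTML has several tags on one line, or a tag that wraps across lines,
the line-by-line sample in the reply above will miss text. One possible way
around that is to read character by character and keep only what lies
outside "<" and ">". The sketch below is only an illustration under stated
assumptions: the names TagStripper and stripTags are made up here, it assumes
java.io.Reader and java.io.StringReader are available on the device (both
date from Java 1.1), and it does not decode entities such as &amp; or skip
script, style, or comment content.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not part of Lucene or of any parser named in this thread.
class TagStripper {

    /** Returns the characters of the input that lie outside <...> markup. */
    public static String stripTags(Reader reader) throws IOException {
        StringBuffer text = new StringBuffer();
        boolean inTag = false; // true while we are between '<' and '>'
        int c;
        while ((c = reader.read()) != -1) {
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
                text.append(' '); // keep words on either side of a tag apart
            } else if (!inTag) {
                text.append((char) c);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Several tags on one line, which the line-by-line sample cannot handle.
        String html = "<p>Hello <b>world</b></p>";
        System.out.println(stripTags(new StringReader(html)));
    }
}
```

Each tag is replaced by a space so that words separated only by markup do
not run together when the text is handed to an analyzer. A ">" that appears
outside any tag is copied through unchanged.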