From: Gaston
Date: Sat, 03 Dec 2005 14:48:57 +0100
To: java-user@lucene.apache.org
Subject: best html parser for html documents generated by microsoft products
Message-ID: <4391A249.10508@artentis.com>

Hello,
JTidy is a very good HTML parser, but it is not optimal for HTML pages produced with Microsoft Office products such as Word, because it returns Microsoft-specific HTML tags instead of only the text.

Or should I treat HTML pages whose source begins with " " like XML files and use an XML parser instead of an HTML parser? I think it should be an HTML page because of "".

I would be glad of any help.

Greetings,
Gaston
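
To make the "only text" goal concrete, here is a minimal sketch of one possible approach, not a tested recipe for real Office output. It assumes you already have an `org.w3c.dom.Document` (JTidy's `Tidy.parseDOM` returns one) and walks it, keeping text nodes and skipping the content of `style`, `script`, and `xml` elements. The class name `OfficeHtmlText`, its method names, and the choice of skipped elements are hypothetical. To stay self-contained, the demo builds its DOM with the JDK XML parser from a small well-formed snippet; real Word output is usually not well-formed XML, so that helper only stands in for the HTML parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class OfficeHtmlText {

    /** Elements whose content is styling, code, or metadata (assumed list). */
    private static boolean isSkipped(String name) {
        String n = name.toLowerCase();
        return n.equals("style") || n.equals("script") || n.equals("xml");
    }

    /** Collects all text nodes below root, separated by single spaces. */
    public static String extractText(Node root) {
        StringBuilder sb = new StringBuilder();
        collect(root, sb);
        return sb.toString();
    }

    private static void collect(Node node, StringBuilder sb) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            // Normalize whitespace so layout in the source does not leak through.
            String text = node.getNodeValue().trim().replaceAll("\\s+", " ");
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
            return;
        }
        if (type == Node.ELEMENT_NODE && isSkipped(node.getNodeName())) {
            return; // drop the whole subtree
        }
        // Comments (including Office conditional comments) have no children,
        // so they contribute nothing here.
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, sb);
        }
    }

    /** Demo helper only: parses well-formed markup with the JDK XML parser. */
    public static Document parseWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("markup is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String html =
            "<html xmlns:o=\"urn:schemas-microsoft-com:office:office\">"
            + "<head><title>Report</title>"
            + "<style>p.MsoNormal {margin:0cm}</style></head>"
            + "<body><p class=\"MsoNormal\">Hello <b>World</b><o:p></o:p></p>"
            + "<!--[if gte mso 9]><xml>ignored</xml><![endif]-->"
            + "</body></html>";
        System.out.println(extractText(parseWellFormed(html)));
        // prints "Report Hello World"
    }
}
```

Because every text node is trimmed and joined with a space, a word split across inline tags would come out as two tokens; that is usually acceptable for indexing but worth knowing.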