From: "Mark Benussi"
Subject: RE: best html parser for html documents generated by microsoft products
Date: Sun, 4 Dec 2005 00:22:09 -0000
I use JTidy too, but not for Lucene parsing. There is no easy way of handling this; you simply have to remove all the crappy Microsoft inserts as they come.

-----Original Message-----
From: Gaston [mailto:gasi@artentis.com]
Sent: 03 December 2005 13:49
To: java-user@lucene.apache.org
Subject: best html parser for html documents generated by microsoft products

Hello,

JTidy is a very good HTML parser, but it is not optimal for HTML pages produced with Microsoft Office products such as Word, because it returns Microsoft-specific HTML tags instead of only the text.

Or should I treat HTML pages whose source begins with " " as XML files and use an XML parser instead of an HTML parser? I think it should be an HTML page because of "".

I am glad for every kind

Greetings
Gaston

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
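
As a rough illustration of the advice above (removing the Microsoft inserts before handing the page to a parser), here is a minimal Java sketch. It is not part of JTidy or Lucene; the class name WordHtmlCleaner, the method stripOfficeMarkup, and the choice of patterns are assumptions made for this example. It covers only a few common kinds of Word-generated markup (conditional comments, namespaced tags such as <o:p>, and Mso* class attributes), so real documents will likely need more rules.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a JTidy or Lucene class: strips some common
// Word-specific markup with regular expressions before real HTML parsing.
final class WordHtmlCleaner {

    // Hidden conditional comments, e.g. <!--[if gte mso 9]> ... <![endif]-->
    private static final Pattern HIDDEN_CONDITIONAL =
            Pattern.compile("<!--\\[if.*?<!\\[endif\\]-->", Pattern.DOTALL);

    // Bare conditional markers, e.g. <![if !supportLists]> and <![endif]>;
    // only the markers are removed, the text between them is kept.
    private static final Pattern CONDITIONAL_MARKER =
            Pattern.compile("<!\\[(?:if[^\\]]*|endif)\\]>");

    // Namespaced tags such as <o:p>, </o:p>, <w:WordDocument>, <v:shape ...>;
    // only the tags are removed, not the text they enclose.
    private static final Pattern NAMESPACED_TAG =
            Pattern.compile("</?[a-zA-Z][a-zA-Z0-9]*:[^>]*>");

    // class attributes whose value starts with "Mso", quoted or unquoted.
    private static final Pattern MSO_CLASS = Pattern.compile(
            "\\s+class=(?:\"Mso[^\"]*\"|'Mso[^']*'|Mso[^\\s>]*)",
            Pattern.CASE_INSENSITIVE);

    private WordHtmlCleaner() {
    }

    static String stripOfficeMarkup(String html) {
        String out = HIDDEN_CONDITIONAL.matcher(html).replaceAll("");
        out = CONDITIONAL_MARKER.matcher(out).replaceAll("");
        out = NAMESPACED_TAG.matcher(out).replaceAll("");
        out = MSO_CLASS.matcher(out).replaceAll("");
        return out;
    }

    public static void main(String[] args) {
        String sample = "<p class=MsoNormal>Some text<o:p></o:p></p>";
        System.out.println(stripOfficeMarkup(sample));
    }
}
```

The cleaned string can then be passed to whichever HTML parser is already in use. Regular expressions are a blunt tool for markup, so this is best treated as a pre-filter that grows rule by rule as new inserts turn up, which matches the "as they come" approach described in the reply.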