Date: Sat, 24 Aug 2002 15:54:57 +0100 (BST)
From: Keith Gunn
To: Lucene Users List
Subject: Re: Parsers
In-Reply-To: <3D671049.2030000@robosoftin.com>

Although no one else seems to have come across any problems, the HTML
parser that came with Lucene did not operate efficiently enough for me,
so I found an alternative at http://htmlparser.sourceforge.net

For simple text, the standard Java API is all you need to read the files.

For PDF, several APIs are available; I recommend www.pdfbox.org

I had no luck finding APIs for MS Word or RTF, but there are plenty of
tools that can do the job.

On Sat, 24 Aug 2002, Pradeep Kumar K wrote:

> Hi friends
>
> I need parsers for the following file formats
> 1. HTML
> 2. PDF
> 3. MSWord
> 4. RTF
> 4. Simple text
>
> Do any body developed parsers( in java) for all/any of the file formats?
> If you have please tell me the links so that I can download.
>
> Thanks in Advance
> Pradeep
>
> --------------------------------------------------------------
> Robosoft Technologies - Partners in Product Development
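
On the simple-text case in the reply above: the remark that the Java API is
all you need could look roughly like the sketch below. It is only an
illustration, not code from this thread. The class name PlainTextExtractor
and the method extractText are made up for the example, and it assumes the
goal is just to get the file contents as one String that can then be handed
to an indexer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper: the class and method names are illustrative only.
public class PlainTextExtractor {

    /**
     * Reads every line from the given reader and returns the text
     * with each line terminated by a single '\n'.
     */
    public static String extractText(Reader source) {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers are not forced to handle it
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // A StringReader stands in for a real file in this demo
        String text = extractText(new StringReader("first line\nsecond line"));
        System.out.print(text);
    }
}
```

For a real file, a java.io.FileReader (or any other Reader) can be passed in
place of the StringReader. The other formats in the list need a dedicated
library, as noted in the reply.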