Date: Thu, 3 Jul 2003 17:08:32 -0700 (PDT)
From: Parker Thompson
To: POI Users List
Subject: Re: extracting hrefs

FYI, I was making this harder than it needed to be. I'm posting my solution
here for anyone who wants it, and I'm open to suggestions for improvement:

    import java.io.IOException;
    import java.io.StringWriter;
    import java.io.Writer;

    // HDF's WordDocument; the package name is assumed here.
    import org.apache.poi.hdf.extractor.WordDocument;

    public class PoiDocParser {

        public static void main(String[] args) {
            // Dump all of the document's text into a string.
            Writer out = new StringWriter();
            try {
                WordDocument w = new WordDocument("/home/parkert/test.doc");
                w.writeAllText(out);
            } catch (IOException e) {
                e.printStackTrace();
            }
            String page = out.toString();

            // Each link appears in the text as: HYPERLINK "target"
            char quote = '"';
            int currentPos = page.indexOf("HYPERLINK");
            while (currentPos >= 0) {
                int linkStart = page.indexOf(quote, currentPos) + 1;
                int linkEnd = page.indexOf(quote, linkStart);
                if (linkStart == 0 || linkEnd < 0) {
                    break; // no complete quoted target after this HYPERLINK
                }
                String hyperlink = page.substring(linkStart, linkEnd);
                System.out.println("link: '" + hyperlink + "'");
                currentPos = page.indexOf("HYPERLINK", linkEnd + 1);
            }
        }
    }

later,

pt.

--
Parker Thompson
The Internet Archive
510.541.0125

On Thu, 3 Jul 2003, Parker Thompson wrote:

|Hello,
|
|I am trying to figure out whether POI's HDF stuff will do what I need and
|am hoping someone here has some experience/insight.
|
|Background: I'm working on a web crawler in java and we're hoping to be
|able to get links out of word documents (among others). Our primary
|concern is coverage, we want to get everything, but we are also concerned
|about efficiency to a lesser degree.
|
|My basic question, and I apologize that it's not more specific (I blame it
|on the scant javadocs), is whether the hdf stuff is well-suited for this
|at all, and even if it is, whether it might be overkill. For example, it
|seems like the java equivalent of 'strings ' and a regexp might be
|good enough, but this might miss things like relative links.
|
|In the best-case I'd have a class/classes that allowed me to fetch an
|array of all URIs in a word doc, which I could then iterate through.
|
|Thanks in advance for any suggestions,
|
|pt.
|
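
The "fetch an array of all URIs" idea from the original question could be
sketched with a regexp over the extracted text, roughly as below. This is only
a sketch: the class and method names (HyperlinkFieldExtractor, extractLinks)
are made up for illustration and are not part of POI, and it assumes the text
has already been pulled out of the document with each link appearing as
HYPERLINK "target".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a POI class: collects the quoted targets of
// HYPERLINK fields from text that was already extracted from a document.
class HyperlinkFieldExtractor {

    // HYPERLINK, whitespace, then a double-quoted target captured in group 1.
    private static final Pattern FIELD =
            Pattern.compile("HYPERLINK\\s+\"([^\"]*)\"");

    // Returns the targets in document order; relative targets are returned
    // exactly as written, without being resolved against any base URI.
    static List<String> extractLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for the output of writeAllText().
        String sample = "intro HYPERLINK \"http://www.archive.org/\" text "
                + "HYPERLINK \"docs/index.html\" end";
        for (String link : extractLinks(sample)) {
            System.out.println("link: '" + link + "'");
        }
    }
}
```

Unlike the indexOf loop, this only accepts a quoted target that directly
follows the keyword, so a HYPERLINK with no quoted target is skipped rather
than picking up the next quoted string further along in the text.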