Mailing-List: contact poi-user-help@jakarta.apache.org; run by ezmlm
Date: Thu, 3 Jul 2003 17:08:32 -0700 (PDT)
From: Parker Thompson <parkert@archive.org>
To: POI Users List <poi-user@jakarta.apache.org>
Subject: Re: extracting hrefs
In-Reply-To: <Pine.LNX.4.33.0307031040270.3442-100000@homeserver.archive.org>
Message-ID: <Pine.LNX.4.33.0307031706210.3442-100000@homeserver.archive.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: O

FYI, I was making this harder than it needed to be.  I'm posting my
solution here for anyone who wants it, and I welcome any suggestions
for improvement:

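import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Assumed package for WordDocument (POI's HDF extractor); adjust if yours differs.
import org.apache.poi.hdf.extractor.WordDocument;

public class PoiDocParser {

	public static void main(String[] args){

		Writer out = new StringWriter();

		try{
			// Dump all of the document's text, field codes included.
			WordDocument w = new WordDocument("/home/parkert/test.doc");
			w.writeAllText(out);
		}catch(IOException e){
			e.printStackTrace();
			return; // nothing to scan if the document could not be read
		}

		String page = out.toString();
		char quote = '"';

		// Each link shows up in the text as: HYPERLINK "target"
		int currentPos = page.indexOf("HYPERLINK");
		while(currentPos >= 0){

			int quotePos = page.indexOf(quote, currentPos);
			if(quotePos < 0){
				break; // no opening quote left in the text
			}
			int linkStart = quotePos + 1;
			int linkEnd = page.indexOf(quote, linkStart);
			if(linkEnd < 0){
				break; // opening quote is never closed
			}

			String hyperlink = page.substring(linkStart, linkEnd);
			System.out.println("link: '" + hyperlink + "'");

			// Resume the search after the closing quote.
			currentPos = page.indexOf("HYPERLINK", linkEnd + 1);
		}
	}
}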
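
One possible refinement, offered only as a sketch: the scan could be pulled
out into a method that returns the links as a list, which is closer to the
"array of all URIs" asked for in the original question. The class name
HyperlinkFields and the method extract are hypothetical (they are not part
of POI), and the regular expression assumes the quoted target follows the
HYPERLINK keyword directly, so a field with switches between the keyword
and the quote would be skipped, unlike in the loop above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of POI: works on text that has already
// been extracted from the document (e.g. the StringWriter contents).
class HyperlinkFields {

    // The keyword, optional whitespace, then a double-quoted target.
    private static final Pattern FIELD =
            Pattern.compile("HYPERLINK\\s*\"([^\"]*)\"");

    // Returns every quoted target, in document order.
    static List<String> extract(String text) {
        List<String> links = new ArrayList<String>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for the extracted document.
        String sample = "see HYPERLINK \"http://www.archive.org/\" and "
                + "HYPERLINK \"docs/index.html\" here";
        for (String link : extract(sample)) {
            System.out.println("link: '" + link + "'");
        }
    }
}
```

Relative targets such as docs/index.html come back unchanged, so a crawler
would still need to resolve them against the document's own location.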

later,

pt.
-- 
Parker Thompson
The Internet Archive
510.541.0125

On Thu, 3 Jul 2003, Parker Thompson wrote:

|Hello,
|
|I am trying to figure out whether POI's HDF stuff will do what I need and 
|am hoping someone here has some experience/insight.
|
|Background: I'm working on a web crawler in Java and we're hoping to be
|able to get links out of Word documents (among others).  Our primary
|concern is coverage (we want to get everything), but we are also
|concerned about efficiency, to a lesser degree.
|
|My basic question, and I apologize that it's not more specific (I blame it
|on the scant javadocs), is whether the HDF stuff is well-suited for this
|at all, and even if it is, whether it might be overkill.  For example, it
|seems like the Java equivalent of 'strings <file>' and a regexp might be
|good enough, but this might miss things like relative links.
|
|In the best-case I'd have a class/classes that allowed me to fetch an
|array of all URIs in a word doc, which I could then iterate through.
|
|Thanks in advance for any suggestions,
|
|pt.
|