From Parker Thompson <park...@archive.org>
Subject Re: extracting hrefs
Date Fri, 04 Jul 2003 00:08:32 GMT
FYI, I was making this harder than it needed to be.  I'm posting my
solution here for anyone who wants it, and I welcome any suggestions
for improvement:

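import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

import org.apache.poi.hdf.extractor.WordDocument;

public class PoiDocParser {

	public static void main(String[] args){

		Writer out = new StringWriter();
		WordDocument w;
		try{
			w = new WordDocument("/home/parkert/test.doc");
			// Assumed reconstruction (this call was lost from the post):
			// dump the document text, field codes included, into the writer.
			w.writeAllText(out);
		}catch(IOException e){
			e.printStackTrace();
			return;
		}

		String page = out.toString();
		int currentPos = -1;
		int linkStart = -1;
		int linkEnd = -1;
		char quote = '\"';
		// Each link shows up in the text as a field: HYPERLINK "target"
		currentPos = page.indexOf("HYPERLINK");
		while(currentPos >= 0){

			linkStart = page.indexOf(quote, currentPos) + 1;
			linkEnd = page.indexOf(quote, linkStart);
			if(linkStart == 0 || linkEnd < 0){
				break; // marker without a complete quoted target
			}
			String hyperlink = page.substring(linkStart, linkEnd);
			System.out.println("link: '" + hyperlink + "'");
			currentPos = page.indexOf("HYPERLINK", linkEnd + 1);
		}
	}
}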

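For anyone who wants to reuse the scan without POI on the classpath, here
is a minimal sketch of the same HYPERLINK "target" loop pulled into a
method that returns a list instead of printing.  The class name
HyperlinkScanner, the method extractLinks, and the sample string are made
up for illustration; they are not part of POI, and the sketch assumes the
extracted text really does contain each link as HYPERLINK followed by a
quoted target.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: names are illustrative and not part of POI.
final class HyperlinkScanner {

    private static final String MARKER = "HYPERLINK";

    /** Returns the quoted target that follows each HYPERLINK marker in the text. */
    static List<String> extractLinks(String text) {
        List<String> links = new ArrayList<>();
        int pos = text.indexOf(MARKER);
        while (pos >= 0) {
            int open = text.indexOf('"', pos + MARKER.length());
            if (open < 0) {
                break; // no opening quote after the marker
            }
            int close = text.indexOf('"', open + 1);
            if (close < 0) {
                break; // opening quote is never closed
            }
            links.add(text.substring(open + 1, close));
            pos = text.indexOf(MARKER, close + 1);
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the text extracted from a Word file.
        String sample = "intro HYPERLINK \"http://www.archive.org/\" text "
                + "HYPERLINK \"docs/page.html\" end";
        for (String link : extractLinks(sample)) {
            System.out.println("link: '" + link + "'");
        }
    }
}
```

Returning a list keeps the caller free to resolve relative targets such as
docs/page.html against the document's own URL before queueing them.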

Parker Thompson
The Internet Archive

On Thu, 3 Jul 2003, Parker Thompson wrote:

|I am trying to figure out whether POI's HDF stuff will do what I need and 
|am hoping someone here has some experience/insight.
|Background: I'm working on a web crawler in java and we're hoping to be
|able to get links out of word documents (among others).  Our primary
|concern is coverage, we want to get everything, but we are also concerned
|about efficiency to a lesser degree.
|My basic question, and I apologize that it's not more specific (I blame it
|on the scant javadocs), is whether the hdf stuff is well-suited for this
|at all, and even if it is, whether it might be overkill.  For example, it
|seems like the java equivalent of 'strings <file>' and a regexp might be
|good enough, but this might miss things like relative links.
|In the best-case I'd have a class/classes that allowed me to fetch an
|array of all URIs in a word doc, which I could then iterate through.
|Thanks in advance for any suggestions,

