poi-user mailing list archives

From Parker Thompson <park...@archive.org>
Subject extracting hrefs
Date Thu, 03 Jul 2003 17:58:35 GMT

I am trying to figure out whether POI's HDF component will do what I need,
and am hoping someone here has some experience or insight.

Background: I'm working on a web crawler in Java, and we're hoping to be
able to get links out of Word documents (among other formats).  Our primary
concern is coverage: we want to get everything.  To a lesser degree, we are
also concerned about efficiency.

My basic question, and I apologize that it's not more specific (I blame it
on the scant javadocs), is whether HDF is well-suited for this at all, and
even if it is, whether it might be overkill.  For example, it seems like
the Java equivalent of 'strings <file>' plus a regexp might be good enough,
but this might miss things like relative links.
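
To make the 'strings <file>' plus regexp idea concrete, here is a minimal
sketch of it in plain Java with no POI calls.  The class name, method
names, minimum run length, and URL pattern are all hypothetical choices
made for illustration.  It assumes that text in a Word file may be stored
either one byte per character or as 16-bit little-endian characters, so
it scans both ways.  It only finds absolute http/https/ftp URLs, so the
relative-link concern above still applies.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: a rough 'strings'-style scan followed by a URL regexp. */
final class StringsLinkScan {
    // Illustrative pattern only: absolute URLs ending at whitespace, quotes, or angle brackets.
    private static final Pattern URL =
            Pattern.compile("(?i)\\b(?:https?|ftp)://[^\\s\"<>]+");
    // Shortest run worth keeping, mirroring the default of the Unix 'strings' tool.
    private static final int MIN_RUN = 4;

    /**
     * Collects runs of printable ASCII characters.
     * stride 1 reads one byte per character; stride 2 reads 16-bit
     * little-endian characters whose high byte is zero.
     */
    static List<String> printableRuns(byte[] data, int offset, int stride) {
        List<String> runs = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = offset; i + stride <= data.length; i += stride) {
            int low = data[i] & 0xFF;
            boolean highByteZero = stride == 1 || data[i + 1] == 0;
            if (low >= 0x20 && low < 0x7F && highByteZero) {
                run.append((char) low);
            } else {
                if (run.length() >= MIN_RUN) {
                    runs.add(run.toString());
                }
                run.setLength(0);
            }
        }
        if (run.length() >= MIN_RUN) {
            runs.add(run.toString());
        }
        return runs;
    }

    /** Returns the distinct URLs found in the raw bytes, in first-seen order. */
    static Set<String> extractUrls(byte[] data) {
        List<String> runs = new ArrayList<>();
        runs.addAll(printableRuns(data, 0, 1)); // 8-bit text
        runs.addAll(printableRuns(data, 0, 2)); // 16-bit text at even offsets
        runs.addAll(printableRuns(data, 1, 2)); // 16-bit text at odd offsets
        Set<String> urls = new LinkedHashSet<>();
        for (String run : runs) {
            Matcher m = URL.matcher(run);
            while (m.find()) {
                urls.add(m.group());
            }
        }
        return urls;
    }
}
```

A crawler could pass the raw bytes of a fetched document to
extractUrls(...) and iterate over the result.  This trades accuracy for
simplicity: it knows nothing about the document structure, so it can pick
up stale or deleted text and will miss anything not stored as plain
characters.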

In the best case, I'd have a class (or classes) that let me fetch an
array of all the URIs in a Word doc, which I could then iterate through.

Thanks in advance for any suggestions,

Parker Thompson
The Internet Archive
