nutch-user mailing list archives

From Jason Manfield <>
Subject crawling PDF file with page links?
Date Wed, 18 May 2005 18:40:20 GMT
Can Nutch (with its out-of-the-box PDFBox plugin) crawl PDF files in which each page is a
separate link (e.g. the URL appends &PGN=pageNumber to go to a specific page)? In the browser,
each page of the PDF file is loaded on demand. However, when the content is fetched from the URL
(from code), it looks like not all the pages are fetched. Even when the PDF is saved from
the browser (with Save As), not all pages are saved: Acrobat Reader can open only
one page and gives errors (cannot find link) for the other pages. Examining the PDF file in
Notepad, I found GoToR tags for each page, indicating the destination (in binary
form, though) of that page.
Any ideas on how to extract everything from the PDF?
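
To make the two observations above concrete, here is a minimal sketch in Python (not Nutch or PDFBox code). It assumes the GoToR entries are stored uncompressed, as the Notepad inspection suggests, and that page numbers passed in PGN start at 1. The function names, the example URL, and the idea that each per-page URL can be fetched separately are all hypothetical, for illustration only.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Matches the PDF name /GoToR (remote go-to action) in raw, uncompressed bytes.
# Entries inside compressed object streams would not be found this way.
GOTOR = re.compile(rb"/GoToR\b")


def count_gotor_actions(pdf_bytes):
    """Count /GoToR names visible in the raw bytes of a saved PDF."""
    return len(GOTOR.findall(pdf_bytes))


def page_urls(base_url, page_count, param="PGN"):
    """Build one URL per page by setting the page parameter (assumed 1-based)."""
    parts = urlsplit(base_url)
    # Keep every existing query parameter except a stale page parameter.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    urls = []
    for page in range(1, page_count + 1):
        new_query = urlencode(query + [(param, str(page))])
        urls.append(urlunsplit(
            (parts.scheme, parts.netloc, parts.path, new_query, parts.fragment)))
    return urls
```

If the GoToR count of the saved file matches the number of missing pages, that would hint that the other pages live in separate remote files, so a crawler would need the per-page URLs as seeds rather than the single document URL. This is only a diagnostic idea, not a confirmed fix.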
