tika-user mailing list archives

From "McGibbney, Lewis John" <Lewis.McGibb...@gcu.ac.uk>
Subject content extraction for pdf links
Date Thu, 20 Jan 2011 12:33:29 GMT
Hello list,

I have been using Nutch 1.2 to crawl a small number of highly relevant HTML pages, along with
the URLs on those pages that point to PDF documents. I have then been using Luke v1.0.1 to
inspect my index and confirm that the specific PDF documents linked from these pages have been
indexed. When I search the index through my web application interface, each relevant hit returns
a hyperlink (among other information). I intend to implement a content extraction mechanism
that also returns relevant text from within the indexed PDF documents whenever a user submits
a query. For example, if someone submitted a query relating to a clause in a legal document,
the content extraction tool would parse the PDF file and show a snippet of the relevant text
from that PDF in the search result.
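
To make the goal concrete, here is a rough sketch of the snippet step only. It assumes the
PDF text has already been extracted into a plain String (for instance with the Tika facade's
parseToString method, which is not called here so the sketch needs no dependencies). The
class name SnippetSketch, the snippet method and the radius parameter are made up for
illustration and are not part of Nutch, Lucene or Tika.

```java
public class SnippetSketch {

    /**
     * Returns up to `radius` characters on either side of the first
     * case-insensitive occurrence of `term` in `text`, or an empty
     * string when the term does not occur.
     */
    static String snippet(String text, String term, int radius) {
        if (text == null || term == null || term.isEmpty()) {
            return "";
        }
        // Scan for the first case-insensitive match of the query term.
        int hit = -1;
        for (int i = 0; i + term.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, term, 0, term.length())) {
                hit = i;
                break;
            }
        }
        if (hit < 0) {
            return "";
        }
        // Clamp the context window to the bounds of the text.
        int start = Math.max(0, hit - radius);
        int end = Math.min(text.length(), hit + term.length() + radius);
        // Collapse line breaks and repeated spaces left over from extraction.
        String body = text.substring(start, end).replaceAll("\\s+", " ").trim();
        // Mark truncation on whichever side was cut.
        return (start > 0 ? "..." : "") + body + (end < text.length() ? "..." : "");
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a PDF (hypothetical content).
        String extracted = "This agreement is void. Clause 4.2 limits liability\n"
                + "of the supplier to direct losses only.";
        System.out.println(snippet(extracted, "clause 4.2", 10));
    }
}
```

A real search front end would more likely build snippets from stored field text at query
time, so treat this purely as an illustration of the behaviour I am after.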

I hope I have explained my problem properly. I am posting here because I have been aware for
some time that Tika is possibly the solution, but I am only now getting round to working on
this.

Thank you

Lewis


Glasgow Caledonian University is a registered Scottish charity, number SC021474

Winner: Times Higher Education's Widening Participation Initiative of the Year 2009 and Herald
Society's Education Initiative of the Year 2009
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
