pdfbox-users mailing list archives

From <csr198...@sina.com>
Subject ask for help
Date Mon, 23 Mar 2015 08:52:20 GMT
Dear Sir/Madam,
I'm a Chinese student, and I want to use PDFBox for research on PDF extraction.
Right now the most important thing for me is to extract structural information from PDFs. I
know PDFBox is very powerful, but I do not know how to extract this information from a PDF.
I have already extracted the plain text from a PDF using PDFBox, but plain text alone does not meet my needs.
For natural language processing I need to parse the PDF, so I need not only the text
but also the PDF's structure; that is, I need to read all the operators (tags) in a PDF,
such as Tj and Tm. PDFBox has many APIs, and I don't know how to get the operand values of
each operator in each PDF object. I would like to get structural information for every PDF
object, such as its font and font size, so that I can find patterns such as the largest
font size and use them to locate the title of each paper.
For objects that have a content stream, I would like to decode the stream. Then I can obtain
the patterns of those objects and classify them to find the
category I need.
Do you think this is possible?
Could you give me some examples of extracting from a PDF, especially extracting objects that have a stream,
finding the object with the largest font size, and decoding the stream? I hope you can provide some source code
for extracting PDFs with PDFBox that goes beyond stripper.getText().
Thanks a billion! I hope to hear from you soon!
Sincerely,
 
dock CHEN
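
To make the request above concrete, here is a minimal sketch of what "reading the Tf, Tm and Tj operators and finding the largest font size" could look like. It is plain Java and deliberately does not call PDFBox: it assumes the content stream has already been decoded to text, and the class and method names (ContentStreamSketch, TextRun, parse, largest) are hypothetical names chosen for illustration. It only understands literal (...) strings and the BT, Tf, Tm, Tj and TJ operators; hex strings, Td/TD/T*, the graphics-state matrix and font encodings are ignored. Within PDFBox itself, a more usual route is probably to subclass PDFTextStripper and read the font and size from the TextPosition objects it reports.

```java
import java.util.ArrayList;
import java.util.List;

public class ContentStreamSketch {

    /** One shown string plus the text state that was active when it was shown. */
    public static final class TextRun {
        public final String text;
        public final String fontName;
        public final double effectiveSize;

        TextRun(String text, String fontName, double effectiveSize) {
            this.text = text;
            this.fontName = fontName;
            this.effectiveSize = effectiveSize;
        }
    }

    private static boolean isDelimiter(char c) {
        return Character.isWhitespace(c) || "()[]/".indexOf(c) >= 0;
    }

    /** Walks an already-decoded content stream and records each Tj/TJ string. */
    public static List<TextRun> parse(String content) {
        List<TextRun> runs = new ArrayList<>();
        List<String> strings = new ArrayList<>();  // pending string operands
        List<Double> numbers = new ArrayList<>();  // pending numeric operands
        String pendingName = null;                 // pending name operand, e.g. /F1
        String fontName = null;
        double fontSize = 0;
        double scale = 1;                          // vertical scale taken from Tm
        int i = 0;
        int n = content.length();
        while (i < n) {
            char ch = content.charAt(i);
            if (Character.isWhitespace(ch) || ch == '[' || ch == ']') {
                i++;                               // array brackets are simply skipped
            } else if (ch == '(') {
                // Literal string: balanced parentheses; a backslash keeps the next char as-is.
                StringBuilder sb = new StringBuilder();
                int depth = 1;
                i++;
                while (i < n && depth > 0) {
                    char c = content.charAt(i++);
                    if (c == '\\' && i < n) {
                        sb.append(content.charAt(i++));
                    } else if (c == '(') {
                        depth++;
                        sb.append(c);
                    } else if (c == ')') {
                        if (--depth > 0) sb.append(c);
                    } else {
                        sb.append(c);
                    }
                }
                strings.add(sb.toString());
            } else {
                int start = i++;
                while (i < n && !isDelimiter(content.charAt(i))) i++;
                String token = content.substring(start, i);
                if (token.charAt(0) == '/') {
                    pendingName = token;
                } else if (token.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)")) {
                    numbers.add(Double.parseDouble(token));
                } else {
                    // Anything else is treated as an operator that consumes the pending operands.
                    if (token.equals("BT")) {
                        scale = 1;                 // text matrix resets to identity
                    } else if (token.equals("Tf") && pendingName != null && !numbers.isEmpty()) {
                        fontName = pendingName;
                        fontSize = numbers.get(numbers.size() - 1);
                    } else if (token.equals("Tm") && numbers.size() == 6) {
                        scale = Math.hypot(numbers.get(2), numbers.get(3));
                    } else if ((token.equals("Tj") || token.equals("TJ")) && !strings.isEmpty()) {
                        runs.add(new TextRun(String.join("", strings), fontName, fontSize * scale));
                    }
                    strings.clear();
                    numbers.clear();
                    pendingName = null;
                }
            }
        }
        return runs;
    }

    /** Returns the run with the largest effective size, or null if there are none. */
    public static TextRun largest(List<TextRun> runs) {
        TextRun best = null;
        for (TextRun r : runs) {
            if (best == null || r.effectiveSize > best.effectiveSize) best = r;
        }
        return best;
    }

    public static void main(String[] args) {
        String stream = "BT /F1 24 Tf 1 0 0 1 72 700 Tm (A Sample Title) Tj ET "
                      + "BT /F2 10 Tf 1 0 0 1 72 660 Tm [(Body ) -20 (text)] TJ ET";
        TextRun top = largest(parse(stream));
        System.out.println(top.text + " | " + top.fontName + " | " + top.effectiveSize);
    }
}
```

The largest run is only a heuristic candidate for the title; on real papers the strings would still have to be mapped through each font's encoding before they are readable text.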