pdfbox-users mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Extracting vector graphics from pdf
Date Mon, 27 Feb 2017 12:20:53 GMT
PDFBox Colleagues,
  Any recommendations?



-----Original Message-----
From: Andisa Dewi [mailto:theknights91@yahoo.com] 
Sent: Monday, February 27, 2017 5:32 AM
To: user@tika.apache.org
Subject: Extracting vector graphics from pdf

Hello guys,

I'm currently extracting images from a large number of PDF files, but some of the images (or
figures) are not extracted. I suspect this is because those images are vector graphics (as is
usually the case in scientific papers). My question is: is it possible to extract vector
graphics from PDFs using Tika?

I attached an example of the pdf (here for example, all images are extracted except Figure

The way I'm extracting the images is the same as in the example code:

Parser parser = new AutoDetectParser();
Metadata m = new Metadata();
ParseContext c = new ParseContext();
ContentHandler h = new BodyContentHandler(-1);
PDFParserConfig pdfConfig = new PDFParserConfig();
c.set(PDFParserConfig.class, pdfConfig);
c.set(Parser.class, parser);
EmbeddedDocumentExtractor ex = new MyEmbeddedDocumentExtractor(c);
c.set(EmbeddedDocumentExtractor.class, ex);
parser.parse(inputstream, h, m, c);



