pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Get content of a specific object
Date Thu, 27 Aug 2015 20:11:12 GMT
On 27.08.2015 at 20:17, Roberto Nibali wrote:
> Hi Maruan
>
>>> And again thanks heaps for your suggestion. It pointed me exactly towards
>>> the right direction. Solved it using the following code:
>>>
>>> PDFTextStripper pdfTextStripper = new PDFTextStripper();
>>> String text = pdfTextStripper.getText(srcDoc);
>>> String textNormalized = text.replaceAll("\\n", " ").replaceAll("\\s{2}", " ");
>>> List<String> metaData = getMetaData(textNormalized);
>>> metaData.forEach(s -> System.out.printf("%s = %s%n", s.split("=")));
>>>
>>> public static List<String> getMetaData(String largeText){
>>>     Pattern pattern = Pattern.compile("\\$\\$.*=.*\\s");
>>>     Matcher mtch = pattern.matcher(largeText);
>>>     List<String> entries = new ArrayList<>();
>>>     while (mtch.find()) {
>>>         entries.add(mtch.group());
>>>     }
>>>     return entries;
>>> }
>>>
>>> Works like a charm!
>>>
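
One caveat about the quoted pattern: once the newlines have been replaced
by spaces, the greedy ".*" may run across several "$$key=value" entries and
return them as a single match, depending on the line separators in the
extracted text. Below is a minimal sketch of a stricter variant, assuming
that keys and values contain no whitespace. The names MetaDataSketch and
ENTRY are only illustrative and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaDataSketch {
    // Matches "$$key=value". Assumption: neither key nor value contains
    // whitespace, and the key contains no '='.
    private static final Pattern ENTRY =
            Pattern.compile("\\$\\$([^\\s=]+)=(\\S*)");

    // Returns one {key, value} pair per "$$key=value" entry found in text.
    static List<String[]> getMetaData(String text) {
        List<String[]> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            entries.add(new String[] { m.group(1), m.group(2) });
        }
        return entries;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by PDFTextStripper.getText().
        String text = "Header\n$$author=Jane\n$$title=Report\nBody text";
        // Collapse every run of whitespace (including \r\n) into one space.
        String normalized = text.replaceAll("\\s+", " ");
        for (String[] e : getMetaData(normalized)) {
            System.out.printf("%s = %s%n", e[0], e[1]);
        }
    }
}
```

Capturing key and value as groups also avoids the later split("="), which
would fail in printf if an entry ever lacked a value part.
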
>>> Question: Would it be possible to extract the text only from one page
>>> (the first one) via the PDFTextStripper API?
>> you can use PDFTextStripper.setStartPage() and PDFTextStripper.setEndPage()
>>
>>
> Indeed, and it works wonderfully. Now I know why PDFTextStripper has all
> those methods ;). Why not just convert the class to the Builder pattern?
> Anyway, it works for my case. Strangely enough, PDFTextStripper numbers
> pages from 1 (setStartPage(1) is the first page), while
> PDDocument.getPage() is 0-based (getPage(0) is the first page).

Yes... too late to change that now.

> I also did not figure out the semantics of setParagraphStart(String ...).

I suspect it is meant for derived classes, e.g. PDFText2HTML.

Tilman



