pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: How to extract pdf content from a html page
Date Thu, 24 Aug 2017 19:15:56 GMT
On 24.08.2017 at 19:27, Aalok Agrawal wrote:
> I have written the following code:
>
> PDFTextStripper pdfStripper = null;
> PDDocument pdDoc = null;
> COSDocument cosDoc = null;
> String parsedText = null;
>
> URL url = new URL(strURL);
> BufferedInputStream file = new BufferedInputStream(url.openStream());
> PDFParser parser = new PDFParser(file);
>
> parser.parse();
> cosDoc = parser.getDocument();
> pdfStripper = new PDFTextStripper();
>
> pdDoc = new PDDocument(cosDoc);
> parsedText = pdfStripper.getText(pdDoc);
>
> But it is not fetching the content of the PDF embedded in the browser.

PDFBox can't communicate with your browser.

url.openStream()

means that the content served at that URL is fetched directly, independent of the browser.
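
One way to see what that URL really delivers is to look at the first bytes of the stream: a PDF file normally begins with the marker %PDF-, while an HTML wrapper page does not. The following is only a minimal sketch using the JDK alone; the class and method names (PdfSniffer, looksLikePdf, urlServesPdf) are hypothetical and not part of PDFBox.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
public class PdfSniffer {
    // A PDF file normally starts with this marker.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the given bytes begin with the PDF marker.
    public static boolean looksLikePdf(byte[] head) {
        if (head.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (head[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Fetches only the first few bytes from the URL and checks them.
    public static boolean urlServesPdf(String strURL) throws IOException {
        try (InputStream in = new BufferedInputStream(URI.create(strURL).toURL().openStream())) {
            byte[] head = new byte[PDF_MAGIC.length];
            int n = 0;
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break; // stream ended before the marker length was reached
                }
                n += r;
            }
            return n == head.length && looksLikePdf(head);
        }
    }
}
```

If urlServesPdf(strURL) returns false, the URL is probably delivering something other than the PDF itself, for example an HTML page.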

Could it be that the PDF is embedded within a web page, i.e. HTML on the outside and the PDF
in a smaller window or frame? Then you'd need to know the URL of that frame.
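
If that is the case, the frame URL can sometimes be read from the HTML of the outer page. The sketch below is only an illustration under the assumption that the PDF is referenced through an iframe, frame or embed tag with a src attribute, or an object tag with a data attribute. It uses a simple regular expression rather than a real HTML parser, so it will miss pages that build the frame with JavaScript. The class and method names (EmbeddedPdfFinder, findEmbeddedUrls) are hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
public class EmbeddedPdfFinder {
    // Matches src="..." on <iframe>, <frame>, <embed> and data="..." on <object>.
    private static final Pattern EMBED = Pattern.compile(
            "<(?:iframe|frame|embed)\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']"
            + "|<object\\b[^>]*?\\bdata\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the absolute URLs of all embedded resources found in the HTML.
    public static List<String> findEmbeddedUrls(String html, String pageUrl) {
        URI base = URI.create(pageUrl);
        List<String> urls = new ArrayList<>();
        Matcher m = EMBED.matcher(html);
        while (m.find()) {
            // Group 1 is the src attribute, group 2 the data attribute.
            String ref = m.group(1) != null ? m.group(1) : m.group(2);
            // Resolve relative references against the URL of the outer page.
            urls.add(base.resolve(ref).toString());
        }
        return urls;
    }
}
```

Each URL returned this way could then be opened with url.openStream() as in the code quoted above and handed to PDFBox (in PDFBox 2.0.x, for example, via PDDocument.load(InputStream)).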

Tilman


>
> On Thu, Aug 24, 2017 at 9:08 PM, Gilad Denneboom <gilad.denneboom@gmail.com>
> wrote:
>
>> If you don't know the file's URL or the path of the local temp file to
>> which it is saved I don't see how you could do it.
>>
>> On Thu, Aug 24, 2017 at 4:08 PM, Aalok Agrawal <aaloka@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am working on an application where a PDF is rendered in the browser.
>>> There is no .pdf extension in the URL.
>>>
>>> I have to read the content of the PDF and check some text. Is there any way
>>> to do that?
>>>
>>> Thanks
>>> Aalok Agrawal
>>>

