pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: How to extract pdf content from a html page
Date Fri, 25 Aug 2017 16:11:28 GMT
On 25.08.2017 at 11:45, Aalok Agrawal wrote:
> You got it right, the PDF is within a www page, and its URL is known and
> passed as a variable (strURL) to the function. I tried another approach to
> get the content of the PDF rendered there, but that is also not working -


Is the URL public and freely available? If yes, please mention it so I 
can test.

"but that is also not working" - what does that mean? Do you get an 
error message, nothing, a JVM crash, a BSOD, ...?

What is in that "download.pdf" file? Is this a PDF or is it not? Does it 
start with "%PDF" or not if you open the file with NOTEPAD++?

If it isn't, then it means that your PDF has a different URL. You'll 
have to look at the html / javascript source code to find out what is 
going on.
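
The "%PDF" check can also be done in code instead of in an editor. What follows is only a minimal sketch, not part of PDFBox: the class name PdfHeaderCheck and both method names are made up for illustration, and it assumes the downloaded file sits at "download.pdf" as in the quoted code unless a path is passed as an argument. It only looks at the first four bytes, so it tells you whether the server sent something that begins like a PDF, not whether the PDF is valid.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: checks whether a downloaded file begins with "%PDF".
class PdfHeaderCheck {
    private static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    // True if the given bytes begin with the ASCII characters "%PDF".
    static boolean startsWithPdfHeader(byte[] head) {
        if (head == null || head.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (head[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads the first four bytes of the file and checks them against "%PDF".
    static boolean looksLikePdf(Path file) throws IOException {
        byte[] head = new byte[PDF_MAGIC.length];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n == -1) {
                    break; // file is shorter than four bytes
                }
                total += n;
            }
        }
        return total == head.length && startsWithPdfHeader(head);
    }

    public static void main(String[] args) throws IOException {
        // "download.pdf" is the file name used in the quoted code; pass another path to override.
        Path file = Paths.get(args.length > 0 ? args[0] : "download.pdf");
        if (!Files.exists(file)) {
            System.out.println(file + " does not exist");
        } else if (looksLikePdf(file)) {
            System.out.println(file + " starts with %PDF");
        } else {
            System.out.println(file + " does not start with %PDF (possibly an HTML page)");
        }
    }
}
```

If this reports that the file does not start with %PDF, open it in a text editor: it is most likely the surrounding HTML page.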

Tilman



>
> byte[] ba1 = new byte[1024];
> int baLength;
> FileOutputStream fos1 = new FileOutputStream("download.pdf");
> URL url = new URL(strURL);
> URLConnection urlConn = url.openConnection();
>
> InputStream is1 = url.openStream();
> while ((baLength = is1.read(ba1)) != -1) {
>     fos1.write(ba1, 0, baLength);
> }
> fos1.flush();
> fos1.close();
> is1.close();
> pdDoc = PDDocument.load("download.pdf");
> parsedText = pdfStripper.getText(pdDoc);
>
> On Fri, Aug 25, 2017 at 12:45 AM, Tilman Hausherr <THausherr@t-online.de>
> wrote:
>
>> On 24.08.2017 at 19:27, Aalok Agrawal wrote:
>>
>>> I have written following code -
>>>
>>> PDFTextStripper pdfStripper = null;
>>> PDDocument pdDoc = null;
>>> COSDocument cosDoc = null;
>>> String parsedText = null;
>>>
>>> URL url = new URL(strURL);
>>> BufferedInputStream file = new BufferedInputStream(url.openStream());
>>> PDFParser parser = new PDFParser(file);
>>>
>>> parser.parse();
>>> cosDoc = parser.getDocument();
>>> pdfStripper = new PDFTextStripper();
>>>
>>> pdDoc = new PDDocument(cosDoc);
>>> parsedText = pdfStripper.getText(pdDoc);
>>>
>>> But it is not fetching content of pdf embedded in browser.
>>>
>> PDFBox can't communicate with your browser.
>>
>> url.openStream()
>>
>> means that the URL content is fetched.
>>
>> Could it be that the PDF is within a www page? I.e. HTML outside, and PDF
>> in a smaller window / frame? Then you'd need to know that URL.
>>
>> Tilman
>>
>>
>>
>>> On Thu, Aug 24, 2017 at 9:08 PM, Gilad Denneboom <gilad.denneboom@gmail.com>
>>> wrote:
>>>
>>>> If you don't know the file's URL or the path of the local temp file to
>>>> which it is saved I don't see how you could do it.
>>>>
>>>> On Thu, Aug 24, 2017 at 4:08 PM, Aalok Agrawal <aaloka@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> I am working on an application where pdf is getting rendered in browser.
>>>>> There is no pdf extension in URL.
>>>>>
>>>>> I have to read the content of the pdf & check some text. Is there any
>>>>> way to do that?
>>>>>
>>>>> Thanks
>>>>> Aalok Agrawal
>>>>>
>>>>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>
>>

