pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: How to extract pdf content from a html page
Date Sat, 26 Aug 2017 11:33:42 GMT
On 26.08.2017 at 12:30, Aalok Agrawal wrote:
> I mentioned one approach in my first email, then I mentioned another
> approach in my last email. Both are not working. I meant to say that. URL
> is not public, so won't be able to share. As per your suggestion, I opened
> download.pdf in text editor & found that it is not a pdf but a login page
> of my site.
>
> So I have to write code to pass on credentials, so that it can proceed with
> authentication. Is there a way to pass on credentials using the PDFBox API?

No, PDFBox is not about doing logins to websites.

The login will likely be done via an HTTP POST request. You'll need to either
analyze the HTML of that website to see what fields are needed, or ask them.
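
A minimal sketch of that idea, using only the JDK: post the login form, keep the session cookie, fetch the document, and check for the "%PDF" signature before handing the bytes to PDFBox. The URLs and the field names "username" and "password" are assumptions for illustration; the real names have to be read from the login form's HTML, and a site may require extra hidden fields (for example a CSRF token) that this sketch does not handle.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class LoginThenFetchPdf {

    /** Encodes form fields as application/x-www-form-urlencoded. */
    static String formEncode(Map<String, String> fields) {
        try {
            StringBuilder sb = new StringBuilder();
            for (Map.Entry<String, String> e : fields.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
            return sb.toString();
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a JVM
            throw new IllegalStateException(ex);
        }
    }

    /** True if the data starts with the PDF signature "%PDF". */
    static boolean looksLikePdf(byte[] data) {
        return data.length >= 4
                && data[0] == '%' && data[1] == 'P' && data[2] == 'D' && data[3] == 'F';
    }

    /** Reads a stream fully into memory. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Sends the login form; session cookies are stored by the default CookieHandler. */
    static void postLogin(String loginUrl, Map<String, String> fields) throws IOException {
        byte[] body = formEncode(fields).getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn = (HttpURLConnection) new URL(loginUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body);
        }
        int status = conn.getResponseCode();
        if (status >= 400) {
            throw new IOException("Login request failed: HTTP " + status);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 4) {
            System.out.println("usage: LoginThenFetchPdf <loginUrl> <pdfUrl> <user> <password>");
            return;
        }
        // Reuse cookies from the login response on the following request.
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", args[2]); // hypothetical field name
        fields.put("password", args[3]); // hypothetical field name
        postLogin(args[0], fields);

        byte[] data;
        try (InputStream in = new URL(args[1]).openStream()) {
            data = readAll(in);
        }
        System.out.println(looksLikePdf(data)
                ? "Got a PDF, " + data.length + " bytes"
                : "Not a PDF (possibly still the login page)");
    }
}
```

If the check passes, the bytes can be given to PDFBox, for example through `PDDocument.load(InputStream)` in the 1.8.x and 2.0.x lines, and then to `PDFTextStripper.getText` as in the code quoted below.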

Tilman


>
> On Fri, Aug 25, 2017 at 9:41 PM, Tilman Hausherr <THausherr@t-online.de>
> wrote:
>
>> On 25.08.2017 at 11:45, Aalok Agrawal wrote:
>>
>>> You got it right, the PDF is within a www page. Its URL is known & passed
>>> as a variable (strURL) to the function. I tried another approach to get
>>> the content of the pdf rendered there, but that is also not working -
>>>
>>
>> Is the URL public and freely available? If yes, please mention it so I can
>> test.
>>
>> "but that is also not working" - what does that mean? Do you get an error
>> message, nothing, a JVM crash, a BSOD, ...?
>>
>> What is in that "download.pdf" file? Is this a PDF or is it not? Does it
>> start with "%PDF" or not if you open the file with NOTEPAD++?
>>
>> If it isn't, then it means that your PDF has a different URL. You'll have
>> to look at the html / javascript source code to find out what is going on.
>>
>> Tilman
>>
>>
>>
>>
>>
>>> // Download the URL content to a local file, then parse it with PDFBox.
>>> URL url = new URL(strURL);
>>> URLConnection urlConn = url.openConnection();
>>> byte[] ba1 = new byte[1024];
>>> int baLength;
>>> // try-with-resources closes both streams even if the copy fails
>>> try (InputStream is1 = urlConn.getInputStream();
>>>      FileOutputStream fos1 = new FileOutputStream("download.pdf")) {
>>>     while ((baLength = is1.read(ba1)) != -1) {
>>>         fos1.write(ba1, 0, baLength);
>>>     }
>>> }
>>> // load(File) is available in both PDFBox 1.8.x and 2.0.x
>>> pdDoc = PDDocument.load(new File("download.pdf"));
>>> parsedText = pdfStripper.getText(pdDoc);
>>>
>>> On Fri, Aug 25, 2017 at 12:45 AM, Tilman Hausherr <THausherr@t-online.de>
>>> wrote:
>>>
>>>> On 24.08.2017 at 19:27, Aalok Agrawal wrote:
>>>>> I have written the following code -
>>>>> PDFTextStripper pdfStripper = null;
>>>>> PDDocument pdDoc = null;
>>>>> COSDocument cosDoc = null;
>>>>> String parsedText = null;
>>>>>
>>>>> URL url = new URL(strURL);
>>>>> BufferedInputStream file = new BufferedInputStream(url.openStream());
>>>>> PDFParser parser = new PDFParser(file);
>>>>>
>>>>> parser.parse();
>>>>> cosDoc = parser.getDocument();
>>>>> pdfStripper = new PDFTextStripper();
>>>>>
>>>>> pdDoc = new PDDocument(cosDoc);
>>>>> parsedText = pdfStripper.getText(pdDoc);
>>>>>
>>>>> But it is not fetching content of pdf embedded in browser.
>>>>>
>>>> PDFBox can't communicate with your browser.
>>>>
>>>> url.openStream()
>>>>
>>>> means that the URL content is fetched.
>>>>
>>>> Could it be that the PDF is within a www page? I.e. HTML outside, and PDF
>>>> in a smaller window / frame? Then you'd need to know that URL.
>>>>
>>>> Tilman
>>>>
>>>>
>>>>
>>>>> On Thu, Aug 24, 2017 at 9:08 PM, Gilad Denneboom <gilad.denneboom@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> If you don't know the file's URL or the path of the local temp file to
>>>>>> which it is saved I don't see how you could do it.
>>>>>>
>>>>>> On Thu, Aug 24, 2017 at 4:08 PM, Aalok Agrawal <aaloka@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>> I am working on an application where pdf is getting rendered in
>>>>>>> browser.
>>>>>>> There is no pdf extension in URL.
>>>>>>>
>>>>>>> I have to read the content of the pdf & check some text. Is there any
>>>>>>> way to do that.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Aalok Agrawal
>>>>>>>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>
>>>>
>>>>

