pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: How to extract pdf content from a html page
Date Sat, 26 Aug 2017 11:37:59 GMT
On 26.08.2017 at 13:33, Tilman Hausherr wrote:
> On 26.08.2017 at 12:30, Aalok Agrawal wrote:
>> I mentioned one approach in my first email, then I mentioned another
>> approach in my last email. Both are not working. I meant to say that.
>> The URL is not public, so I won't be able to share it. As per your
>> suggestion, I opened download.pdf in a text editor & found that it is
>> not a pdf but a login page of my site.
>>
>> So I have to write code to pass on credentials, so that it can proceed
>> with authentication. Is there a way to pass on credentials using the
>> pdfbox API?
>
> No, PDFBox is about doing logins to websites.

Meant to write "PDFBox is NOT about doing logins to websites".

Tilman

>
> Likely this will be by http POST request, you'll need to either 
> analyze the HTML of that website to see what fields are needed, or ask 
> them.
>
> Tilman
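
The HTTP POST login mentioned above could be sketched roughly as follows, using only the JDK. This is a minimal sketch under assumptions, not a description of how any particular site works: the class name LoginSketch, the parameters loginUrl, fields and pdfUrl, and the form field names are all hypothetical, and the real field names have to come from the login page's HTML. Sites that use CSRF tokens or JavaScript-driven logins will need more than this.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

class LoginSketch {

    // Builds an application/x-www-form-urlencoded body from name/value pairs.
    static String formBody(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : fields.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    // Posts the login form, then opens the PDF URL in the same cookie session.
    // Assumption: the site keeps the login in a session cookie.
    static InputStream loginAndOpen(String loginUrl, Map<String, String> fields,
                                    String pdfUrl) throws IOException {
        // Note: this installs a JVM-wide cookie store.
        CookieHandler.setDefault(new CookieManager());

        HttpURLConnection post = (HttpURLConnection) new URL(loginUrl).openConnection();
        post.setRequestMethod("POST");
        post.setDoOutput(true);
        post.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = post.getOutputStream()) {
            out.write(formBody(fields).getBytes(StandardCharsets.UTF_8));
        }
        post.getResponseCode(); // forces the request to be sent

        // The returned stream is what would then be handed to PDFBox.
        return new URL(pdfUrl).openStream();
    }
}
```

The stream returned by loginAndOpen is still worth checking for the "%PDF" marker before parsing, since a failed login would return the login page again.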
>
>
>>
>> On Fri, Aug 25, 2017 at 9:41 PM, Tilman Hausherr <THausherr@t-online.de>
>> wrote:
>>
>>> On 25.08.2017 at 11:45, Aalok Agrawal wrote:
>>>
>>>> You got it right, the PDF is within a www page. And its URL is known &
>>>> passed as a variable (strURL) to the function. Here is another approach
>>>> which I tried to get the content of the pdf rendered there, but that is
>>>> also not working -
>>>>
>>>
>>> Is the URL public and freely available? If yes, please mention it so I
>>> can test.
>>>
>>> "but that is also not working" - what does that mean? Do you get an error
>>> message, nothing, a JVM crash, a BSOD, ...?
>>>
>>> What is in that "download.pdf" file? Is this a PDF or is it not? Does it
>>> start with "%PDF" or not if you open the file with NOTEPAD++?
>>>
>>> If it isn't, then it means that your PDF has a different URL. You'll have
>>> to look at the html / javascript source code to find out what is going on.
>>>
>>> Tilman
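
The "%PDF" check described above can also be done in code before the file is handed to PDFBox. The following is a minimal sketch using only the JDK; the class and method names are hypothetical. It is deliberately strict and only looks at the first four bytes, so a file with stray bytes in front of the marker would be reported as not a PDF.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

class PdfHeaderCheck {

    // True if the given bytes begin with the ASCII marker "%PDF".
    static boolean startsWithPdfMarker(byte[] head) {
        byte[] magic = "%PDF".getBytes(StandardCharsets.US_ASCII);
        if (head.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (head[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads up to the first four bytes of a file and applies the check.
    static boolean fileLooksLikePdf(String path) throws IOException {
        byte[] head = new byte[4];
        int n = 0;
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break; // file is shorter than four bytes
                }
                n += r;
            }
        }
        return startsWithPdfMarker(Arrays.copyOf(head, n));
    }
}
```

If fileLooksLikePdf("download.pdf") returns false, the download was most likely an HTML page (for example a login form) rather than the PDF itself.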
>>>
>>>
>>>
>>>
>>>
>>>> byte[] ba1 = new byte[1024];
>>>> int baLength;
>>>> FileOutputStream fos1 = new FileOutputStream("download.pdf");
>>>> URL url = new URL(strURL);
>>>> URLConnection urlConn = url.openConnection();
>>>>
>>>> InputStream is1 = url.openStream();
>>>> while ((baLength = is1.read(ba1)) != -1) {
>>>>     fos1.write(ba1, 0, baLength);
>>>> }
>>>> fos1.flush();
>>>> fos1.close();
>>>> is1.close();
>>>> pdDoc = PDDocument.load("download.pdf");
>>>> parsedText = pdfStripper.getText(pdDoc);
>>>>
>>>> On Fri, Aug 25, 2017 at 12:45 AM, Tilman Hausherr 
>>>> <THausherr@t-online.de>
>>>> wrote:
>>>>
>>>>> On 24.08.2017 at 19:27, Aalok Agrawal wrote:
>>>>>> I have written the following code -
>>>>>> PDFTextStripper pdfStripper = null;
>>>>>> PDDocument pdDoc = null;
>>>>>> COSDocument cosDoc = null;
>>>>>> String parsedText = null;
>>>>>>
>>>>>> URL url = new URL(strURL);
>>>>>> BufferedInputStream file = new BufferedInputStream(url.openStream());
>>>>>> PDFParser parser = new PDFParser(file);
>>>>>>
>>>>>> parser.parse();
>>>>>> cosDoc = parser.getDocument();
>>>>>> pdfStripper = new PDFTextStripper();
>>>>>>
>>>>>> pdDoc = new PDDocument(cosDoc);
>>>>>> parsedText = pdfStripper.getText(pdDoc);
>>>>>>
>>>>>> But it is not fetching content of pdf embedded in browser.
>>>>>>
>>>>> PDFBox can't communicate with your browser.
>>>>> url.openStream()
>>>>>
>>>>> means that the URL content is fetched.
>>>>>
>>>>> Could it be that the PDF is within a www page? I.e. HTML outside, and PDF
>>>>> in a smaller window / frame? Then you'd need to know that URL.
>>>>>
>>>>> Tilman
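
One way to look for the URL of a PDF shown inside a frame is to fetch the outer HTML and list the addresses that frame-like tags point to. The following is a rough sketch using only the JDK; the class name EmbeddedSrcFinder and its method are hypothetical, and a regular expression is only an approximation of HTML parsing. It will miss URLs that are assembled by JavaScript, and relative results still have to be resolved against the page URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmbeddedSrcFinder {

    // Matches src="..." on iframe/embed/frame tags and data="..." on object tags.
    private static final Pattern SRC = Pattern.compile(
            "<(?:iframe|embed|frame)\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']"
            + "|<object\\b[^>]*?\\bdata\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the candidate URLs in document order, exactly as written in the HTML.
    static List<String> candidateUrls(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = SRC.matcher(html);
        while (m.find()) {
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return out;
    }
}
```

Each candidate can then be downloaded and checked for the "%PDF" marker to see which one is the actual document.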
>>>>>
>>>>>
>>>>>
>>>>>> On Thu, Aug 24, 2017 at 9:08 PM, Gilad Denneboom <gilad.denneboom@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> If you don't know the file's URL or the path of the local temp file to
>>>>>>> which it is saved I don't see how you could do it.
>>>>>>>
>>>>>>> On Thu, Aug 24, 2017 at 4:08 PM, Aalok Agrawal <aaloka@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am working on an application where pdf is getting rendered in
>>>>>>>> browser. There is no pdf extension in URL.
>>>>>>>>
>>>>>>>> I have to read the content of the pdf & check some text. Is there any
>>>>>>>> way to do that.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Aalok Agrawal
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------

>>>>>>>>
>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>
>>>>>
>>>>>
>>>
>>>
>
>
>



