pdfbox-users mailing list archives

From Michael Baehr <code...@googlemail.com>
Subject Re: Save URLs to PDFs?
Date Fri, 05 Nov 2010 23:18:03 GMT
From the JDK docs:

FileWriter is meant for writing streams of characters. For writing streams of raw bytes, consider
using a FileOutputStream.
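
To make the distinction concrete, here is a minimal sketch that pushes the eight bytes from the "good" hex dump quoted below through a Writer. It assumes the machine's default encoding was windows-1252 (a guess based on the C:/ path; the thread does not say), and the class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

public class EncodingDemo {
    // Mimics the FileWriter loop from the thread: every byte value is handed
    // to a Writer as if it were a character, then encoded with the charset.
    static byte[] writeAsChars(byte[] input, String charsetName) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, Charset.forName(charsetName))) {
            for (byte b : input) {
                writer.write(b & 0xFF); // same as out.write(next) with next = in.read()
            }
        }
        return sink.toByteArray();
    }

    // Formats bytes the way the hex dumps in the thread are written.
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // First bytes of the compressed stream in the "good" file.
        byte[] good = {0x78, (byte) 0xDA, (byte) 0x94, 0x55, 0x0B, 0x4C, 0x53, 0x57};
        // Assumption: windows-1252 stands in for the platform default encoding.
        System.out.println("via Writer: " + hex(writeAsChars(good, "windows-1252")));
        System.out.println("raw bytes:  " + hex(good));
    }
}
```

If the assumption about the encoding holds, the "via Writer" line should show 3F where the original has 94, which is the same difference as between the bad and good dumps.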

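Characters get replaced depending on your platform's character encoding: a value that has no
mapping in that encoding is written as '?' (0x3F), which is the 94 -> 3F change in the hex dump
below. You must ensure you're writing bytes and not characters!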
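
A minimal sketch of the byte-oriented version, using FileOutputStream as the JDK note suggests. The class name ByteCopy, the helper names, and the command-line arguments are hypothetical; the download itself is not tested here and the URL is whatever you pass in.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

public class ByteCopy {
    // Copies raw bytes from in to out; no character encoding is involved.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    // Downloads the resource at 'address' and saves it to 'fileName' as bytes.
    static long download(String address, String fileName) throws IOException {
        try (InputStream in = new URL(address).openStream();
             OutputStream out = new FileOutputStream(fileName)) {
            return copy(in, out);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java ByteCopy <url> <output-file>");
            return;
        }
        long bytes = download(args[0], args[1]);
        System.out.println("Wrote " + bytes + " bytes to " + args[1]);
    }
}
```

The only change that matters compared with the code quoted below is the switch from FileWriter to FileOutputStream; the buffer and try-with-resources are conveniences.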

Michael

On 5. Nov 2010, at 18:14, Grant Overby wrote:

> First difference (on second line, first line is for reference point):
> 
> bad:
> <</Length 1372/E 1779/Filter/FlateDecode/I 1811/L 1795/O 1741/S 1423/T
> 1676/V 1757>>stream
> xÚ?U LSW >?Û O)]Wä!Ô>?"CATl?4PkADy ? ?RjgÊ??< õ A
> 
> Start of second line in hex:   78 DA 3F 55 0B 4C 53 57
> 
> good:
> <</Length 1372/E 1779/Filter/FlateDecode/I 1811/L 1795/O 1741/S 1423/T
> 1676/V 1757>>stream
> xÚ”U LSW >—Û O)]Wä!Ô>˜"CATl”4PkADy ‹ –Rjgʈˆ< õ A
> 
> Start of second line in hex:   78 DA 94 55 0B 4C 53 57
> 
> 
> 
> 
> Isolated incorrect single characters are throughout the document.
> Downloading it multiple times shows consistent errors.
> 
> 
> I'll keep thinking on it, but nothing is apparent to me. This shouldn't
> happen afaik.
> 
> 
> Anyone?
> 
> --
> Grant Overby
> Senior Developer
> FloorSoft, Inc.
> 
> Often people, especially computer engineers, focus on the machines. They
> think, "By doing this, the machine will run faster. By doing this, the
> machine will run more effectively. By doing this, the machine will something
> something something." They are focusing on machines. But in fact we need to
> focus on humans, on how humans care about doing programming or operating the
> application of the machines. We are the masters. They are the slaves. --
> Yukihiro Matsumoto
> 
> 
> 
> 
> On Fri, Nov 5, 2010 at 6:58 PM, Yogesh <yogeshp08@gmail.com> wrote:
> 
>> Thanks Grant.
>> But I have thousands of PDF URLs like this. I have tried around 12 so far.
>> Can all of them be corrupt?
>> 
>> What can I do about this?
>> 
>> 
>> - Yogesh
>> 
>> 
>> 
>> 
>> On 5 November 2010 18:53, Grant Overby <grant@floorsoft.com> wrote:
>> 
>>> I ran the code [2]. The pdf is corrupted by the code as MD5s are
>>> different.
>>> File sizes are identical [1];
>>> 
>>> 1:
>>> 11/05/2010  06:47 PM         2,371,050 msb201055.pdf
>>> 11/05/2010  06:46 PM         2,371,050 My.pdf
>>> 
>>> 
>>> 
>>> 2:
>>> package s;
>>> 
>>> import java.io.FileWriter;
>>> import java.io.InputStream;
>>> import java.io.IOException;
>>> import java.net.URL;
>>> import java.net.URLConnection;
>>> import java.net.MalformedURLException;
>>> 
>>> public class Main
>>> {
>>>   public static void main(String[] args) throws IOException
>>>   {
>>>     URL url = new URL("http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2947364/pdf/msb201055.pdf?tool=pmcentrez");
>>> 
>>>     URLConnection con = url.openConnection();
>>> 
>>>     InputStream in = con.getInputStream();
>>> 
>>>     FileWriter out = new FileWriter("C:/My.pdf");
>>> 
>>>     int next = 0;
>>>     while ( ( next = in.read() ) != -1  ) {
>>>       out.write(next);
>>>     }
>>>     out.flush();
>>>     out.close();
>>>     in.close();
>>>   }
>>> }
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Grant Overby
>>> Senior Developer
>>> FloorSoft, Inc.
>>> 
>>> 
>>> 
>>> 
>>> On Fri, Nov 5, 2010 at 6:45 PM, <Adam@swmc.com> wrote:
>>> 
>>>> Yogesh,
>>>> 
>>>> Compare the file size and hash (SHA1, MD5, etc.) of the file you download
>>>> from your browser with the file that Java downloads.  The end of the file
>>>> may be missing when you download it via Java.  I know you said the file
>>>> size is correct, but is it the *exact* same number of bytes?  If so, then
>>>> the content must be different, and it should just be a matter of running
>>>> `diff` on the files to see what's going wrong.
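
The comparison suggested here can also be done from Java. The following is a rough sketch, not a replacement for `diff`: it prints the size and MD5 of two files and the offset of the first byte that differs. The class name FileCompare and its helpers are hypothetical, and it reads both files fully into memory, which is fine for a PDF of a few megabytes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileCompare {
    // Returns the digest of 'data' as lowercase hex, e.g. for "MD5" or "SHA-1".
    static String digestHex(byte[] data, String algorithm) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Returns the offset of the first differing byte, or -1 if the arrays are
    // identical. If one array is a prefix of the other, returns the shorter length.
    static int firstDifference(byte[] a, byte[] b) {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return a.length == b.length ? -1 : common;
    }

    public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        if (args.length != 2) {
            System.err.println("usage: java FileCompare <file1> <file2>");
            return;
        }
        byte[] first = Files.readAllBytes(Paths.get(args[0]));
        byte[] second = Files.readAllBytes(Paths.get(args[1]));
        System.out.println(args[0] + ": " + first.length + " bytes, MD5 " + digestHex(first, "MD5"));
        System.out.println(args[1] + ": " + second.length + " bytes, MD5 " + digestHex(second, "MD5"));
        System.out.println("first difference at offset: " + firstDifference(first, second));
    }
}
```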
>>>> 
>>>> ----
>>>> Thanks,
>>>> Adam
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> From:
>>>> Yogesh <yogeshp08@gmail.com>
>>>> To:
>>>> grant@floorsoft.com
>>>> Cc:
>>>> users@pdfbox.apache.org
>>>> Date:
>>>> 11/05/2010 15:29
>>>> Subject:
>>>> Re: Save URLs to PDFs?
>>>> 
>>>> 
>>>> 
>>>> Yes. I can download the file through the browser. It works perfectly fine.
>>>> 
>>>> - Yogesh
>>>> 
>>>> 
>>>> 
>>>> On 5 November 2010 18:25, Grant Overby <grant@floorsoft.com> wrote:
>>>> 
>>>>> If you download the file through a browser, does it work then?
>>>>> 
>>>>> 
>>>>> --
>>>>> Grant Overby
>>>>> Senior Developer
>>>>> FloorSoft, Inc.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, Nov 5, 2010 at 6:18 PM, Yogesh <yogeshp08@gmail.com> wrote:
>>>>> 
>>>>>> I tried that; it writes a blank PDF, though the file size and the
>>>>>> number of pages are correct (for the newly written file).
>>>>>> 
>>>>>> - Yogesh
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 5 November 2010 18:09, Grant Overby <grant@floorsoft.com> wrote:
>>>>>> 
>>>>>>> You don't need pdfBox to do this. Below is some rough code that allows
>>>>>>> you to download a file and save it.
>>>>>>> 
>>>>>>> URLConnection urlConnection = new URL("http://...").openConnection();
>>>>>>> InputStream   in      = urlConnection.getInputStream();
>>>>>>> FileWriter out = new FileWriter("my.pdf");
>>>>>>> int next = 0;
>>>>>>> while ( ( next = in.read() ) != -1  ) out.write(next);
>>>>>>> //close everything
>>>>>>> 
>>>>>>> --
>>>>>>> Grant Overby
>>>>>>> Senior Developer
>>>>>>> FloorSoft, Inc.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Nov 5, 2010 at 5:56 PM, Yogesh <yogeshp08@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I have PDFs which I can access through URLs. I want to download and
>>>>>>>> save them to files. How can I go about it?
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> 
>>>>>>>> -Yogesh
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 

