pdfbox-users mailing list archives

From Grant Overby <gr...@floorsoft.com>
Subject Re: Save URLs to PDFs?
Date Fri, 05 Nov 2010 23:14:34 GMT
First difference (it is on the second line; the first line is shown as a reference point):

bad:
<</Length 1372/E 1779/Filter/FlateDecode/I 1811/L 1795/O 1741/S 1423/T
1676/V 1757>>stream
xÚ?U LSW >?Û O)]Wä!Ô>?"CATl?4PkADy ? ?RjgÊ??< õ A

Start of second line in hex:   78 DA 3F 55 0B 4C 53 57

good:
<</Length 1372/E 1779/Filter/FlateDecode/I 1811/L 1795/O 1741/S 1423/T
1676/V 1757>>stream
xÚ”U LSW >—Û O)]Wä!Ô>˜"CATl”4PkADy ‹ –Rjgʈˆ< õ A

Start of second line in hex:   78 DA 94 55 0B 4C 53 57




Isolated incorrect single characters appear throughout the document.
Downloading it multiple times shows consistent errors.
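
The comparison above can be automated. Below is a minimal sketch, assuming both copies fit in memory (they are about 2.3 MB here). The class name ByteDiff, its method names, and the command-line arguments are illustrative and not part of the original code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ByteDiff {
    /** Returns the offset of the first differing byte, or -1 if the arrays are identical. */
    public static int firstDifference(byte[] a, byte[] b) {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        // Same prefix: either identical, or one array is longer than the other.
        return a.length == b.length ? -1 : common;
    }

    /** Counts differing positions; any extra trailing bytes count as differences too. */
    public static int countDifferences(byte[] a, byte[] b) {
        int common = Math.min(a.length, b.length);
        int count = Math.abs(a.length - b.length);
        for (int i = 0; i < common; i++) {
            if (a[i] != b[i]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java ByteDiff <good-file> <bad-file>");
            return;
        }
        byte[] good = Files.readAllBytes(Paths.get(args[0]));
        byte[] bad = Files.readAllBytes(Paths.get(args[1]));
        int first = firstDifference(good, bad);
        if (first < 0) {
            System.out.println("files are identical");
            return;
        }
        System.out.println("first difference at offset " + first);
        if (first < good.length && first < bad.length) {
            System.out.printf("good: %02X  bad: %02X%n", good[first] & 0xFF, bad[first] & 0xFF);
        }
        System.out.println("differing bytes: " + countDifferences(good, bad));
    }
}
```

Listing which byte values get replaced, rather than only where, may show whether the damage is limited to a particular range of values.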


I'll keep thinking on it, but nothing is apparent to me. This shouldn't
happen afaik.


Anyone?
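
One candidate explanation worth checking: FileWriter is a character stream, so every value passed to write(int) is treated as a char and encoded with the platform default charset. If that default is windows-1252 (an assumption, suggested only by the C:/ path), chars U+0080 to U+009F have no mapping and come out as '?' (0x3F). That would fit the 94 -> 3F change above and the identical file sizes. The sketch below reproduces the effect in memory and copies with byte streams instead. The class name BinaryDownload and its methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.URL;

public class BinaryDownload {
    /** Copies raw bytes from in to out without any character encoding step. */
    public static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    /**
     * Mimics the Writer-based loop from the thread, but with an explicit charset
     * and in-memory streams, so the effect can be observed without a download.
     */
    public static byte[] copyThroughWriter(byte[] data, String charsetName) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        InputStream in = new ByteArrayInputStream(data);
        try (Writer out = new OutputStreamWriter(sink, charsetName)) {
            int next;
            while ((next = in.read()) != -1) {
                out.write(next); // the byte value is written as a char, then encoded
            }
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java BinaryDownload <url> <output-file>");
            return;
        }
        // Byte streams on both sides: InputStream in, FileOutputStream out.
        try (InputStream in = new URL(args[0]).openStream();
             OutputStream out = new FileOutputStream(args[1])) {
            copy(in, out);
        }
    }
}
```

If the MD5 of a file saved this way matches the browser download, the Writer was the culprit. If it still differs, the cause lies elsewhere.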

--
Grant Overby
Senior Developer
FloorSoft, Inc.

Often people, especially computer engineers, focus on the machines. They
think, "By doing this, the machine will run faster. By doing this, the
machine will run more effectively. By doing this, the machine will something
something something." They are focusing on machines. But in fact we need to
focus on humans, on how humans care about doing programming or operating the
application of the machines. We are the masters. They are the slaves. --
Yukihiro Matsumoto




On Fri, Nov 5, 2010 at 6:58 PM, Yogesh <yogeshp08@gmail.com> wrote:

> Thanks Grant.
> But I have thousands of PDF URLs like this. I have tried around 12 so far.
> Can all of them be corrupt?
>
> What can I do about this?
>
>
> - Yogesh
>
>
>
>
> On 5 November 2010 18:53, Grant Overby <grant@floorsoft.com> wrote:
>
>> I ran the code [2]. The PDF is corrupted by the code, as the MD5s are
>> different. File sizes are identical [1]:
>>
>> 1:
>> 11/05/2010  06:47 PM         2,371,050 msb201055.pdf
>> 11/05/2010  06:46 PM         2,371,050 My.pdf
>>
>>
>>
>> 2:
>> package s;
>>
>> import java.io.FileWriter;
>> import java.io.InputStream;
>> import java.io.IOException;
>> import java.net.URL;
>> import java.net.URLConnection;
>> import java.net.MalformedURLException;
>>
>> public class Main
>> {
>>   public static void main(String[] args) throws IOException
>>   {
>>     URL url = new URL("http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2947364/pdf/msb201055.pdf?tool=pmcentrez");
>>
>>     URLConnection con = url.openConnection();
>>
>>     InputStream in = con.getInputStream();
>>
>>     FileWriter out = new FileWriter("C:/My.pdf");
>>
>>     int next = 0;
>>     while ( ( next = in.read() ) != -1  ) {
>>       out.write(next);
>>     }
>>     out.flush();
>>     out.close();
>>     in.close();
>>   }
>> }
>>
>>
>>
>>
>> --
>> Grant Overby
>> Senior Developer
>> FloorSoft, Inc.
>>
>>
>>
>>
>>
>> On Fri, Nov 5, 2010 at 6:45 PM, <Adam@swmc.com> wrote:
>>
>> > Yogesh,
>> >
>> > Compare the file size and hash (SHA1, MD5, etc.) of the file you download
>> > from your browser with the file that Java downloads.  The end of the file
>> > may be missing when you download it via Java.  I know you said the file
>> > size is correct, but is it the *exact* same number of bytes?  If so, then
>> > the content must be different, and it should just be a matter of running
>> > `diff` on the files to see what's going wrong.
>> >
>> > ----
>> > Thanks,
>> > Adam
>> >
>> >
>> >
>> >
>> >
>> > From:
>> > Yogesh <yogeshp08@gmail.com>
>> > To:
>> > grant@floorsoft.com
>> > Cc:
>> > users@pdfbox.apache.org
>> > Date:
>> > 11/05/2010 15:29
>> > Subject:
>> > Re: Save URLs to PDFs?
>> >
>> >
>> >
>> > Yes. I can download the file through the browser. It works perfectly fine.
>> >
>> > - Yogesh
>> >
>> >
>> >
>> > On 5 November 2010 18:25, Grant Overby <grant@floorsoft.com> wrote:
>> >
>> > > If you download the file through a browser, does it work then?
>> > >
>> > >
>> > > --
>> > > Grant Overby
>> > > Senior Developer
>> > > FloorSoft, Inc.
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Fri, Nov 5, 2010 at 6:18 PM, Yogesh <yogeshp08@gmail.com> wrote:
>> > >
>> > >> I tried with that, it writes a blank PDF. Though, the file size and the
>> > >> number of pages is correct (for the new written file)
>> > >>
>> > >> - Yogesh
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> On 5 November 2010 18:09, Grant Overby <grant@floorsoft.com> wrote:
>> > >>
>> > >>> You don't need pdfBox to do this. Below is some rough code that allows
>> > >>> you to download a file and save it.
>> > >>>
>> > >>> URLConnection urlConnection = new URL("http://...").openConnection();
>> > >>> InputStream   in      = urlConnection.getInputStream();
>> > >>> FileWriter out = new FileWriter("my.pdf");
>> > >>> int next = 0;
>> > >>> while ( ( next = in.read() ) != -1  ) out.write(next);
>> > >>> //close everything
>> > >>>
>> > >>> --
>> > >>> Grant Overby
>> > >>> Senior Developer
>> > >>> FloorSoft, Inc.
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>> On Fri, Nov 5, 2010 at 5:56 PM, Yogesh <yogeshp08@gmail.com> wrote:
>> > >>>
>> > >>> > Hi,
>> > >>> >
>> > >>> > I have PDFs which I can access through URLs. I want to download and
>> > >>> > save them to files. How can I go about it?
>> > >>> >
>> > >>> > Thanks
>> > >>> >
>> > >>> > -Yogesh
>> > >>> >
>> > >>>
>> > >>
>> > >>
>> > >
>> >
>> >
>> >
>> >
>>
>
>
