pdfbox-users mailing list archives

From Kay_Lee <heruri...@hotmail.com>
Subject RE: Hello, I have a question in extracting Texts from PDF file.
Date Tue, 24 May 2016 01:39:39 GMT
Dear Mr. Tilman Hausherr, 
 
Please kindly accept my deep apology.
 
And I cordially thank you for your quick, excellent, and delightful answer.
 
So far I have looked only at the Stack Overflow link, but I will check all the links you
suggested.
 
My major is biochemistry, not software, and I am finalizing the development of my
application these days.
I therefore have to take care of everything from A to Z, a million matters, and things have
been really hectic.
Please kindly understand.
 
I haven't yet fully checked all of your links, but it doesn't make sense to me that I should
need so many DLL files just to extract text from a PDF.
(But I'm really satisfied with the quality of PDFBox)
 
I hope you can also develop a 'nitro turbo' button as a library (.dll).

 
Again, my deepest appreciation to you.
 
All the best !
 
Truthfully yours,

Mr. Su-Sang, Lee (Kay Lee)
+82-10-3180-7976
herurider@hotmail.com

 
> Subject: Re: Hello, I have a question in extracting Texts from PDF file.
> To: users@pdfbox.apache.org
> From: THausherr@t-online.de
> Date: Wed, 18 May 2016 09:11:08 +0200
> 
> Am 18.05.2016 um 04:21 schrieb Kay_Lee:
> > Hello,
> >   
> > I live in South Korea, in East Asia, and I'm using Apache PDFBox to extract
> > text from PDF files.
> > Name: Su-Sang, Lee (English name: Kay Lee)
> > Cell Phone: +82-10-3180-7976
> > Residence: Seoul, South Korea, Asia
> > E-mail: herurider@hotmail.com (or herurider@gmail.com)
> >   
> > My software development environment is,
> >   
> > Windows 10, Visual Studio 2015, C#, PDFBox version 1.1.1 (a build of the Apache PDFBox
> > library as .NET binaries, available as a NuGet package).
> >   
> > I can extract text (in Korean) from PDF files, with many thanks to the Apache
> > Foundation.
> >   
> > However, my main concern is that PDFBox takes a little longer to extract text
> > than iTextSharp and other competitors.
> >   
> > All I need is to extract Korean text from PDF files, nothing more.
> >
> > I searched the internet, including Google and Stack Overflow, but found no specific
> > solution and only a few similar cases.
> >
> > 1) How can I extract text faster?
> 
> You can't. Unless you have a "turbo" or "nitro" button on the computer.
> 
> Make sure you are opening the files as files and not as streams. But I see
> below that you already do that, i.e. your code is good.
> 
> > 2) And do I need the whole library, with more than 30 MB of files, if I only need to
> > extract text?
> 
> Of PDFBox itself, you need pdfbox, fontbox, and logging. If files are
> encrypted, then also bouncy castle. You won't need xmp and the image
> libraries. See also here:
> https://pdfbox.apache.org/1.8/dependencies.html
> 
> > If I only need specific DLL files among all the PDFBox DLL files, could you please
> > let me know which ones?
> >
> > 3) Is it still OK to use PDFBox 1.1.1? There seem to be more recent versions, such as
> > 1.8.12 and 2.0.1.
> 
> Indeed. However, there is no official .NET release, i.e. none of the
> "very active developers" is currently using that one (an older release
> is here: http://pdfbox.lehmi.de/ ). And I doubt the newer versions will
> be faster. However, they'll extract better.
> 
> There is a guide from 2012 to create the dlls:
> https://web.archive.org/web/20120204060917/http://pdfbox.apache.org/userguide/dot_net.html
> but I don't know if it works.
> 
> See also this: http://www.squarepdf.net/pdfbox-in-net
> https://stackoverflow.com/questions/8441991/how-to-build-pdfbox-for-net
> 
> >   
> > I don't belong to any company or organization; I am a private individual developing
> > software to be distributed and used free of charge for 5 years, for public benefit. As
> > my major is biochemistry, not software, please kindly bear with me and explain in as
> > much detail as you can.
> 
> If you're non-profit and willing to distribute the source code, you can
> use iText, see here: http://itextpdf.com/AGPL
> 
> >
> > My simple code to extract text from a PDF file is:
> >
> > internal static string ExtractTextFromPdf(string path)
> > {
> >     PDDocument doc = null;
> >     try
> >     {
> >         doc = PDDocument.load(path);
> >         PDFTextStripper stripper = new PDFTextStripper();
> >         stripper.setSuppressDuplicateOverlappingText(false);
> >         return stripper.getText(doc);
> >     }
> >     finally
> >     {
> >         if (doc != null)
> >         {
> >             doc.close();
> >         }
> >     }
> > }
> 
> Yes that code is fine.
> 
> Tilman
> 
> >   
> > I hope for your kind and excellent support.
> >
> > Thank you so much !
> >
> > Mr. Su-Sang, Lee (Kay Lee)
> > +82-10-3180-7976
> > herurider@hotmail.com
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
> 
 		 	   		  