pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Hello, I have a question in extracting Texts from PDF file.
Date Wed, 18 May 2016 07:11:08 GMT
On 18.05.2016 at 04:21, Kay_Lee wrote:
> Hello,
>   
> I'm living in South Korea in Far-East Asia and I'm using Apache PDFBox to extract
> text from PDF files.
> Name: Su-Sang, Lee (English name: Kay Lee)
> Cell Phone: +82-10-3180-7976
> Residence: Seoul, South Korea, Asia
> E-mail: herurider@hotmail.com (or herurider@gmail.com)
>   
> My software development environment is,
>   
> Windows 10, Visual Studio 2015, C#, PDFBox version 1.1.1 (a build of the Apache PDFBox library
> as .NET binaries, available as a NuGet package).
>   
> I can extract text (in our Korean language) from PDF files, with many thanks to the Apache Foundation.
>   
> However, my main concern is that PDFBox takes a little longer to extract text
> than iTextSharp and other competitors.
>   
> All I need is to extract Korean text from PDF files, nothing more.
>
> I searched the internet (Google, Stack Overflow) but found no specific solution,
> only a limited number of cases.
>
> 1) How can I extract text faster?

You can't. Unless you have a "turbo" or "nitro" button on the computer.

Make sure you are opening the files as files and not as streams. But I see 
below that you already do that, i.e. your code is good.
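
To check where the time actually goes before comparing libraries, it can help to time the extraction call itself. The following is a minimal sketch in plain Java (the language of the PDFBox API that the .NET build wraps) and is not part of PDFBox: `ExtractionTimer`, `median` and `medianNanos` are hypothetical names, and the task passed in is assumed to be any function that returns the extracted text, for example a wrapper around the ExtractTextFromPdf method quoted further down.

```java
import java.util.Arrays;
import java.util.function.Supplier;

/** Hypothetical helper, not part of PDFBox: times any text-extraction call. */
final class ExtractionTimer {

    private ExtractionTimer() {}

    /** Median of the samples; for an even count, the lower of the two middle values. */
    static long median(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        return sorted[(sorted.length - 1) / 2];
    }

    /**
     * Runs the task `warmup` times untimed (so class loading and JIT compilation
     * do not distort the result), then `runs` times timed, and returns the
     * median duration in nanoseconds.
     */
    static long medianNanos(Supplier<String> task, int warmup, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        for (int i = 0; i < warmup; i++) {
            task.get();
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            String text = task.get();
            samples[i] = System.nanoTime() - start;
            // Use the result so the call is not treated as dead code.
            if (text == null) {
                throw new IllegalStateException("task returned null");
            }
        }
        return median(samples);
    }
}
```

For example, `ExtractionTimer.medianNanos(() -> extract(path), 3, 10)` would report the median of ten timed runs after three warm-up runs, where `extract` is a placeholder for whatever extraction call is being measured. Running the same harness against each library on the same files gives numbers that can be compared directly.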

> 2) And do I need the whole library, with more than 30 MB of files, if I only need to extract
> text?

Of PDFBox itself, you need pdfbox, fontbox and commons-logging. If files are 
encrypted, then also Bouncy Castle. You won't need xmp and the image 
libraries. See also here:
https://pdfbox.apache.org/1.8/dependencies.html

> If I only need some specific DLL files among all the PDFBox DLL files, could
> you please let me know which ones?
>
> 3) Is it still OK to use PDFBox 1.1.1? There seem to be more recent versions such as 1.8.12 and 2.0.1.

Indeed. However, there is no official .NET release, i.e. none of the 
"very active developers" is currently using that one (an older release 
is here: http://pdfbox.lehmi.de/ ). And I doubt the newer versions will be faster. 
However, they'll extract better.

There is a guide from 2012 on creating the DLLs:
https://web.archive.org/web/20120204060917/http://pdfbox.apache.org/userguide/dot_net.html
but I don't know if it still works.

See also this: http://www.squarepdf.net/pdfbox-in-net
https://stackoverflow.com/questions/8441991/how-to-build-pdfbox-for-net

>   
> I don't belong to any company or organization; I'm just a private person developing
> software to be distributed and used for free for 5 years, for public benefit. As my
> major is not software-related but biochemistry, please bear with me and explain
> in as much detail as you can.

If you're non-profit and willing to distribute the source code, you can 
use iText, see here: http://itextpdf.com/AGPL

>
> My simple code to extract text from a PDF file is:
>
> internal static string ExtractTextFromPdf(string path)
> {
>     PDDocument doc = null;
>     try
>     {
>         // Load from a file path rather than a stream.
>         doc = PDDocument.load(path);
>         PDFTextStripper stripper = new PDFTextStripper();
>         // Do not filter out duplicate overlapping text.
>         stripper.setSuppressDuplicateOverlappingText(false);
>         return stripper.getText(doc);
>     }
>     finally
>     {
>         // Always release the document, even if extraction fails.
>         if (doc != null)
>         {
>             doc.close();
>         }
>     }
> }

Yes that code is fine.

Tilman

>   
> I hope for your kind support.
>
> Thank you so much !
>
> Mr. Su-Sang, Lee (Kay Lee)
> +82-10-3180-7976
> herurider@hotmail.com
>   
>   		 	   		



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

