pdfbox-users mailing list archives

From Kay_Lee <heruri...@hotmail.com>
Subject Hello, I have a question in extracting Texts from PDF file.
Date Wed, 18 May 2016 02:21:00 GMT
Hello,
 
I live in South Korea and am using Apache PDFBox to extract text
from PDF files.
Name: Su-Sang, Lee (English name: Kay Lee)
Cell Phone: +82-10-3180-7976
Residence: Seoul, South Korea, Asia
E-mail: herurider@hotmail.com (or herurider@gmail.com)
 
My software development environment is:
 
Windows 10, Visual Studio 2015, C#, PDFBox version 1.1.1 (a build of the Apache PDFBox library for
.NET, available as a NuGet package).
 
I can extract text (in Korean) from PDF files, with many thanks to the Apache Foundation.
 
However, my main concern is that PDFBox takes somewhat longer to extract text than
iTextSharp and other competing libraries.
 
All I need is to extract Korean text from PDF files, nothing more.

I searched the internet, including Google and Stack Overflow, but found only a few
related cases and no specific solution.

1) How can I extract text faster?
 
2) Do I need the whole library, which is more than 30 MB of files, if I only need to extract
text?
If I only need some specific DLL files out of all the PDFBox DLL files, could you
please let me know which ones?

3) Is it still OK to use PDFBox 1.1.1? There seem to be more recent versions, such as 1.8.12 and 2.0.1.
 
I don't belong to any company or organization. I am a private individual developing
software to be distributed and used free of charge for 5 years, for public benefit. As my major
is biochemistry rather than anything software-related, please bear with me and explain in
as much detail as you can.

My simple code to extract text from a PDF file is:

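internal static string ExtractTextFromPdf(string path)
{
    PDDocument doc = null;
    try
    {
        // Load the PDF from the given file path.
        doc = PDDocument.load(path);
        PDFTextStripper stripper = new PDFTextStripper();
        // Keep overlapping duplicate text instead of suppressing it.
        stripper.setSuppressDuplicateOverlappingText(false);
        return stripper.getText(doc);
    }
    finally
    {
        // Always release the document, even if extraction fails.
        if (doc != null)
        {
            doc.close();
        }
    }
}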
 
I hope for your kind support.

Thank you so much!

Mr. Su-Sang, Lee (Kay Lee)
+82-10-3180-7976
herurider@hotmail.com
 
 		 	   		  