pdfbox-users mailing list archives

From John Hewson <j...@jahewson.com>
Subject Re: Spaces are ignored when reading a PDF file
Date Fri, 18 Mar 2016 20:58:19 GMT
The subject of this thread is "Spaces are ignored when reading a PDF file". Please post new
questions in a new thread.

— John

> On 18 Mar 2016, at 04:02, 风云天空 <1010800216@qq.com> wrote:
> 
> Can anyone help me?
> I get this error when running multithreaded:
> java.lang.NullPointerException
> 	at java.awt.color.ICC_Profile.activateDeferredProfile(ICC_Profile.java:1086)
> 	at java.awt.color.ICC_Profile$1.activate(ICC_Profile.java:742)
> 	at sun.java2d.cmm.ProfileDeferralMgr.activateProfiles(ProfileDeferralMgr.java:95)
> 	at java.awt.color.ICC_Profile.getInstance(ICC_Profile.java:775)
> 	at java.awt.color.ICC_Profile.getInstance(ICC_Profile.java:1013)
> 	at org.apache.pdfbox.pdmodel.graphics.color.PDICCBased.loadICCProfile(PDICCBased.java:119)
> 	at org.apache.pdfbox.pdmodel.graphics.color.PDICCBased.<init>(PDICCBased.java:89)
> 	at org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace.create(PDColorSpace.java:182)
> 	at org.apache.pdfbox.pdmodel.PDResources.getColorSpace(PDResources.java:172)
> 	at org.apache.pdfbox.pdmodel.PDResources.getColorSpace(PDResources.java:142)
> 	at org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorSpace.process(SetNonStrokingColorSpace.java:41)
> 	at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:814)
> 	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:471)
> 	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:445)
> 	at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
> 	at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:187)
> 	at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:208)
> 	at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:139)
> 	at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:80)
> 	at com.liaoyoujin.pdfbox.doc.PdfExtractor.getFirstImage(PdfExtractor.java:109)
> 	at com.liaoyoujin.pdfbox.doc.PdfExtractor$Job.run(PdfExtractor.java:178)
> 	at com.liaoyoujin.thread.pool.BlockThreadPool$Worker.run(BlockThreadPool.java:53)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> java.util.ConcurrentModificationException
> 	at java.util.Vector$Itr.checkForComodification(Vector.java:1156)
> 	at java.util.Vector$Itr.next(Vector.java:1133)
> 
> 
> 
> ------------------ Original message ------------------
> From: "Hesham G." <heshamgneady@gmail.com>
> Sent: Friday, 18 March 2016, 4:44 PM
> To: "users" <users@pdfbox.apache.org>
> 
> Subject: Re: Spaces are ignored when reading a PDF file
> 
> 
> 
>   John,
> 
> I think I have got the idea ... Thumbs up
> 
> 
> Best regards ,
> Hesham 
> 
> ------------------------------------------------------------------------
> Included message :
> 
> I’m rather confused by this thread. Inferring spaces is one of the main features
> of PDFTextStripper. I’m not sure why anyone is suggesting processing the text manually;
> there’s no need to do that. We do that already!
> 
> Looking at the original code the problem is right here:
> 
>> public class PDFTextStripperProcessor extends PDFTextStripper {
>>   @Override
>>   public void processTextPosition( TextPosition text  )  {
>>       System.out.println(  text.getCharacter() );
>>   }
>> }
> 
> The processTextPosition method is used to pass an unprocessed TextPosition *in* to PDFTextStripper,
> but this override prevents that from happening and just prints the unprocessed token
> before PDFTextStripper has had a chance to do its job, such as inferring the missing spaces.
> 
> You should follow our PrintTextLocations.java example, which shows you how to get the
> processed TextPositions from PDFTextStripper. It’s really easy to do.
> 
> — John
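
The point about the override can be illustrated with a toy model. The sketch below does not use PDFBox at all: ToyStripper, InputOverride, OutputOverride and the 2.0-unit gap threshold are all hypothetical names and values chosen for illustration. It only shows why overriding the input hook without calling the base class leaves the stripper with nothing to post-process, while overriding an output hook receives text with the inferred spaces. In PDFBox 2.0 the PrintTextLocations example takes the second route by overriding writeString(String, List<TextPosition>).

```java
import java.util.ArrayList;
import java.util.List;

/** Demo driver. Every class in this file is a toy model, NOT the PDFBox API. */
class ToyStripperDemo {
    /** Feeds the glyphs of "ab c" (no space glyph) and returns what the subclass saw. */
    static String run(ToyStripper stripper) {
        stripper.processTextPosition('a', 0.0, 5.0);
        stripper.processTextPosition('b', 5.0, 5.0);
        stripper.processTextPosition('c', 15.0, 5.0); // 5-unit gap before 'c'
        stripper.finish();
        return stripper.seen.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + run(new InputOverride()) + "]");  // prints "[abc]"
        System.out.println("[" + run(new OutputOverride()) + "]"); // prints "[ab c]"
    }
}

/** Toy stand-in for a text stripper with an input hook and an output hook. */
class ToyStripper {
    final StringBuilder seen = new StringBuilder();          // what a subclass observed
    private final StringBuilder chars = new StringBuilder(); // raw characters received
    private final List<double[]> boxes = new ArrayList<>();  // {x, width} per character

    /** Input hook: the engine pushes each raw glyph in here. */
    public void processTextPosition(char c, double x, double width) {
        chars.append(c);
        boxes.add(new double[] {x, width});
    }

    /** Output hook: receives the text after spaces have been inferred. */
    protected void writeString(String text) { }

    /** Post-processing: insert a space wherever the gap exceeds a hypothetical threshold. */
    public final void finish() {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < chars.length(); i++) {
            if (i > 0) {
                double[] prev = boxes.get(i - 1);
                double gap = boxes.get(i)[0] - (prev[0] + prev[1]);
                if (gap > 2.0) out.append(' ');
            }
            out.append(chars.charAt(i));
        }
        writeString(out.toString());
    }
}

/** Mirrors the problem in the thread: the glyphs never reach the base class. */
class InputOverride extends ToyStripper {
    @Override
    public void processTextPosition(char c, double x, double width) {
        seen.append(c); // no super call, so finish() has nothing to work on
    }
}

/** Mirrors the suggested fix: leave the input hook alone, override the output hook. */
class OutputOverride extends ToyStripper {
    @Override
    protected void writeString(String text) {
        seen.append(text);
    }
}
```

The first subclass sees only the raw characters; the second sees the text after the base class has done its work.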
> 
>> On 17 Mar 2016, at 04:44, Hesham G. <heshamgneady@gmail.com>  wrote:
>> 
>> Andreas,
>> 
>> You're absolutely right. I am testing it now, but it seems very complicated. I hope
>> there is an easier solution.
>> 
>> 
>> Best regards ,
>> Hesham
>> 
>> ------------------------------------------------------------------------
>> Included message :
>> 
>>> "Hesham G." <heshamgneady@gmail.com> wrote on 17 March 2016 at 11:20:
>>> 
>>> 
>>> Andreas,
>>> 
>>> That is very helpful.
>>> 
>>> I can get the x location of each character using TextPosition.getX(), e.g.:
>>> W: 102.88399
>>> i: 114.18165
>>> t: 117.660614
>>> h: 121.55801
>>> d: 133.09477
>>> u: 140.3994
>>> e: 147.60838
>>> 
>>> So to detect the space between the two words "With" and "due", should I subtract
>>> the X of the last letter (h) from the X of the first letter (d), and treat it as a
>>> space if the result is larger than normal? I think detecting spaces this way might
>>> be risky. What do you think?
>> That's the short story. Deciding what is normal can be quite tricky. You have
>> to take the following facts into account:
>> 
>> - different fonts have different widths (important if the font before the space
>> isn't the same as the font after the space)
>> - you have to take scaling, and sometimes rotation, into account
>> - the "space" between characters may vary if the text is justified
>> 
>> There are certainly some other details which may be important as well, so
>> you end up with a more or less heuristic approach.
>> 
>> BR
>> Andreas
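
A minimal sketch of the gap heuristic discussed above, in plain Java with no PDFBox dependency. The x values are the ones quoted in this thread; the glyph widths, the space width (3.0) and the tolerance (0.5) are hypothetical values picked for illustration, since the thread does not give them. Note that the raw x differences for W to i (about 11.3) and h to d (about 11.5) are nearly identical, which is why the glyph width has to be subtracted before comparing. The sketch ignores fonts changes, scaling, rotation and justification, which are exactly the complications listed above.

```java
import java.util.Arrays;
import java.util.List;

/** Gap-based space inference; a simplified sketch, not PDFBox's algorithm. */
class GapSpaceInference {

    /** One glyph: its character, left edge x, and advance width (same units). */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /** Inserts a space when the gap after a glyph exceeds tolerance * spaceWidth. */
    static String inferSpaces(List<Glyph> glyphs, double spaceWidth, double tolerance) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > spaceWidth * tolerance) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    /** x values come from the thread; every width below is a made-up assumption. */
    static List<Glyph> sample() {
        return Arrays.asList(
            new Glyph('W', 102.88399, 11.9),
            new Glyph('i', 114.18165, 3.4),
            new Glyph('t', 117.660614, 3.8),
            new Glyph('h', 121.55801, 6.9),
            new Glyph('d', 133.09477, 7.2),
            new Glyph('u', 140.3994, 7.1),
            new Glyph('e', 147.60838, 5.5));
    }

    public static void main(String[] args) {
        // Hypothetical space width 3.0 and tolerance 0.5.
        System.out.println(inferSpaces(sample(), 3.0, 0.5)); // prints "With due"
    }
}
```

With these assumed widths only the gap between "h" and "d" (about 4.6 units) exceeds the 1.5-unit cutoff; with different fonts or sizes the cutoff would have to be derived per text run.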
>> 
>>> Best regards ,
>>> Hesham
>>> 
>>> ------------------------------------------------------------------------
>>> Included message :
>>> 
>>> Hi,
>>> 
>>>> Frank van der Hulst <drifter.frank@gmail.com> wrote on 17 March 2016 at 08:34:
>>>> 
>>>> 
>>>> Spaces don't exist as characters in PDFs. To identify spaces, you have to
>>>> compare the X coordinates of adjacent characters against their widths.
>>> That's not correct. Spaces do exist, but in most cases PDF engines omit them and
>>> replace them with split text and appropriate positioning.
>>> 
>>> BTW, LaTeX uses the same strategy. Here is an excerpt from your PDF:
>>> 
>>>  [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 (Article)
>>>  -384 (\(219\),) -416 (the) -384 (competent) -383 (authority) -383 (has) -384
>>>  (the) -383 (right) ] TJ
>>> 
>>> The text is between the parentheses, and the numbers are used for horizontal
>>> positioning.
>>> 
>>> BR
>>> Andreas
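
To make the TJ excerpt concrete, here is a small plain-Java sketch that reads such an array and inserts a space wherever the adjustment number is strongly negative. In a TJ array the numbers are in thousandths of a text-space unit and are subtracted from the current position, so a value like -383 widens the gap while 55 or 18 is just kerning. The class name TjArrayReader and the cutoff of 200 are assumptions for illustration; this is not how PDFTextStripper works (it uses glyph positions), and the parser only handles literal strings with backslash-escaped characters, not hex strings or octal escapes.

```java
/** Reads a TJ array and treats large negative adjustments as word gaps. Sketch only. */
class TjArrayReader {

    /** The excerpt quoted in this thread, written as a Java string literal. */
    static final String SAMPLE =
        "[ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 (Article) "
        + "-384 (\\(219\\),) -416 (the) -384 (competent) -383 (authority) -383 (has) -384 "
        + "(the) -383 (right) ] TJ";

    /**
     * @param tjArray        well-formed text of a TJ array, e.g. "[ (W) 55 (ith) ] TJ"
     * @param spaceThreshold hypothetical cutoff; adjustments below -spaceThreshold add a space
     */
    static String toText(String tjArray, double spaceThreshold) {
        StringBuilder out = new StringBuilder();
        int n = tjArray.length();
        int i = 0;
        while (i < n) {
            char c = tjArray.charAt(i);
            if (c == '(') {
                // Literal string: copy up to the matching ')', honouring backslash escapes.
                i++;
                int depth = 1;
                while (i < n) {
                    char s = tjArray.charAt(i);
                    if (s == '\\' && i + 1 < n) {
                        out.append(tjArray.charAt(i + 1)); // copies the escaped char as-is
                        i += 2;
                        continue;
                    }
                    if (s == '(') {
                        depth++;
                    }
                    if (s == ')' && --depth == 0) {
                        i++;
                        break;
                    }
                    out.append(s);
                    i++;
                }
            } else if (c == '-' || c == '.' || Character.isDigit(c)) {
                // Positioning adjustment between two strings.
                int start = i;
                i++;
                while (i < n && (tjArray.charAt(i) == '.' || Character.isDigit(tjArray.charAt(i)))) {
                    i++;
                }
                double adjustment = Double.parseDouble(tjArray.substring(start, i));
                if (adjustment < -spaceThreshold) {
                    out.append(' ');
                }
            } else {
                i++; // brackets, whitespace, and the operator name
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toText(SAMPLE, 200.0));
        // prints "With due regard to Article (219), the competent authority has the right"
    }
}
```

A fixed cutoff works for this excerpt because the word gaps (-383 to -416) are far from the kerning values (18 and 55); justified text or other fonts may need a different rule.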
>>> 
>>>> 
>>>> On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. <heshamgneady@gmail.com> wrote:
>>>> 
>>>>> Hello,
>>>>> 
>>>>> I have a PDF file created using LaTeX. I am trying to read and print all
>>>>> letters in that file using PDFBox, but when doing this all spaces in that
>>>>> file are ignored. Here is the code I am using:
>>>>> PDPage page = (PDPage)allPages.get( 0 );
>>>>> PDStream contents = page.getContents();
>>>>> if ( contents != null ) {
>>>>>     PDFTextStripperProcessor pdfTextStripperProcessor = new PDFTextStripperProcessor();
>>>>>     pdfTextStripperProcessor.processStream( page, page.findResources(), contents.getStream() );
>>>>> }
>>>>> 
>>>>> public class PDFTextStripperProcessor extends  PDFTextStripper {
>>>>>    @Override
>>>>>    public void processTextPosition(  TextPosition text )  {
>>>>>         System.out.println( text.getCharacter() );
>>>>>    }
>>>>> }
>>>>> 
>>>>> You can test it with this one-page sample file:
>>>>> 
>>>>> https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
>>>>> 
>>>>> What is the cause of this issue please?
>>>>> 
>>>>> 
>>>>> Best regards ,
>>>>> Hesham
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>> For additional commands, e-mail:  users-help@pdfbox.apache.org
>>> 
>>> 

