pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Extraction problems with PDFTextStripperByArea
Date Mon, 20 Jul 2015 15:59:02 GMT
Hi,

This is all very confusing, partly because of trouble with your mail
software. I think we had this before: parts of the code simply vanished.

I looked at the file with PDFDebugger. The MediaBox differs between the
pages. Maybe that is the cause?
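
A quick way to check this outside the debugger is to compare the page
boxes programmatically. The sketch below is only an illustration: it
uses java.awt.geom.Rectangle2D as a stand-in for the page box so that it
runs without PDFBox, and the box values in main are made up, not read
from the file in question. With PDFBox 2.0 the real values would come
from PDPage.getMediaBox(). The class and method names are hypothetical.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

class MediaBoxCheck {

    /** Returns the 0-based indices of pages whose box differs from page 0 by more than tol. */
    static List<Integer> pagesDifferingFromFirst(List<Rectangle2D> boxes, double tol) {
        List<Integer> differing = new ArrayList<>();
        if (boxes.isEmpty()) {
            return differing;
        }
        Rectangle2D first = boxes.get(0);
        for (int i = 1; i < boxes.size(); i++) {
            Rectangle2D b = boxes.get(i);
            // Compare origin and size; any mismatch means a fixed region may land elsewhere.
            boolean same = Math.abs(b.getX() - first.getX()) <= tol
                    && Math.abs(b.getY() - first.getY()) <= tol
                    && Math.abs(b.getWidth() - first.getWidth()) <= tol
                    && Math.abs(b.getHeight() - first.getHeight()) <= tol;
            if (!same) {
                differing.add(i);
            }
        }
        return differing;
    }

    public static void main(String[] args) {
        // Made-up boxes for illustration only (two portrait pages, one landscape page).
        List<Rectangle2D> boxes = List.of(
                new Rectangle2D.Double(0, 0, 595.28, 841.89),
                new Rectangle2D.Double(0, 0, 595.28, 841.89),
                new Rectangle2D.Double(0, 0, 841.89, 595.28));
        System.out.println(pagesDifferingFromFirst(boxes, 0.01)); // prints [2]
    }
}
```

If some pages are reported as different, the same region rectangle may
not cover the same content on those pages.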

Tilman


On 20.07.2015 at 17:37, Pierre Dubillot wrote:
> I made a short example that shows the same problem:
>
>
> PDFTextStripperByArea cellules;
>
> public ArrayList<String> scanCells(PDDocument pdf, int pagenum)
>         throws IOException {
>     ArrayList<String> columnData = new ArrayList<String>();
>     cellules.extractRegions(pdf.getPage(pagenum));
>     cellules.setStartPage(pagenum);
>     System.out.println("Page : " + cellules.getStartPage() + " - "
>             + cellules.getEndPage());
>     cellules.setSortByPosition(true);
>     for (String region : cellules.getRegions()) {
>         // PDDocument pdoc = new PDDocument();
>         // pdoc.addPage(page);
>         // pdoc.save(new File("D:/Stage_DUT/pdfs/testing/" + region + ".pdf"));
>         // pdoc.close();
>         // System.out.println(cellules.getTextForRegion(region));
>         columnData.add(cellules.getTextForRegion(region));
>         System.out.println(columnData);
>     }
>     return columnData;
> }
>
> Each region name is different (0page0, 0page1, 0page2, etc.; '0' is the
> column number for the current page), and the method is called inside the
> column object (which is created 7 times, once per page, in order to adjust
> the position of each column).
>
> 2015-07-20 11:30 GMT+02:00 Pierre Dubillot <alexcouter@gmail.com>:
>
>> Hi,
>> I'm about to finalize my project (which will extract areas of text), but
>> here's my problem:
>>
>> I've got a document (
>> http://www.cinemas-utopia.org/admin/grilles/toulouse/2015-07-21.pdf).
>> Then, I split it into 7 pages (
>> http://www.docdroid.net/FedKhgp/2cac3a9c-b654-41c3-aea2-b473ccbeb06b.pdf.html
>> ).
>> In an XML file, I define which areas I have to extract (for example, the
>> current date of the page).
>> I create my objects, then I run the extraction, but it only extracts
>> the last page:
>> "[MARJUIN2, MARJUIN2, MARJUIN2, MARJUIN2, MARJUIN2, MARJUIN2, MARJUIN2]"
>> (the list of results after extracting all 7 pages).
>>
>> I have to explain my code a little. First, I wrote a local app using
>> 1.8.9, then I switched to a web app and 2.0.0, so I changed my code a
>> little, and everything seems to have stopped working.
>>
>> There are 3 main objects involved in extraction: Line, Column, and Cell.
>> A line/column can contain cells.
>>
>> public Cell(String zoneName, double x, double y, int largeur, int hauteur)
>>         throws IOException {
>>     super();
>>     [this ...]
>>     uneCellule = new PDFTextStripperByArea();
>>     uneCellule.addRegion(nomZone,
>>             new Rectangle2D.Double(x, y, largeur, hauteur));
>> }
>>
>> And, for example, a column creates its attached cells:
>>
>>
>>
>> public Colonne(int index, String columnName, double x, double y,
>>         double hauteurCellule, double hauteurColonne,
>>         double largeurColonne, double decalage) throws IOException {
>>     [...]
>>     for (double i = y; i < (hauteurColonne + y);
>>             i += (hauteurCellule + decalage)) {
>>         cells.add(new Cell(nameOfCells.get(compteur), x, i,
>>                 (int) largeurColonne, (int) hauteurCellule));
>>         compteur++;
>>     }
>> }
>>
>> And my code to extract text (in my column object):
>>
>> public ArrayList<String> scanCells(PDPage page) throws IOException {
>>     ArrayList<String> columnData = new ArrayList<String>();
>>     for (Cell cell : cells) {
>>         cell.extractRegion(page);
>>         try {
>>             columnData.add(getTextOfCell(cell));
>>         } catch (ParseException e) {
>>             e.printStackTrace();
>>         }
>>     }
>>     return columnData;
>> }
>>
>> My documents are loaded correctly and each page is in memory. I can't
>> work out on my own how to solve this; it's a strange problem. Do you have
>> any idea how I could update my code?
>>
>> Thanks,
>>
>>

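For reference, the cell layout loop from the quoted Colonne constructor
can be tried on its own, which makes it easy to print the rectangles and
compare them with the page boxes. This is a minimal sketch, not the
poster's actual classes: ColumnGrid and cellRegions are hypothetical
names, the numbers in main are invented, and it keeps the width and
height as doubles instead of casting them to int as the quoted code does.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

class ColumnGrid {

    /**
     * Builds one rectangle per cell, stepping down the column by
     * cellHeight + gap, as in the quoted loop.
     */
    static List<Rectangle2D> cellRegions(double x, double y, double cellHeight,
            double columnHeight, double columnWidth, double gap) {
        if (cellHeight + gap <= 0) {
            // A non-positive step would make the loop below run forever.
            throw new IllegalArgumentException("cellHeight + gap must be positive");
        }
        List<Rectangle2D> regions = new ArrayList<>();
        for (double top = y; top < y + columnHeight; top += cellHeight + gap) {
            regions.add(new Rectangle2D.Double(x, top, columnWidth, cellHeight));
        }
        return regions;
    }

    public static void main(String[] args) {
        // Invented layout: column at (50, 100), 100 high, 80 wide, cells 20 high, gap 5.
        for (Rectangle2D r : cellRegions(50, 100, 20, 100, 80, 5)) {
            System.out.println(r);
        }
    }
}
```

With these invented values the loop yields four cells, starting at
y = 100, 125, 150 and 175. Note that the loop only checks the top edge
of each cell, so with other values the last cell can extend past the
bottom of the column.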
