pdfbox-users mailing list archives

From Pierre Dubillot <alexcou...@gmail.com>
Subject Re: Extraction problems with PDFTextStripperByArea
Date Mon, 20 Jul 2015 16:05:06 GMT
Yes, the method is called from another routine that loops over each page,
and it receives a different page number on each call.

Console output:

> Page : 0 -
> [MARJUIN2]
> Page : 1 -
> [MARJUIN2]
> Page : 2 -
> [MARJUIN2]
> Page : 3 -
> [MARJUIN2]
> Page : 4 -
> [MARJUIN2]
> Page : 5 -
> [MARJUIN2]
> Page : 6 -
> [MARJUIN2]


But it should be:

> [MERMAI27]
> [...]
> [MARJUIN2]




2015-07-20 17:58 GMT+02:00 Gilad Denneboom <gilad.denneboom@gmail.com>:

> So are you calling this method each time with a different page number?
> What is the output to the console?
>
> On Mon, Jul 20, 2015 at 5:37 PM, Pierre Dubillot <alexcouter@gmail.com>
> wrote:
>
> > I made a short example that shows the same problem:
> >
> >
> > PDFTextStripperByArea cellules;
> >
> > public ArrayList<String> scanCells(PDDocument pdf, int pagenum) throws IOException {
> >     ArrayList<String> columnData = new ArrayList<String>();
> >     cellules.extractRegions(pdf.getPage(pagenum));
> >     cellules.setStartPage(pagenum);
> >     System.out.println("Page : " + cellules.getStartPage() + " - " + cellules.getEndPage());
> >     cellules.setSortByPosition(true);
> >     for (String region : cellules.getRegions()) {
> >         // PDDocument pdoc = new PDDocument();
> >         // pdoc.addPage(page);
> >         // pdoc.save(new File("D:/Stage_DUT/pdfs/testing/" + region + ".pdf"));
> >         // pdoc.close();
> >         // System.out.println(cellules.getTextForRegion(region));
> >         columnData.add(cellules.getTextForRegion(region));
> >         System.out.println(columnData);
> >     }
> >     return columnData;
> > }
> >
> > Each region name is different (0page0, 0page1, 0page2, etc.; the '0' is the
> > column number for the current page), and the method is called inside the
> > column object (which is created 7 times for the 7 pages, in order to adjust
> > the position of each column).
> >
> > 2015-07-20 11:30 GMT+02:00 Pierre Dubillot <alexcouter@gmail.com>:
> >
> > > Hi,
> > > I'm about to finalize my project (which extracts areas of text), but
> > > here's my problem:
> > >
> > > I've got a document
> > > (http://www.cinemas-utopia.org/admin/grilles/toulouse/2015-07-21.pdf).
> > > Then I split it into 7 pages
> > > (http://www.docdroid.net/FedKhgp/2cac3a9c-b654-41c3-aea2-b473ccbeb06b.pdf.html).
> > > In an XML file, I define which areas I have to extract (for example, the
> > > current date of the page).
> > > I create my objects, then I run the extraction, but it only extracts
> > > the last page:
> > > "[MARJUIN2, MARJUIN2, MARJUIN2, MARJUIN2, MARJUIN2, MARJUIN2, MARJUIN2]"
> > > (the list of results after extracting 7 pages).
> > >
> > > I have to explain my code a little. First I wrote a local app using
> > > 1.8.9, then I switched to a web app and used 2.0.0, so I changed my code a
> > > little, and everything seems to have stopped working.
> > >
> > > There are 3 main objects involved in the extraction: Line or Column, and Cell.
> > > A line/column can contain cells.
> > >
> > > public Cell(String zoneName, double x, double y, int largeur, int hauteur)
> > >         throws IOException {
> > >     super();
> > >     [this ...]
> > >     uneCellule = new PDFTextStripperByArea();
> > >     uneCellule.addRegion(nomZone, new Rectangle2D.Double(x, y, largeur, hauteur));
> > > }
> > >
> > >
> > > And, for example, a column creates its attached cells:
> > >
> > > public Colonne(
> > >         int index,
> > >         String columnName,
> > >         double x,
> > >         double y,
> > >         double hauteurCellule,
> > >         double hauteurColonne,
> > >         double largeurColonne,
> > >         double decalage) throws IOException
> > > {
> > >     [...]
> > >     for (double i = y; i < (hauteurColonne + y); i += (hauteurCellule + decalage)) {
> > >         cells.add(
> > >             new Cell(
> > >                 nameOfCells.get(compteur),
> > >                 x,
> > >                 i,
> > >                 (int) largeurColonne,
> > >                 (int) hauteurCellule
> > >             )
> > >         );
> > >         compteur++;
> > >     }
> > > }
> > >
> > >
> > > And my code to extract text (in my column object):
> > >
> > > public ArrayList<String> scanCells(PDPage page) throws IOException {
> > >     ArrayList<String> columnData = new ArrayList<String>();
> > >     for (Cell cell : cells) {
> > >         cell.extractRegion(page);
> > >         try {
> > >             columnData.add(getTextOfCell(cell));
> > >         } catch (ParseException e) {
> > >             e.printStackTrace();
> > >         }
> > >     }
> > >     return columnData;
> > > }
> > >
> > >
> > > My documents are loaded correctly and each page is in memory. I can't
> > > work out on my own how to solve this; it's a strange problem. If you have
> > > any idea how I could update my code, please let me know.
> > >
> > > Thanks,
> > >
> > >
> >
>
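
For readers following the thread: the loop in the quoted Colonne constructor steps down the column from y in increments of hauteurCellule + decalage and creates one cell per step. The sketch below reproduces only that geometry, so the region rectangles can be inspected without PDFBox. The class name CellLayoutSketch, the helper cellRectangles and the numbers in main are made up for illustration; it returns plain java.awt.geom rectangles rather than the poster's Cell objects, and it does not address the repeated-text problem itself.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class CellLayoutSketch {

    // Hypothetical helper: same loop bounds and step as the quoted Colonne
    // constructor, but collecting rectangles instead of creating Cell objects.
    public static List<Rectangle2D.Double> cellRectangles(
            double x, double y,
            double hauteurCellule, double hauteurColonne,
            double largeurColonne, double decalage) {
        List<Rectangle2D.Double> rects = new ArrayList<>();
        // Assumes hauteurCellule + decalage > 0; otherwise the loop never ends.
        for (double i = y; i < hauteurColonne + y; i += hauteurCellule + decalage) {
            // Same casts as the quoted code: width and height are truncated to int.
            rects.add(new Rectangle2D.Double(x, i, (int) largeurColonne, (int) hauteurCellule));
        }
        return rects;
    }

    public static void main(String[] args) {
        // Made-up values: a column 100 units tall starting at y = 50,
        // with cells 20 units tall separated by a 5-unit gap.
        for (Rectangle2D.Double r : cellRectangles(10, 50, 20, 100, 80, 5)) {
            System.out.println(r);
        }
    }
}
```

With these made-up values the loop yields four rectangles, at y = 50, 75, 100 and 125. Printing the rectangles this way is one quick check that each region passed to addRegion covers the intended part of the page.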
