pdfbox-users mailing list archives

From Pierre Dubillot <alexcou...@gmail.com>
Subject Re: Extraction problems with PDFTextStripperByArea
Date Thu, 23 Jul 2015 08:13:45 GMT
Hi, I'm still having the same issue. Do you need more details?

Best regards,

Pierre

2015-07-20 18:05 GMT+02:00 Pierre Dubillot <alexcouter@gmail.com>:

> Yes, the method is called from another routine, which loops over each
> page and passes a different page number each time.
>
> Console output :
>
>> Page : 0 -
>> [MARJUIN2]
>> Page : 1 -
>> [MARJUIN2]
>> Page : 2 -
>> [MARJUIN2]
>> Page : 3 -
>> [MARJUIN2]
>> Page : 4 -
>> [MARJUIN2]
>> Page : 5 -
>> [MARJUIN2]
>> Page : 6 -
>> [MARJUIN2]
>
>
> But it should be:
>
>> [MERMAI27]
>> [...]
>> [MARJUIN2]
>
>
>
>
> 2015-07-20 17:58 GMT+02:00 Gilad Denneboom <gilad.denneboom@gmail.com>:
>
>> So are you calling this method each time with a different page number?
>> What is the output to the console?
>>
>> On Mon, Jul 20, 2015 at 5:37 PM, Pierre Dubillot <alexcouter@gmail.com>
>> wrote:
>>
>> > I wrote a short example that shows the same problem:
>> >
>> >
>> > PDFTextStripperByArea cellules;
>> >
>> > public ArrayList<String> scanCells(PDDocument pdf, int pagenum) throws IOException {
>> >     ArrayList<String> columnData = new ArrayList<String>();
>> >     cellules.extractRegions(pdf.getPage(pagenum));
>> >     cellules.setStartPage(pagenum);
>> >     System.out.println("Page : " + cellules.getStartPage() + " - " + cellules.getEndPage());
>> >     cellules.setSortByPosition(true);
>> >     for (String region : cellules.getRegions()) {
>> >         // PDDocument pdoc = new PDDocument();
>> >         // pdoc.addPage(page);
>> >         // pdoc.save(new File("D:/Stage_DUT/pdfs/testing/" + region + ".pdf"));
>> >         // pdoc.close();
>> >         // System.out.println(cellules.getTextForRegion(region));
>> >         columnData.add(cellules.getTextForRegion(region));
>> >         System.out.println(columnData);
>> >     }
>> >     return columnData;
>> > }
>> >
>> > Each region name is different (0page0, 0page1, 0page2, etc.; '0' is the
>> > column number for the current page), and the method is called inside the
>> > column object (which is created 7 times for 7 pages, in order to adjust the
>> > position of each column).
>> >
>> > 2015-07-20 11:30 GMT+02:00 Pierre Dubillot <alexcouter@gmail.com>:
>> >
>> > > Hi,
>> > > I'm about to finalize my project (which extracts areas of text), but
>> > > here's my problem:
>> > >
>> > > I've got a document (
>> > > http://www.cinemas-utopia.org/admin/grilles/toulouse/2015-07-21.pdf).
>> > > Then, I split it into 7 pages (
>> > > http://www.docdroid.net/FedKhgp/2cac3a9c-b654-41c3-aea2-b473ccbeb06b.pdf.html
>> > > ).
>> > > In an XML file, I define which areas I have to extract (for example, the
>> > > current date of the page).
>> > > I create my objects, then I run the extraction, but it only extracts
>> > > the last page:
>> > > "[MARJUIN2, MARJUIN2, MARJUIN2, MARJUIN2, MARJUIN2, MARJUIN2, MARJUIN2]"
>> > > (the list of results after extracting 7 pages).
>> > >
>> > > I have to explain my code a little. First, I built a local app using
>> > > 1.8.9, then I switched to a web app and used 2.0.0, so I changed my code a
>> > > little, and everything seems to have stopped working.
>> > >
>> > > There are 3 main objects involved in extraction: Line or Column, and Cell.
>> > > A line/column can contain cells.
>> > >
>> > > public Cell(String zoneName, double x, double y, int largeur, int hauteur)
>> > >         throws IOException {
>> > >     super();
>> > >     [this ...]
>> > >     uneCellule = new PDFTextStripperByArea();
>> > >     uneCellule.addRegion(nomZone, new Rectangle2D.Double(x, y, largeur, hauteur));
>> > > }
>> > >
>> > >
>> > > And, for example, a column will create attached cells.
>> > >
>> > >
>> > >
>> > > public Colonne(
>> > >         int index,
>> > >         String columnName,
>> > >         double x,
>> > >         double y, double hauteurCellule, double hauteurColonne,
>> > >         double largeurColonne, double decalage) throws IOException
>> > > {
>> > >     [...]
>> > >     for (double i = y; i < (hauteurColonne + y); i += (hauteurCellule + decalage)) {
>> > >         cells.add(
>> > >             new Cell(
>> > >                 nameOfCells.get(compteur),
>> > >                 x,
>> > >                 i,
>> > >                 (int) largeurColonne,
>> > >                 (int) hauteurCellule
>> > >             )
>> > >         );
>> > >         compteur++;
>> > >     }
>> > > }
>> > >
>> > >
>> > > And my code to extract text (in my column object) :
>> > >>
>> > > public ArrayList<String> scanCells(PDPage page) throws IOException {
>> > >     ArrayList<String> columnData = new ArrayList<String>();
>> > >     for (Cell cell : cells) {
>> > >         cell.extractRegion(page);
>> > >         try {
>> > >             columnData.add(getTextOfCell(cell));
>> > >         } catch (ParseException e) {
>> > >             e.printStackTrace();
>> > >         }
>> > >     }
>> > >     return columnData;
>> > > }
>> > >
>> > >
>> > > My documents are loaded correctly and each page is in memory, but I
>> > > can't work out on my own how to solve this; it's a strange problem. If
>> > > you have any idea how I could update my code, please let me know.
>> > >
>> > > Thanks,
>> > >
>> > >
>> >
>>
>
>
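
A side note on the quoted Colonne constructor: the loop that places the cells can be tried on its own, without PDFBox, to check that the rectangles land where expected before any text extraction happens. The following is only a minimal sketch under assumptions. The class name ColumnLayout, the method cellRectangles, and the sample numbers are hypothetical and do not come from the thread; only the stepping rule (start at y, advance by cell height plus offset, stop at y plus column height) and the int casts mirror the quoted code.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

class ColumnLayout {

    // Hypothetical helper (not from the thread): returns one rectangle per cell,
    // stepping down the column by (cellHeight + gap) as the quoted loop does.
    static List<Rectangle2D> cellRectangles(double x, double y,
                                            double cellHeight, double columnHeight,
                                            double columnWidth, double gap) {
        double step = cellHeight + gap;
        if (step <= 0) {
            // Guard added for this sketch: a non-positive step would never terminate.
            throw new IllegalArgumentException("cellHeight + gap must be positive");
        }
        List<Rectangle2D> cells = new ArrayList<>();
        for (double i = y; i < columnHeight + y; i += step) {
            // Width and height are truncated to int, as in the quoted Cell constructor call.
            cells.add(new Rectangle2D.Double(x, i, (int) columnWidth, (int) cellHeight));
        }
        return cells;
    }

    public static void main(String[] args) {
        // Sample values chosen for illustration only.
        for (Rectangle2D r : cellRectangles(50, 100, 20, 100, 80, 5)) {
            System.out.println(r);
        }
    }
}
```

If the number of rectangles printed here differs from the number of region names available, the nameOfCells.get(compteur) call in the quoted constructor would presumably run out of names, so comparing the two counts may be a cheap first check.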
