lucene-dev mailing list archives

From "Spencer, Dave" <d...@lumos.com>
Subject RE: PDF parser for Lucene
Date Tue, 06 Nov 2001 23:02:28 GMT

Not knowing anything about PDF per se, I tried to dig into the "etymon
pj" lib mentioned below and write a PDF-to-text converter (so PDFs could
be indexed). I wasn't able to figure things out, but then saw a post
that gave the necessary clues, hence this code. I'll just insert it
below and suggest that maybe it goes into Lucene. I'm not totally happy
with the results, so maybe this will at least accelerate someone else's
development...

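It's used like this:
	File f = ...;	// the PDF to convert
	Reader r = new Pdf2Text( f).parse();

and then 'r' will read out something that's close to ASCII.

If you call read() in a loop, note that I get incomplete reads (fewer
chars than the buffer size), so keep reading until read() returns -1
rather than assuming each call fills the buffer.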
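
A minimal sketch of such a read loop, for illustration only. ReadFully
and readAll are made-up names, and a StringReader stands in for the
Reader that parse() returns so the snippet runs without the pj library:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class ReadFully {
    /** Drains the reader into a String, tolerating short reads. */
    static String readAll(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        // read() may return fewer chars than buf.length;
        // only -1 signals the end of the stream.
        while ((n = r.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: StringReader is a stand-in for Pdf2Text.parse().
        System.out.print(readAll(new StringReader("PAGE: 1\nsome text\n")));
    }
}
```

Appending exactly the number of chars each call returned keeps the
output independent of how the pipe happens to chunk the data.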

	 - Dave
 


package com.tropo.pdf;

import com.etymon.pj.*;
import com.etymon.pj.exception.*;
import com.etymon.pj.object.*;
import com.etymon.pj.object.pagemark.*;

import java.util.*;
import java.io.*;


/**
 * Modified conversion of PDF to ASCII Text from
 * <a href="http://franklin.oit.unc.edu/cgi-bin/lyris.pl?visit=pj-help&id=158714211">http://franklin.oit.unc.edu/cgi-bin/lyris.pl?visit=pj-help&id=158714211</a>.
 */
public final class Pdf2Text
	implements Runnable
{
	private Pdf myPdfDoc;
	private int currentPage = 0;
	private int prevPage = -1;
	private String lineBreak= "\n";
	private StringBuffer pdfBuff;
	private int cols = 14;



	/**
	 * Opens the named PDF file and rejects encrypted documents.
	 */
	public Pdf2Text( String PDFFileName)
		throws FileNotFoundException, IOException, PjException
	{
		myPdfDoc = new Pdf(PDFFileName);

		if (myPdfDoc.getEncryptDictionary() != null)
		{
			throw new PjException("File appears to be encrypted.");
		}
	}

	/**
	 * Convenience constructor for a File.
	 */
	public Pdf2Text( File f)
		throws FileNotFoundException, IOException, PjException
	{
		this( f.getCanonicalPath()); // limitation in Pdf
	}


	/**
	 * Parses the PDF file on a background thread.
	 *
	 * @return a Reader that delivers the extracted text through a pipe
	 */
	public Reader parse()
		throws InvalidPdfObjectException,
			   PdfFormatException,
			   IOException
	{
		reader = new PipedReader();
		writer = new PipedWriter( reader);
		new Thread( this).start();
		return reader;
	}
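	/**
	 * Walks every page and writes its text to the pipe; runs on the
	 * thread started by parse().
	 */
	public void run()
	{
		try
		{
			pdfBuff = new StringBuffer();

			int pagecount = myPdfDoc.getPageCount();

			for (int i = 1; i <= pagecount; i++)
			{
				currentPage = i;

				int pageref = myPdfDoc.getPage(i);

				PjPage page = (PjPage) myPdfDoc.getObject(pageref);

				PjObject pagecont = page.getContents();

				// page contents are either an array or a single reference
				if (pagecont instanceof PjArray)
					processCont((PjArray) pagecont);
				else if (pagecont instanceof PjReference)
					processCont((PjReference) pagecont);
			}
		}
		catch( Throwable t)
		{
			// don't fail silently: the reader would otherwise just
			// see an empty or truncated stream
			t.printStackTrace();
		}
		finally
		{
			try
			{
				writer.close();
			}
			catch( IOException ioe)
			{
				// nothing useful to do if the pipe is already broken
			}
		}
	}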

	/**
	 * Decodes one content stream and writes the text found in it
	 * to the pipe.
	 */
	public void printStream(PjStream Stream)
		throws PdfFormatException, InvalidPdfObjectException
	{
		StreamParser sp = new StreamParser();

		PjStream stream = Stream.flateDecompress().ascii85Decode();
		try
		{
			byte[] bytes = stream.getBuffer();
			String st = new String(bytes); // platform default charset

			/** split in lines and clean up a bit */
			StringTokenizer tok = new StringTokenizer(st,"\n");
			//StringBuffer outbuff = new StringBuffer(st.length());
			if (prevPage != currentPage)
			{
				writer.write("PAGE: " + currentPage + lineBreak);
				prevPage = currentPage;
			}
			while (tok.hasMoreTokens())
			{
				String line = tok.nextToken();

				if (line.indexOf("[") >= 0)
				{
					// keep only the text between parentheses;
					// tokens alternate outside/inside
					StringTokenizer t = new StringTokenizer(line, "()");
					if (t.countTokens() > 0)
					{
						boolean b = false;
						while (t.hasMoreTokens())
						{
							String token = t.nextToken();
							if (b)
							{
								writer.write(token);
								b = false;
							}
							else
								b = true;
						}
					}
					else
						writer.write(line);
				}
				else
					writer.write(line);

				writer.write(lineBreak);
			}
		}
		catch (Exception ese)
		{
			ese.printStackTrace();
		}
	}

	/**
	 * Processes every stream or stream reference in a contents array.
	 */
	public void processCont(PjArray array)
		throws PdfFormatException, InvalidPdfObjectException
	{

		Vector vec = array.getVector();
		for (int i = 0; i < vec.size(); i++)
		{

			PjObject obj = (PjObject) vec.get(i);

			if (obj instanceof PjReference)
			{
				PjReference ref = (PjReference) obj;
				processCont(ref);
			}
			if (obj instanceof PjStream)
				printStream((PjStream) obj);
		}

	}

	/**
	 * Resolves a reference and processes it if it points at a stream.
	 */
	public void processCont(PjReference ref)
		throws PdfFormatException, InvalidPdfObjectException
	{

		int refnum = ref.getObjNumber().getInt();

		PjObject obj = myPdfDoc.getObject(refnum);

		if (obj instanceof PjStream)
			printStream((PjStream) obj);
	}


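	/**
	 * Command-line test: dumps the text of the PDF named in args[0].
	 */
	public static void main(String args[])
	{
		if (args.length < 1)
		{
			System.err.println("usage: Pdf2Text file.pdf");
			return;
		}
		String pdfFile = args[0];

		try
		{
			Pdf2Text pdf = new Pdf2Text(pdfFile);

			Reader r = pdf.parse();

			char[] buf = new char[ 1024];
			int nr;
			// read() may return fewer than buf.length chars; -1 means end of stream
			while ( (nr = r.read( buf)) != -1)
			{
				// print, not println: chunk boundaries are arbitrary
				o.print( new String( buf, 0, nr));
				//o.println( "READ " + nr);
			}
			o.flush();
		}
		catch(Exception e)
		{
			e.printStackTrace();
		}
	}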

	private static final PrintStream o = System.out;
	
	private PipedReader reader;
	private PipedWriter writer;

}
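
The extraction step in printStream boils down to keeping whatever sits
between '(' and ')' on a line. Below is a standalone sketch of that idea
for anyone who wants to experiment without the pj library. It scans
characters instead of using StringTokenizer, ParenText and extract are
made-up names, the sample line is only an illustration, and escaped or
nested parentheses are not handled, so treat it as a toy and not as an
exact copy of what the class above does:

```java
final class ParenText {
    /** Returns the text found between parentheses on one line. */
    static String extract(String line) {
        StringBuilder out = new StringBuilder();
        boolean inside = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (!inside && c == '(') {
                inside = true;          // string starts
            } else if (inside && c == ')') {
                inside = false;         // string ends
            } else if (inside) {
                out.append(c);          // keep string content
            }
            // everything outside parentheses (operators, kerning) is dropped
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed sample: a text-showing line from an uncompressed stream.
        System.out.println(extract("[(Hel)-20(lo)] TJ")); // prints Hello
    }
}
```

Unlike the tokenizer version, this does not care whether the line starts
with a parenthesis.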





-----Original Message-----
From: Nestel, Frank [mailto:frank.nestel@coi.de]
Sent: Tuesday, November 06, 2001 1:25 AM
To: 'Lucene Developers List'
Subject: AW: PDF parser for Lucene



Websearch does a very quick and very dirty job, searching more
or less heuristically for text in a PDF.
It fails for UTF-8 encoded fields and for all kinds of text
that is not where the heuristics expect it.
On the other hand, it is surprising how far you get with such
a simple method.

There is a free library at http://www.etymon.com/pj/.
It seems to contain minor problems, but is fairly robust. However,
it uses lots of CPU and memory, since it builds a proper internal
representation of a PDF. It can also manipulate PDFs.

The problem is that the PDF format does not store the encoding of
certain fields at all. We even had test PDFs where
the tools provided by Adobe failed to extract the complete
textual content. For me this was rather disappointing about
the PDF format. But of course it is already out there, so one has
to handle it ...

We implemented both of the above libraries but are using the one from
Websearch right now. This leaves some documents barely
findable.

Sigh,
Frank



> -----Original Message-----
> From: Paulo Gaspar [mailto:paulo.gaspar@krankikom.de]
> Sent: Friday, November 2, 2001 18:26
> To: Lucene Developers List
> Subject: RE: PDF parser for Lucene
> 
> You could try this one:
>   http://www.i2a.com/websearch/
> 
> ...and then tell me how it works for you.
> =:o)
> 
> 
> Anyway, it is simple and Open Source.
> 
> 
> Have fun,
> Paulo Gaspar
> 
> http://www.krankikom.de
> http://www.ruhronline.de
> 
> 
> 
> > -----Original Message-----
> > From: Benoît Doret [mailto:benoit.doret@infodesign.com]
> > Sent: Friday, November 02, 2001 5:00 PM
> > To: lucene-dev@jakarta.apache.org
> > Subject: PDF parser for Lucene
> >
> >
> > Hello,
> >
> > Does a fully integrated PDF parser already exist for Lucene?
> > (some PDFDocument and PDFParser classes would be great!)
> > Does someone sell one, or is one released as open source?
> > If not, has anyone already used an external Java library to
> > parse PDF files, and which one?
> >
> > Thanks in advance,
> > Benoît Doret.
> >
> >
> > --
> > To unsubscribe, e-mail:
> > <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> > For additional commands, e-mail:
> > <mailto:lucene-dev-help@jakarta.apache.org>
> >

