lucene-java-user mailing list archives

From "Adriano Labate" <lab...@verticali.com>
Subject RE : Parsers
Date Wed, 28 May 2003 13:25:38 GMT
Pete,

Here are some samples.

For Word using Textmining:
	String textContent = new WordExtractor().extractText(inputStream);

For PDF using Textmining:
	String textContent = new PDFExtractor().extractText(inputStream);

For Excel using POI:
(From http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-user@jakarta.apache.org&msgId=698633)

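    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Iterator;
    import org.apache.poi.hssf.usermodel.HSSFCell;
    import org.apache.poi.hssf.usermodel.HSSFRow;
    import org.apache.poi.hssf.usermodel.HSSFSheet;
    import org.apache.poi.hssf.usermodel.HSSFWorkbook;

    /**
     * Extract text from a Microsoft Excel input stream.
     * @param inputStream the stream to read the workbook from
     * @return The raw text obtained by concatenating all text cells
     *         from top to bottom, left to right.
     * @throws IOException if the workbook cannot be read
     */
    private static String extractExcelContent(InputStream inputStream)
            throws IOException {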
        HSSFWorkbook wb = new HSSFWorkbook(inputStream);
        int nbSheets = wb.getNumberOfSheets();
        StringBuffer content = new StringBuffer(1024);

        for (int i = 0; i < nbSheets; i++) {
            HSSFSheet sheet = wb.getSheetAt(i);
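            // getLastRowNum() returns the 0-based index of the last row,
            // so the loop bound must include it.
            int lastRowNum = sheet.getLastRowNum();

            for (int j = 0; j <= lastRowNum; j++) {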
                HSSFRow row = sheet.getRow(j);
                if (row == null)    // empty row
                    continue;

                boolean isLineFound = false;
                Iterator it = row.cellIterator();
                while (it.hasNext()) {
                    HSSFCell cell = (HSSFCell)it.next();
                    int type = cell.getCellType();

                    if (type == HSSFCell.CELL_TYPE_STRING) {
                        content.append(cell.getStringCellValue());
                        content.append(" ");
                        isLineFound = true;
                    }
                }

                if (isLineFound)
                    content.append("\n");       // separate lines/rows
            }
        }

        return content.toString();
    }
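
To tie the three samples together, one possible approach is to pick an extractor by file extension. The following is only a minimal sketch: `TextExtractor`, `ExtractorRegistry` and `ExtractorDemo` are hypothetical names that exist in neither Textmining nor POI, and the registered extractor is a stand-in lambda rather than a real parser.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical interface; not part of Textmining or POI. */
interface TextExtractor {
    String extractText(InputStream in) throws IOException;
}

/** Hypothetical registry that maps file extensions to extractors. */
final class ExtractorRegistry {
    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    /** Registers an extractor for an extension such as "xls" (case-insensitive). */
    void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Returns the lower-cased extension without the dot, or "" if there is none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Returns the extracted text, or null when no extractor is registered. */
    String extract(String fileName, InputStream in) throws IOException {
        TextExtractor extractor = byExtension.get(extensionOf(fileName));
        return extractor == null ? null : extractor.extractText(in);
    }
}

final class ExtractorDemo {
    public static void main(String[] args) throws IOException {
        ExtractorRegistry registry = new ExtractorRegistry();
        // Stand-in extractor; real code would delegate to a method
        // such as extractExcelContent(stream) instead.
        registry.register("xls", stream -> "text from a spreadsheet");

        InputStream empty = new ByteArrayInputStream(new byte[0]);
        System.out.println(registry.extract("Report.XLS", empty));
        // No extractor is registered for .ppt, so this yields null.
        System.out.println(registry.extract("slides.ppt", empty));
    }
}
```

Returning null for an unknown extension lets the caller decide whether to skip the file or index only its name; throwing an exception would be an equally reasonable choice.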

Adriano


-----Original Message-----
From: Pete Lewis [mailto:pete@uptima.co.uk]
Sent: Wednesday, 28 May 2003 15:02
To: Lucene Users List
Subject: Re: Parsers


Hi Adriano

Thanks.  Code samples would be nice :)

Will come back if I find something for .ppt.

Pete

----- Original Message -----
From: "Adriano Labate" <labate@verticali.com>
To: "'Lucene Users List'" <lucene-user@jakarta.apache.org>
Sent: Wednesday, May 28, 2003 1:03 PM
Subject: RE : Parsers


The www.textmining.org text extractors work very well for Word and PDF
documents. They use both PDFBox and POI.

For Excel, using POI directly is very easy. Tell me if you want to see
code samples.

I'm looking for a PowerPoint text extractor myself, if you know of one...

Adriano Labate


-----Original Message-----
From: Pete Lewis [mailto:pete@uptima.co.uk]
Sent: Wednesday, 28 May 2003 12:48
To: Lucene Users List
Subject: Parsers


Hi all,

I have a rather nice HTML parser that I got from SourceForge.  Does
anyone know of any good parsers for PDF and the Microsoft Office suite
(.doc, .ppt, .xls, etc.)?  Any help would be much appreciated.

Pete Lewis




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


