hadoop-mapreduce-user mailing list archives

From yeshwanth kumar <yeshwant...@gmail.com>
Subject tika parser is not parsing the BytesWritable in mapreduce
Date Wed, 11 Jun 2014 10:48:41 GMT
I am writing a MapReduce job that takes a zip file as input. The zip file
contains different types of documents, such as docx, odt, pdf, and txt.

I am using the Tika parser to parse the documents.

Here is the code snippet of my mapper method:

public void map(Text key, BytesWritable value, Context context)
        throws IOException, InterruptedException {
    ------------------------------
    ------------------------------
        logger.info("Length:\t" + value.getLength());
        byte[] bytesbefore = value.getBytes();
        logger.info("CONTENT BEFORE" + new String(bytesbefore));
        InputStream in = new ByteArrayInputStream(bytesbefore);
        Metadata metadata = new Metadata();
        String mimeType = new Tika().detect(in);
        metadata.set(Metadata.CONTENT_TYPE, mimeType);
        Parser parser = new AutoDetectParser();
        ContentHandler handler = new BodyContentHandler(
                value.getLength());
        try {
            parser.parse(in, handler, metadata, new ParseContext());
        } catch (SAXException e1) {
            logger.info(e1.getMessage());
            e1.printStackTrace();
        } catch (TikaException e1) {
            logger.info(e1.getMessage());
            e1.printStackTrace();
        }
        in.close();
        logger.info("Content AFTER" + handler.toString());
    ------------------------------
}

The output is written to HBase, but the content of the document is empty
after parsing.

Am I missing anything here?
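One detail in the snippet above that may be worth checking, offered as a possibility rather than a confirmed cause: BytesWritable.getBytes() returns the backing array, which can be longer than getLength(), so the stream handed to the parser may carry trailing padding bytes. The sketch below uses only the JDK and stands in a plain byte array for the BytesWritable; the class name ValidBytesSketch and the helpers trimToLength and streamOverValid are hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.util.Arrays;

// Minimal sketch: restrict a padded backing buffer to its valid length.
// "backing" stands in for value.getBytes(), "validLength" for value.getLength().
final class ValidBytesSketch {

    // Hypothetical helper: copy only the first validLength bytes.
    public static byte[] trimToLength(byte[] backing, int validLength) {
        return Arrays.copyOf(backing, validLength);
    }

    // Hypothetical helper: stream over the valid region without copying.
    public static ByteArrayInputStream streamOverValid(byte[] backing, int validLength) {
        return new ByteArrayInputStream(backing, 0, validLength);
    }

    public static void main(String[] args) {
        // Five valid bytes followed by three bytes of padding.
        byte[] backing = {'h', 'e', 'l', 'l', 'o', 0, 0, 0};
        int validLength = 5;

        System.out.println(backing.length);                                    // 8
        System.out.println(trimToLength(backing, validLength).length);         // 5
        System.out.println(streamOverValid(backing, validLength).available()); // 5
    }
}
```

In the mapper, the equivalent would be to build the stream with new ByteArrayInputStream(value.getBytes(), 0, value.getLength()), or to use value.copyBytes() where the Hadoop version provides it. Whether this alone explains the empty content is not established by the message.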
