hadoop-common-dev mailing list archives

From "Ankur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1824) want InputFormat for zip files
Date Tue, 29 Jan 2008 13:19:34 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563510#action_12563510 ]

Ankur commented on HADOOP-1824:

> The need here is to ...
Callback from C to Java is fine for read(). But seek() might be an issue, since for true random
access we need to be able to seek forwards and backwards from:
1. start of the stream
2. current position of the stream
3. end of the stream

After taking a deep dive into the minizip code and implementing some POC code, I am not sure
how a seek() callback from C to Java might be implemented in a way that can be leveraged by the
existing minizip parser code. Any suggestions?
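
One way to frame the seek() problem: all three origins collapse to a single absolute offset once the current position and the total stream length are known, so the Java side would only need an absolute seek. The following is a minimal sketch of that arithmetic, not minizip's API. The names resolve_seek and seek_origin and the ORIGIN_* codes are hypothetical and exist only for this illustration (minizip defines its own seek constants in ioapi.h), and it assumes cur_pos and length can be obtained from the Java stream.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical origin codes for this sketch only; minizip defines its own
   seek constants in ioapi.h. */
enum seek_origin { ORIGIN_START = 0, ORIGIN_CURRENT = 1, ORIGIN_END = 2 };

/* Resolve (offset, origin) to an absolute stream position.
   cur_pos and length are assumed to be supplied by the Java stream.
   Returns the absolute position, or -1 if the target lies outside
   [0, length] or the origin is unknown. */
int64_t resolve_seek(int64_t cur_pos, int64_t length,
                     int64_t offset, enum seek_origin origin)
{
    int64_t base;

    /* Precondition: a valid position inside a stream of known length. */
    assert(cur_pos >= 0 && length >= 0 && cur_pos <= length);

    switch (origin) {
    case ORIGIN_START:   base = 0;       break;
    case ORIGIN_CURRENT: base = cur_pos; break;
    case ORIGIN_END:     base = length;  break;
    default:             return -1;
    }

    /* Reject targets before the start or past the end. Because base is in
       [0, length], neither comparison can overflow. */
    if (offset < -base || offset > length - base)
        return -1;

    return base + offset;
}
```

A JNI seek callback could then pass the resolved value to an absolute seek(long) on the Java stream, for example the one declared by Hadoop's Seekable interface and implemented by FSDataInputStream. A plain java.io.DataInputStream has no such method, so the stream type handed to the native code would probably have to change.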

Just to give an idea, here is some sample code for read() that I implemented.

// including zlib & minizip libraries
#include "unzip.h"

// including java library
#include <jni.h>
#include "ZipInputFormat.h"

// printf(), malloc() and free()
#include <stdio.h>
#include <stdlib.h>

// defining the read() IO callback; seek() is not implemented yet

uLong ZCALLBACK fread_file_func
( voidpf opaque, voidpf stream, void* buf, uLong size)
{
    jint bytesRead = 0;
    JNIEnv *env = (JNIEnv *) opaque;
    jobject javaStream = (jobject) stream;
    jclass dataInputStream = (*env)->GetObjectClass(env, javaStream);
    jmethodID MID_read = (*env)->GetMethodID(env, dataInputStream, "read", "([BII)I");
    if(MID_read == NULL)    {
        printf("\nfread_file_func(): read() method not found");
    }
    else    {
        // temporary Java byte[] that read() fills; copied into buf afterwards
        jbyteArray byteArray = (*env)->NewByteArray(env, (jsize) size);
        if(byteArray != NULL)    {
            bytesRead = (*env)->CallIntMethod(env, javaStream, MID_read, byteArray, 0, (jint) size);
            // read() returns -1 at end of stream; report that (or an exception) as 0 bytes
            if((*env)->ExceptionCheck(env) || bytesRead < 0)
                bytesRead = 0;
            else
                (*env)->GetByteArrayRegion(env, byteArray, 0, bytesRead, (jbyte *) buf);
            (*env)->DeleteLocalRef(env, byteArray);
        }
        printf("\nNumber of bytes read: %d\n", (int) bytesRead );
    }
    return (uLong) bytesRead;
}

// the native function exposed to Java, declared as a static method
// dataStream is of type java.io.DataInputStream.
// zipClass is the ZipInputFormat class

JNIEXPORT void JNICALL Java_ZipInputFormat_display
  (JNIEnv *env, jclass zipClass, jobject dataStream)
{
  unsigned char * buf = (unsigned char *) malloc( sizeof (unsigned char) * 1024 * 64);
  if(buf == NULL)
    return;
  fread_file_func(env, dataStream, buf, 64*1024);
  free(buf);
}

> want InputFormat for zip files
> ------------------------------
>                 Key: HADOOP-1824
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1824
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.15.2
>            Reporter: Doug Cutting
>         Attachments: ZipInputFormat_fixed.patch
> HDFS is inefficient with large numbers of small files.  Thus one might pack many small
> files into large, compressed archives.  But, for efficient map-reduce operation, it is desirable
> to be able to split inputs into smaller chunks, with one or more small original files per split.
> The zip format, unlike tar, permits enumeration of files in the archive without scanning
> the entire archive.  Thus a zip InputFormat could efficiently permit splitting large archives
> into splits that contain one or more archived files.
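
The enumeration property mentioned in the description comes from the zip central directory, which is located through the end-of-central-directory (EOCD) record at the tail of the archive. As a rough illustration, the sketch below scans the tail bytes of an archive for that record and reads the entry count and the central directory offset. The function names are hypothetical, and it assumes a single-disk, non-Zip64 archive whose tail has already been read into memory. It also takes the last signature match at face value, although the same four bytes could in principle occur inside an archive comment.

```c
#include <stddef.h>
#include <stdint.h>

#define EOCD_MIN_SIZE 22  /* fixed-size part of the EOCD record, without comment */

/* Scan backwards for the EOCD signature bytes 'P' 'K' 0x05 0x06.
   Returns the record's offset within buf, or -1 if none is found. */
long find_eocd(const unsigned char *buf, size_t len)
{
    size_t i;

    if (len < EOCD_MIN_SIZE)
        return -1;

    /* Start at the last position where a full record still fits. */
    for (i = len - EOCD_MIN_SIZE + 1; i-- > 0; ) {
        if (buf[i] == 0x50 && buf[i + 1] == 0x4b &&
            buf[i + 2] == 0x05 && buf[i + 3] == 0x06)
            return (long) i;
    }
    return -1;
}

/* Total number of entries: 2-byte little-endian field at offset 10. */
unsigned eocd_entry_count(const unsigned char *eocd)
{
    return (unsigned) eocd[10] | ((unsigned) eocd[11] << 8);
}

/* Start of the central directory: 4-byte little-endian field at offset 16. */
uint32_t eocd_cd_offset(const unsigned char *eocd)
{
    return (uint32_t) eocd[16] | ((uint32_t) eocd[17] << 8) |
           ((uint32_t) eocd[18] << 16) | ((uint32_t) eocd[19] << 24);
}
```

Because the record sits at the end of the file, a reader has to position itself relative to the end of the stream before it can list any entries.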

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
