nifi-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NIFI-296) Extend the capability of IdentifyMimeType and extract document metadata
Date Thu, 19 Feb 2015 00:38:11 GMT

    [ https://issues.apache.org/jira/browse/NIFI-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326813#comment-14326813
] 

ASF GitHub Bot commented on NIFI-296:
-------------------------------------

Github user markap14 commented on a diff in the pull request:

    https://github.com/apache/incubator-nifi/pull/27#discussion_r24956579
  
    --- Diff: nifi/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java
---
    @@ -327,148 +168,41 @@ public void process(final InputStream in) throws IOException {
             session.transfer(flowFile, REL_SUCCESS);
         }
     
    -    private static interface ContentScanningMimeTypeIdentifier {
    -
    -        boolean isEnabled(ProcessContext context);
    -
    -        String getMimeType(InputStream in) throws IOException;
    -    }
    -
    -    private static class ZipIdentifier implements ContentScanningMimeTypeIdentifier {
    -
    -        @Override
    -        public String getMimeType(final InputStream in) throws IOException {
    -            final ZipInputStream zipIn = new ZipInputStream(in);
    -            try {
    -                if (zipIn.getNextEntry() != null) {
    -                    return "application/zip";
    -                }
    -            } catch (final Exception e) {
    -            }
    -            return null;
    -        }
    -
    -        @Override
    -        public boolean isEnabled(final ProcessContext context) {
    -            return context.getProperty(IDENTIFY_ZIP).asBoolean();
    -        }
    -    }
    -
    -    private static class TarIdentifier implements ContentScanningMimeTypeIdentifier {
    -
    -        @Override
    -        public String getMimeType(final InputStream in) throws IOException {
    -            try (final TarArchiveInputStream tarIn = new TarArchiveInputStream(in)) {
    -                final TarArchiveEntry firstEntry = tarIn.getNextTarEntry();
    -                if (firstEntry != null) {
    -                    if (firstEntry.getName().equals(FlowFilePackagerV1.FILENAME_ATTRIBUTES))
{
    -                        final TarArchiveEntry secondEntry = tarIn.getNextTarEntry();
    -                        if (secondEntry != null && secondEntry.getName().equals(FlowFilePackagerV1.FILENAME_CONTENT))
{
    -                            return "application/flowfile-v1";
    -                        }
    -                    }
    -                    return "application/tar";
    -                }
    -            } catch (final Exception e) {
    -            }
    -            return null;
    -        }
    -
    -        @Override
    -        public boolean isEnabled(final ProcessContext context) {
    -            return context.getProperty(IDENTIFY_TAR).asBoolean();
    -        }
    +    private Detector getFlowFileV3Detector() {
    +        return new MagicDetector(FLOWFILE_V3, FlowFilePackagerV3.MAGIC_HEADER);
         }
     
    -    private static interface MagicHeader {
    -
    -        int getRequiredBufferLength();
    -
    -        String getMimeType();
    -
    -        boolean matches(final byte[] header);
    +    private Detector getFlowFileV1Detector() {
    +        return new FlowFileV1Detector();
         }
     
    -    private static class SimpleMagicHeader implements MagicHeader {
    -
    -        private final String mimeType;
    -        private final int offset;
    -        private final byte[] byteSequence;
    -
    -        public SimpleMagicHeader(final String mimeType, final byte[] byteSequence) {
    -            this(mimeType, byteSequence, 0);
    -        }
    -
    -        public SimpleMagicHeader(final String mimeType, final byte[] byteSequence, final
int offset) {
    -            this.mimeType = mimeType;
    -            this.byteSequence = byteSequence;
    -            this.offset = offset;
    -        }
    -
    -        @Override
    -        public int getRequiredBufferLength() {
    -            return byteSequence.length + offset;
    -        }
    -
    -        @Override
    -        public String getMimeType() {
    -            return mimeType;
    -        }
    +    private class FlowFileV1Detector implements Detector {
     
             @Override
    -        public boolean matches(final byte[] header) {
    -            if (header.length < getRequiredBufferLength()) {
    -                return false;
    +        public MediaType detect(InputStream in, Metadata mtdt) throws IOException {
    +            // Sanity check the stream. This may not be a tarfile at all
    +            in.mark(FlowFilePackagerV1.FILENAME_ATTRIBUTES.length());
    +            byte[] bytes = new byte[FlowFilePackagerV1.FILENAME_ATTRIBUTES.length()];
    +            in.read(bytes);
    --- End diff --
    
    I would use "StreamUtils.fillBuffer(in, bytes, false)" here, to make sure that we get
at least the number of bytes we need. I can't imagine that a call to read() in this case would
read fewer than the 19 bytes that we need... but just to be sure :)


> Extend the capability of IdentifyMimeType and extract document metadata
> -----------------------------------------------------------------------
>
>                 Key: NIFI-296
>                 URL: https://issues.apache.org/jira/browse/NIFI-296
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>            Reporter: Joseph Witt
>            Priority: Minor
>
> Apache Tika is pretty awesome and can handle a large range of document types.  It could
perhaps be used to extend the capability of IdentifyMimeType and it could also potentially
be used to automatically extract document metadata/data as flow file attributes to be used
for data flow routing decisions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message