falcon-dev mailing list archives

From "Venkatesh Seetharam (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FALCON-129) Disable Late data handling for hive tables
Date Wed, 16 Oct 2013 18:35:44 GMT

    [ https://issues.apache.org/jira/browse/FALCON-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797093#comment-13797093 ]

Venkatesh Seetharam commented on FALCON-129:

Thanks a ton [~sriksun] for taking time to review this humongous patch.

bq. 2. Possibly incorrect checkstyle warning suppression
Good catch. 

bq. 3. Process involving table storage shouldn't be considered for late handling
Very good catch. Thanks!

bq. 4. FeedCleanupHandler, uses the FileStatus array for deletion.
Will do. The check is missing in the abstract handler as well; I will add it in the delete method.

bq. 5. Would it help to have test cases added to FeedEvictor for catalog storage type.
These cases are covered in int-tests, since mocking the static CatalogService is hard.
org.apache.falcon.catalog.TableStorageFeedEvictorIT covers both managed and external tables.

bq. 6. From FeedEntityParser code it looks like feed entities with late arrival section is
Parse is called, but validate is not, in the common module. All validations that require services
are in int-tests, hence this was not caught. Will definitely change the entity.

bq. 7. Any specific reason to comment out this in oozie-workflow-0.3.xsd
Good question. I had to add an xs:any element for hive actions in replication, and it
conflicted with the existing xs:any for sla. Hence I commented the sla one out, as we are not
using it in falcon and it is too specific to Yahoo! and GMS.
{code}<xs:any namespace="##other" minOccurs="1" maxOccurs="1"/>{code}
There are ways to override it:
* with specific bindings in jaxb, but I thought that was unnecessary anyway
* having java actions instead of hive for import and export - we should do this in the future
so it's portable across oozie

bq. This is indeed a very complex feature and patch is very clean and changes are fairly intuitive.
Thanks! :-)

I plan to upload the cumulative patch to this jira.

> Disable Late data handling for hive tables
> ------------------------------------------
>                 Key: FALCON-129
>                 URL: https://issues.apache.org/jira/browse/FALCON-129
>             Project: Falcon
>          Issue Type: Sub-task
>    Affects Versions: 0.3
>            Reporter: Venkatesh Seetharam
>            Assignee: Venkatesh Seetharam
>         Attachments: FALCON-129.patch, FALCON-129-r1.patch
> Neither HCat nor Hive APIs expose internal stats about a given partition. The only way to get
> the partition size is to get the location of the partition on HDFS and then use globStatus
> and contentSummary APIs.
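
The glob-then-sum approach from the description can be sketched as below. This is only a local-filesystem analogue written with java.nio so that it runs standalone; it is not Falcon code and not the Hadoop API. On HDFS the corresponding calls would be FileSystem.globStatus(Path) to expand the partition location and FileSystem.getContentSummary(Path).getLength() to get the size. The class name, method names and the ds=... directory layout are hypothetical, chosen for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical sketch: local-filesystem stand-in for the HDFS approach.
class PartitionSizeSketch {

    // Total bytes of all regular files under 'root', recursively.
    // Plays the role of getContentSummary(path).getLength() on HDFS.
    static long contentLength(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .mapToLong(p -> {
                            try {
                                return Files.size(p);
                            } catch (IOException e) {
                                throw new UncheckedIOException(e);
                            }
                        })
                        .sum();
        }
    }

    // Expands 'glob' against the entries of 'tableDir' (the globStatus step)
    // and adds up the content length of every match.
    static long partitionSize(Path tableDir, String glob) throws IOException {
        long total = 0;
        try (DirectoryStream<Path> matches = Files.newDirectoryStream(tableDir, glob)) {
            for (Path partition : matches) {
                total += contentLength(partition);
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway table directory with two partitions.
        Path table = Files.createTempDirectory("clicks");
        Path p1 = Files.createDirectories(table.resolve("ds=2013-10-15"));
        Path p2 = Files.createDirectories(table.resolve("ds=2013-10-16"));
        Files.write(p1.resolve("part-00000"), new byte[100]);
        Files.write(p2.resolve("part-00000"), new byte[40]);
        Files.write(p2.resolve("part-00001"), new byte[2]);

        System.out.println(partitionSize(table, "ds=2013-10-16")); // prints 42
    }
}
```

Note that the size is only known after walking the files under the partition location, which is the cost the description is pointing at: there is no stored per-partition statistic to read instead.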

This message was sent by Atlassian JIRA
