falcon-dev mailing list archives

From "Venkat Ramachandran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FALCON-1240) Data Import and Export
Date Fri, 17 Jul 2015 19:03:04 GMT

    [ https://issues.apache.org/jira/browse/FALCON-1240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631744#comment-14631744

Venkat Ramachandran commented on FALCON-1240:

HCAT Related discussion: 


Regarding writing to HCat, you mentioned that the concrete implementation (i.e., Sqoop) should use Falcon-provided
facilities to write to HCat.
But Sqoop extracts the data from the database and writes directly to HCat/Hive, supplying all
the needed partition keys; we won't get a stream from Sqoop in the first place.

Also, I think Sqoop examines each row and maps it, based on a column value, to a partition in
HCatalog (dynamic partitioning, performed by HCatalog). Venkat, please confirm.

With this assumption, how can we utilize Falcon facilities to write to HCat? If we bypass
Falcon and let Sqoop do it, are there any issues? Will all aspects of the lifecycle work?



During the call we discussed figuring out the HCat target based on the feed definition
in Falcon. The suggestion wasn't to pull the data via Sqoop and have additional work performed
by Falcon to push it into HCat. Since Falcon supports the concept of catalog-based storage
for feeds, you have all the necessary information to complete the import into Hive directly
via Sqoop, without having to redundantly declare any information about the HCat table in the feed
definition or elsewhere.
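For illustration, a feed using catalog-based storage already names the target database, table, and partition template, which is what the import would need. A minimal sketch (the feed, database, table, and partition-key names here are hypothetical):

```xml
<!-- Sketch only: names are illustrative. The catalog URI takes the form
     catalog:<database>:<table>#<partition-key>=<value>. -->
<feed name="customer-feed" xmlns="uri:falcon:feed:0.1">
  <!-- other feed elements elided -->
  <table uri="catalog:default:customers#ds=${YEAR}-${MONTH}-${DAY}"/>
</feed>
```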

Srikanth Sundarrajan


When we have catalog storage, data ingestion would pick an HCatalog table as the target,
and the static partition keys can be deduced from the storage description. That is
good. Sqoop, from 1.4.5 onward, allows multiple static partition keys.
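Concretely, once the database, table, and static partition keys are deduced from the feed's catalog storage, a generated Sqoop invocation could look roughly like the following sketch (connection string, table, and partition names are hypothetical; the `--hcatalog-*` options are Sqoop's HCatalog-integration flags, with `--hcatalog-partition-keys`/`--hcatalog-partition-values` accepting comma-separated lists as of 1.4.5):

```shell
# Sketch only: connection details, table, and partition values are illustrative.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username falcon \
  --password-file /user/falcon/.db.pwd \
  --table customers \
  --hcatalog-database default \
  --hcatalog-table customers \
  --hcatalog-partition-keys year,month \
  --hcatalog-partition-values 2015,07
```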

Where I have had conflicting thoughts since our conversation yesterday is the
filtering aspect (which may be what Venky had in mind with his question).
In my view, Falcon primarily moves data without making any changes to the data it receives.
Modifying or transforming data would take Falcon into supporting new paradigms, which would
need more architectural thought on the infrastructure to build and expose.

The SQL predicate usage is different in the sense that we let the SQL engine provide the data,
so it is not as if the Falcon runtime operates on it. I think we should table
filtering support for now.
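If filtering is ever pushed down to the SQL engine rather than handled by Falcon, Sqoop already exposes this as a row-selection predicate. A hedged sketch (connection details, column, and value are hypothetical):

```shell
# Sketch only: the --where option passes a predicate to the source database,
# so the filtering happens in the SQL engine, not in Falcon or Sqoop itself.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table customers \
  --where "updated_at >= '2015-07-01'" \
  --hcatalog-database default \
  --hcatalog-table customers
```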


One other thing: these are important discussions for Falcon. It would be ideal if we could allow
other folks such as Venkatesh, Pallavi, Shwetha, Sowmya ... to chime in if they have views
on this. Does it make sense to move these discussions to the public list?

Srikanth Sundarrajan

> Data Import and Export 
> -----------------------
>                 Key: FALCON-1240
>                 URL: https://issues.apache.org/jira/browse/FALCON-1240
>             Project: Falcon
>          Issue Type: New Feature
>          Components: acquisition
>            Reporter: Venkat Ramachandran
>            Assignee: Venkat Ramachandran
>         Attachments: Falcon Data Ingestion - Proposal.docx
> JIRA to track Data Import and Export design and implementation discussions
> Attaching proposal to start with.

This message was sent by Atlassian JIRA
