hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-5687) Streaming support in Hive
Date Sun, 06 Apr 2014 17:39:14 GMT

    [ https://issues.apache.org/jira/browse/HIVE-5687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961469#comment-13961469
] 

Owen O'Malley commented on HIVE-5687:
-------------------------------------

Initial comments
* Please create a package-info.java (or package.html) for the entire package that has the
text from the design document, but without the example.
* I believe the API will go through some iterations and use before it becomes stable. We should
warn users that it will likely evolve in future versions of Hive and won't necessarily be
backwards compatible. The package-info.java is probably the best place to place the warning.
* The current API requires users to implement a RecordWriter wrapper for each SerDe they want
to use. In Hive 0.14, I think we need to revisit this and switch to just requiring a serde
class name and a string to string map of serde properties. This way, any of a user's current
SerDes can be used to parse the byte[] and there can be a generic method for constructing
the object using reflection.
* The code shouldn't depend on OrcOutputFormat, but instead find the OutputFormat of the table/partition
and use that. The streaming code should only require that it implement AcidOutputFormat.
* The RecordWriter should be passed the HiveConf rather than create it. It will make it easier
to do unit tests.
* The StreamingIntegrationTester needs to print the exception's getMessage to stderr if the
options don't parse correctly. Otherwise, the user doesn't get any clue as to which parameter
they forgot.
* I don't see how the column reordering can be invoken. The SerDe is using the table properties
from the table in the MetaStore to define the columns it returns, so the two should always
be the same. My suggestion is to remove all of the column reordering code.
* If you don't remove the column ordering code, you should deserialize and then reorder the
columns rather than the current strategy of deserialize, reorder, serialize, and deserialize.
* Revert the change that adds startMetaStore. It isn't called and thus shouldn't be added.
* The method writeImpl(byte[]) doesn't add any value and should just be inlined.
* Why do you use DDL to create partitions rather than the MetaStoreClient API that you everywhere
else?

Some style guidelines:
* Please split the lines that are longer than 100 characters and it is even better if they
are less than 80 characters.
* Ensure your if statements have a space before the parenthesis.
* Remove the commented out code.
* Please remove the private uncalled functions (eg. HiveEndPoint.newConnection(String proxyUser,
...) )
* You've defined a lot of exceptions in this API. Is the user expected to handle each exception
separately? Your throw declarations don't list the exact exceptions that are thrown. You'd
be better off with different exceptions only when the user is expected to be able to handle
a specific error. Otherwise, you might as well use StreamingException with a descriptive error
message for everything.


> Streaming support in Hive
> -------------------------
>
>                 Key: HIVE-5687
>                 URL: https://issues.apache.org/jira/browse/HIVE-5687
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Roshan Naik
>            Assignee: Roshan Naik
>              Labels: ACID, Streaming
>             Fix For: 0.13.0
>
>         Attachments: 5687-api-spec4.pdf, 5687-draft-api-spec.pdf, 5687-draft-api-spec2.pdf,
5687-draft-api-spec3.pdf, HIVE-5687-unit-test-fix.patch, HIVE-5687.patch, HIVE-5687.v2.patch,
HIVE-5687.v3.patch, HIVE-5687.v4.patch, HIVE-5687.v5.patch, Hive Streaming Ingest API for
v3 patch.pdf, Hive Streaming Ingest API for v4 patch.pdf
>
>
> Implement support for Streaming data into HIVE.
> - Provide a client streaming API 
> - Transaction support: Clients should be able to periodically commit a batch of records
atomically
> - Immediate visibility: Records should be immediately visible to queries on commit
> - Should not overload HDFS with too many small files
> Use Cases:
>  - Streaming logs into HIVE via Flume
>  - Streaming results of computations from Storm



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message