hadoop-general mailing list archives

From atreju <n.atr...@gmail.com>
Subject Re: How to apply RDBMS table updates and deletes into Hadoop
Date Thu, 10 Jun 2010 18:44:40 GMT
Thank you for your response. I understand... Just a few points before I
accept that this is too complicated :)

The main idea is to keep multiple versions of data under the same table,
similar to HBase but at the row level, without making the older versions
accessible from Hive; only the most recent one is exposed. You just need an
access layer that works on the most recent version of each row. It does not
have to be the columns I specified before: any way of uniquely identifying a
row, plus a timestamp (or counter, or version number) to pick the most recent
version, would work. It could even be a separate file maintained in the
background (which could also serve as the index file!).

Oracle has ROWID for the physical location of a row and locks the row before
manipulating the data. Hadoop's advantages are storage and map-reduce, so why
not use them: keep all versions of the changed data and use map-reduce to
access the most recent one. Access can get slower over time as versions
accumulate, but that can be fixed by a flush or a full rewrite of the data
from time to time, run by the end user in a maintenance window.
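That periodic flush could be as simple as rewriting the accumulated version log so each row keeps only its newest version, after which scans are fast again. A minimal sketch, assuming a tab-separated file layout of `row_key`, `version_ts`, `op`, `data` (an assumption for illustration, not a real Hive table format):

```python
# Hypothetical "flush" / compaction pass for the version log described above.
# Assumed file layout: tab-separated row_key, version_ts, op, data per line.

def compact(in_path, out_path):
    """Rewrite the version log, keeping only the newest version of each row
    and dropping rows whose newest version is a delete marker."""
    newest = {}
    with open(in_path) as f:
        for line in f:
            row_key, version_ts, op, data = line.rstrip("\n").split("\t")
            ts = int(version_ts)
            if row_key not in newest or ts > newest[row_key][0]:
                newest[row_key] = (ts, op, data)
    with open(out_path, "w") as out:
        for row_key, (ts, op, data) in sorted(newest.items()):
            if op != "D":  # deleted rows disappear entirely after compaction
                out.write(f"{row_key}\t{ts}\t{op}\t{data}\n")
```

In practice this would be a map-reduce job writing a fresh set of files (in the spirit of Hive's INSERT OVERWRITE), but the single-file version shows the shape of the work.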

Hive is a great tool for accessing and manipulating Hadoop files, and you are
doing an amazing job. I have no idea what complications you face each day, so
just disregard this if I am talking nonsense, and keep up the good work!



> Your work is great. Personally, I would not get too tied up in the
> transactional side of Hive. Once you start dealing with locking and
> concurrency, the problem becomes tricky.
> We Hivers have a long-standing tradition of 'punting' on complicated stuff
> we do not want to deal with. :) Thus we only have 'INSERT OVERWRITE', no
> 'INSERT UPDATE'. :)
> Again, I think you wrote a really cool application. It would make a great
> use case, blog post, or stand-alone application. Call it HiveMysqlRsync or
> something :). However, you mention several requirements that are specific to
> your application: timestamp and primary key. If you can abstract out all
> your application-specific logic, it could make its way into Hive. But it
> might be better as a stand-alone program, because Hive-to-RDBMS replication
> might be a little out of scope.
> Edward
