pig-commits mailing list archives

From: Apache Wiki <wikidi...@apache.org>
Subject: [Pig Wiki] Update of "HowlJournal" by AlanGates
Date: Mon, 06 Dec 2010 22:26:55 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The "HowlJournal" page has been changed by AlanGates.
http://wiki.apache.org/pig/HowlJournal?action=diff&rev1=4&rev2=5

--------------------------------------------------

  == Work in Progress ==
  
  || Feature           || Description ||
- || Authentication    || Integrate Howl with security work done on Hadoop so that users can be properly authenticated. ||
+ || Authentication    || See HowlAuthentication ||
  || Authorization     || See HowlAuthorizationProposal ||
+ || Data Import/Export || See HowlImportExport ||
  
  
  == Proposed Work ==
@@ -34, +35 @@

  '''Allow specification of general storage type'''<<BR>> Currently Hive requires the user to name a specific storage format for a table; for example, the user can say `STORED AS RCFILE`.  We would like to enable users to select a general storage type (columnar, row, or text) without needing to know the underlying format being used.  Thus it would be legal to say `STORED AS ROW` and let the administrators decide whether sequence file or tfile is used to store data in row format.
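  (A minimal sketch of the contrast, issued through Hive's JDBC driver.  The table names and columns are invented for illustration, and `STORED AS ROW` is the proposed syntax -- current Hive will reject it.)
  {{{
  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.Statement;

  public class GeneralStorageTypeSketch {
    public static void main(String[] args) throws Exception {
      Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
      Connection con =
          DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();

      // Today: the user must commit to a concrete format.
      stmt.execute("CREATE TABLE clicks_rc (ts BIGINT, url STRING) STORED AS RCFILE");

      // Proposed: the user names only the general type; administrators
      // decide whether ROW maps to sequence file or tfile underneath.
      // (Proposed syntax -- not accepted by current Hive.)
      stmt.execute("CREATE TABLE clicks_row (ts BIGINT, url STRING) STORED AS ROW");

      con.close();
    }
  }
  }}}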
  
  '''Mark a set of partitions done''' <<BR>> Often users create a collection of data sets together, though different sets may be completed at different times.  For example, users might partition their web server logs by date and region.  Some users may wish to read only a particular region and have no interest in waiting until all of the regions are complete.  Others will want to wait until all regions are complete before beginning processing.  Since all partitions are committed individually, Howl gives users no way to know when all partitions for the day are present.  A way is needed for the writer to signal that all partitions with a given key value (such as date = today) are complete, so that users waiting for the entire collection can begin.  This signal will need to be propagated through to the notification system.
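  (A minimal sketch of the writer-side signal.  `HowlClient` and `markPartitionsDone` are hypothetical names invented to illustrate the proposal; Howl has no such API today.)
  {{{
  import java.util.HashMap;
  import java.util.Map;

  // Hypothetical interface, defined here only to make the sketch
  // self-contained; Howl provides no such class today.
  interface HowlClient {
    void markPartitionsDone(String table, Map<String, String> partitionKeyValue);
  }

  public class SignalDayComplete {
    // The writer would invoke this once every region partition for the
    // given date has been committed.
    static void signalDone(HowlClient client) {
      Map<String, String> keyValue = new HashMap<String, String>();
      keyValue.put("date", "20101206");  // e.g. date = today

      // Marks all partitions sharing this key value complete; the
      // notification system would then alert readers waiting for the
      // entire collection.
      client.markPartitionsDone("weblogs", keyValue);
    }
  }
  }}}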
- 
- '''Data Import/Export''' <<BR>> Howl currently provides a single input and output
format (or loader or serde) that can be used for any data in Howl.  However, users would like
to be able to take this data out of Howl in preparation for moving it off the grid.  They
would also like to be able to prepare data for import into Howl when they are running jobs
that may not be able to interact with Howl.  An import/export format will be defined that
allows data to be imported into, exported from, and replicated between Howl instances.  This
format will provide an !InputFormat and !OutputFormat as well as a Pig load and store function
and a Hive !SerDe.  The collections of data created by these tools will contain schema information,
storage information (that is, what underlying format is the data in, how is it compressed,
etc.), and sufficient metadata to create it in another Howl instance.
  
  '''Data Compaction''' <<BR>> Very frequently users wish to store data in a very fine-grained manner because their queries tend to access only specific partitions of the data.  Consider, for example, a user who downloads logs every hour from a website operating in twenty countries, keeps those logs for a year, and stores each hour as one hundred part files.  That is 20 × 24 × 365 × 100 = 17,520,000 files for just this one input.  This places a significant burden on the namenode.  A way is needed to compact these into larger files while preserving the ability to address individual partitions.  This compaction may be done while the file is being written, soon after the data is written, or at some later point.  For an example of the last case, consider hourly data.  For the first few days hourly data may have significant value.  After a week, it is less likely that users will be interested in any given hour of data, so the hourly data may be compacted into daily data at that point.  A small performance degradation will be acceptable to achieve this compaction.  Hadoop archives (har) will be evaluated for implementing this feature.  Whether this compaction is automatically initiated by Howl or requires user or administrator initiation is TBD.
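  (One possible shape for the har-based approach: driving Hadoop archives programmatically.  `org.apache.hadoop.tools.HadoopArchives` is the class behind the `hadoop archive` command; the paths below are invented, and the flags may vary across Hadoop versions.)
  {{{
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.tools.HadoopArchives;
  import org.apache.hadoop.util.ToolRunner;

  public class CompactDailyPartitions {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Pack one day of hourly partitions (24 hours x 100 part files)
      // into a single archive, leaving only a handful of har index and
      // data files for the namenode to track.
      String[] harArgs = {
          "-archiveName", "20101206.har",
          "-p", "/logs/weblogs/us",       // parent directory (invented path)
          "20101206",                     // day directory to archive
          "/logs/weblogs-compacted/us"    // destination (invented path)
      };
      System.exit(ToolRunner.run(conf, new HadoopArchives(conf), harArgs));
    }
  }
  }}}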
  
@@ -53, +52 @@

  
  '''Schema Evolution'''<<BR>>  Currently schema evolution in Hive is limited to adding columns at the end of the non-partition-key columns.  It may be desirable to support other forms of schema evolution, such as adding columns in other parts of the record, or arranging that new partitions for a table no longer contain a given column.
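  (For contrast, the one form of evolution Hive does support today, shown through its JDBC driver; the table and column names are invented for illustration.)
  {{{
  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.Statement;

  public class SchemaEvolutionToday {
    public static void main(String[] args) throws Exception {
      Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
      Connection con =
          DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();

      // Supported today: appending a column after the existing
      // non-partition-key columns.
      stmt.execute("ALTER TABLE weblogs ADD COLUMNS (referrer STRING)");

      // Not supported today: inserting a column mid-record, or making
      // new partitions omit an existing column -- the forms of
      // evolution discussed above.
      con.close();
    }
  }
  }}}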
  
+ '''Support for streaming'''<<BR>>  Currently Howl does not support Hadoop streaming
users.  It should.
+ 
+ '''Integration with HBase'''<<BR>>  Currently Howl does not support HBase tables.  It needs storage drivers so that !HowlInputFormat and !HowlLoader can do bulk reads and !HowlOutputFormat and !HowlStorage can do bulk writes.  We also need to understand what interface, if any, it makes sense for Howl to expose for point reads and writes on Howl tables that use HBase as a storage mechanism.
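  (A purely illustrative sketch of the storage-driver shape this implies.  Howl defines no such interface yet, so every name below is an assumption.)
  {{{
  import org.apache.hadoop.mapreduce.InputFormat;
  import org.apache.hadoop.mapreduce.OutputFormat;

  // Hypothetical interface; not part of Howl. An HBase-backed
  // implementation might wrap HBase's TableInputFormat and
  // TableOutputFormat for the bulk paths.
  interface HowlStorageDriver<K, V> {
    // Bulk-read path backing HowlInputFormat / HowlLoader.
    InputFormat<K, V> createInputFormat();

    // Bulk-write path backing HowlOutputFormat / HowlStorage.
    OutputFormat<K, V> createOutputFormat();
  }
  }}}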
+ 
