From: Apache Wiki
To: pig-commits@incubator.apache.org
Reply-To: pig-dev@incubator.apache.org
Date: Tue, 20 Nov 2007 22:31:58 -0000
Message-ID: <20071120223158.7083.11967@eos.apache.org>
Subject: [Pig Wiki] Update of "PigAbstractionLayer" by AntonioMagnaghi

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by AntonioMagnaghi:
http://wiki.apache.org/pig/PigAbstractionLayer

New page:
##master-page:FrontPage
#format wiki
#language en
#pragma section-numbers off

= Pig Abstraction Layer =

== Introduction and Rationale ==

Many of the activities that Pig carries out while compiling and executing Pig Latin queries are currently tied closely to the Hadoop file system and the Hadoop Map-Reduce paradigm. For instance, file management, job submission and job tracking in the Pig client explicitly assume that a Hadoop cluster is available for the client to connect to.

It is possible, however, to envision an architecture in which the front-end of the system (i.e. the Pig client) has a more abstract notion of the back-end. In this context, a Hadoop cluster can be regarded as one particular instance of a family of back-ends, all of which provide similar functionality that can be accessed via the same API.

The main motivations behind this proposal can be summarized as follows:

 - A well-defined set of APIs that a back-end needs to support in order to run Pig Latin queries makes it easier to port those APIs to different platforms. This could foster wider adoption of Pig.
 - Changes in the various back-ends can be encapsulated within the actual implementations of the generic APIs. The front-end then needs fewer modifications, which results in a more stable code-base.

A proper API design should be general enough to support the back-ends that Pig currently supports: Hadoop, Galago (see the section below) and the local back-end (i.e. the local file system and the local execution type).

== Relevant links ==

[http://www.galagosearch.org/ Galago] is a research project started by Trevor Strohman at the University of Massachusetts, Amherst. Galago is a search engine with its own execution back-end.
Galago is able to execute Pig Latin queries by translating them into its own representation language (TupleFlow jobs).

== API Specification ==

The basic functionality that a back-end may need to export to the Pig client can be categorized into two main abstractions:

 - '''Data Storage''': provides functionality for storing and retrieving data. It encapsulates the typical operations supported by file systems, such as creating a data object or opening it for reading or writing.
 - '''Query Execution/Tracking''': provides functionality to parse a Pig Latin program and submit a compiled Pig job to a back-end. This API should enable the front-end to track the current status of a job, its progress and diagnostic information, and possibly to terminate it.

The sections below provide some initial suggestions for possible APIs for the Data Storage and Query Execution abstractions.

=== Back-End Configuration ===

This interface abstracts the management of configuration information for both the Data Storage and Query Execution portions of a back-end.

{{{
package org.apache.pig.backend;

import java.io.Serializable;
import java.util.Map;
import java.net.URI;

/**
 * Abstraction for a generic property object that can be
 * used to specify configuration information, stats...
 * Information is represented in the form of (key, value)
 * pairs.
 */
public interface PigBackEndProperties extends Serializable, Iterable {

    /**
     * Introduces a new (key, value) pair or updates the one already
     * associated with key.
     *
     * @param key - the key to insert/update
     * @param value - the value for the given key
     * @return - the old value associated with key, if it exists, null otherwise
     */
    public Object setValue(String key, Object value);

    /**
     * Given a resource, update configuration information.
     *
     * @param resource - resource from which property values are read.
     * @return the set of keys and corresponding values that have been updated.
     *         If resource contains/updates the same key multiple
     *         times, only the initial value of key is returned.
     */
    public Map addFromResource(URI resource);

    /**
     * Creates or updates (key, value) pairs with information
     * from other.
     *
     * @param other - source of properties
     * @return - keys that have been updated, if any, and the
     *           corresponding old values
     */
    public Map merge(PigBackEndProperties other);

    /**
     * Removes the (key, value) pair, if present.
     *
     * @param key - key to remove
     * @return - value of key, if key was present, null otherwise
     */
    public Object delete(String key);

    /**
     * Returns the value of a key.
     *
     * @param key - key to look up
     * @return value of key if present, null otherwise.
     */
    public Object getValue(String key);

    /**
     * @return number of (key, value) pairs stored
     */
    public long getCount();
}
}}}

=== Data Storage ===

This is a possible API for a generic interface that abstracts the actual details used to store/persist collections of objects.

{{{
package org.apache.pig.datastorage;

import org.apache.pig.backend.PigBackEndProperties;
import java.io.Serializable;
import java.util.Map;
import java.net.URI;

/**
 * Abstraction for a generic property object that can be
 * used to specify configuration information, stats...
 */
public interface DataStorageProperties extends PigBackEndProperties {
    ...
}
}}}

{{{
package org.apache.pig.datastorage;

public interface DataStorage {

    /**
     * Place holder for possible initialization activities.
     */
    public void init();

    /**
     * Clean-up and releasing of resources.
     */
    public void close();

    /**
     * Provides configuration information about the storage itself.
     * For instance global data-replication policies if any, default
     * values, ... Some of these values could be overridden at a finer
     * granularity (e.g. on a specific object in the Data Storage).
     *
     * @return - configuration information
     */
    public DataStorageProperties getConfiguration();

    /**
     * Provides a way to change configuration parameters
     * at the Data Storage level. For instance, change the
     * data replication policy.
     *
     * @param newConfiguration - the new configuration settings
     * @throws DataStorageConfigurationException when configuration
     *         conflicts are detected
     */
    public void updateConfiguration(DataStorageProperties newConfiguration)
        throws DataStorageConfigurationException;

    /**
     * Provides statistics on the Storage: capacity values, how much
     * storage is in use...
     *
     * @return statistics on the Data Storage
     */
    public DataStorageProperties getStatistics();

    /**
     * Creates an entity handle for an object (no containment
     * relation).
     *
     * @param name of the object
     * @return an object descriptor
     * @throws DataStorageException if name does not conform to the naming
     *         convention enforced by the Data Storage.
     */
    public DataStorageElementDescriptor asElement(String name)
        throws DataStorageException;

    /**
     * Creates an entity handle for a container.
     *
     * @param name of the container
     * @return a container descriptor
     * @throws DataStorageException if name does not conform to the naming
     *         convention enforced by the Data Storage.
     */
    public DataStorageContainerDescriptor asContainer(String name)
        throws DataStorageException;
}
}}}

=== Data Storage Descriptors ===

{{{
package org.apache.pig.datastorage;

public interface DataStorageElementDescriptor extends Comparable {

    /**
     * Opens a stream to which an entity can be written.
     *
     * @param configuration information at the object level
     * @return stream to write to
     * @throws DataStorageException
     */
    public DataStorageOutputStream create(DataStorageProperties configuration)
        throws DataStorageException;

    /**
     * Copy entity from an existing one, possibly residing in a
     * different Data Storage.
     *
     * @param dstName name of entity to create
     * @param dstConfiguration configuration for the new entity
     * @param removeSrc true if the source entity must be removed after
     *        it has been copied
     * @throws DataStorageException for instance, configuration
     *         information for the new entity is not compatible with
     *         configuration information at the Data Storage level,
     *         user does not have privileges to read from the source
     *         entity or write to the destination storage...
     */
    public void copy(DataStorageElementDescriptor dstName,
                     DataStorageProperties dstConfiguration,
                     boolean removeSrc)
        throws DataStorageException;

    /**
     * Opens the entity for reading.
     *
     * @return entity to read from
     * @throws DataStorageException e.g. entity does not exist...
     */
    public DataStorageInputStream open() throws DataStorageException;

    /**
     * Opens an element in the Data Storage with support for random access
     * (seek operations).
     *
     * @return a seekable input stream
     * @throws DataStorageException
     */
    public DataStorageSeekableInputStream sopen()
        throws DataStorageException;

    /**
     * Checks whether the entity exists or not.
     *
     * @return true if entity exists, false otherwise.
     */
    public boolean exists();

    /**
     * Changes the name of an entity in the Data Storage.
     *
     * @param newName new name of entity
     * @throws DataStorageException
     */
    public void rename(DataStorageElementDescriptor newName)
        throws DataStorageException;

    /**
     * Remove entity from the Data Storage.
     *
     * @throws DataStorageException
     */
    public void delete() throws DataStorageException;

    /**
     * Retrieve configuration information for entity.
     *
     * @return configuration
     */
    public DataStorageProperties getConfiguration();

    /**
     * Update configuration information for this entity.
     *
     * @param newConfig configuration
     * @throws DataStorageException
     */
    public void updateConfiguration(DataStorageProperties newConfig)
        throws DataStorageException;

    /**
     * List entity statistics.
     *
     * @return DataStorageProperties
     */
    public DataStorageProperties getStatistics();
}
}}}

{{{
package org.apache.pig.datastorage;

import org.apache.pig.datastorage.DataStorageElementDescriptor;

public interface DataStorageContainerDescriptor
    extends DataStorageElementDescriptor, Iterable {
}
}}}
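
To make the return-value conventions of the property interface more concrete, here is a minimal in-memory sketch of how setValue, merge, delete, getValue and getCount might behave. It is an illustration only, not part of the proposal: the class names InMemoryProperties and InMemoryPropertiesDemo, the sample keys ("replication", "blockSize", "owner"), and the choice to iterate over keys are all assumptions, and addFromResource is left out because the page does not define a resource format.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the property contract described above. */
final class InMemoryProperties implements Iterable<String> {
    private final Map<String, Object> values = new LinkedHashMap<>();

    /** Inserts or updates a pair; returns the previous value, or null if there was none. */
    public Object setValue(String key, Object value) {
        return values.put(key, value);
    }

    /** Copies every pair from other; returns the overwritten keys with their old values. */
    public Map<String, Object> merge(InMemoryProperties other) {
        Map<String, Object> replaced = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : other.values.entrySet()) {
            boolean existed = values.containsKey(e.getKey());
            Object old = values.put(e.getKey(), e.getValue());
            if (existed) {
                replaced.put(e.getKey(), old);
            }
        }
        return replaced;
    }

    /** Removes a pair; returns its value, or null if the key was absent. */
    public Object delete(String key) {
        return values.remove(key);
    }

    /** Returns the value stored for key, or null if the key is absent. */
    public Object getValue(String key) {
        return values.get(key);
    }

    /** Returns the number of pairs currently stored. */
    public long getCount() {
        return values.size();
    }

    /** Assumption: iteration is over keys; the page leaves the element type open. */
    @Override
    public Iterator<String> iterator() {
        return values.keySet().iterator();
    }
}

final class InMemoryPropertiesDemo {
    public static void main(String[] args) {
        // Storage-level defaults (sample keys are made up for this sketch).
        InMemoryProperties defaults = new InMemoryProperties();
        defaults.setValue("replication", 3);
        defaults.setValue("blockSize", 64);

        // Finer-grained settings that override or extend the defaults.
        InMemoryProperties overrides = new InMemoryProperties();
        overrides.setValue("replication", 2);
        overrides.setValue("owner", "pig");

        Map<String, Object> replaced = defaults.merge(overrides);
        System.out.println("replaced: " + replaced);
        System.out.println("count: " + defaults.getCount());
    }
}
```

Returning the old values from merge lets a caller tell which storage-level defaults were overridden, which is one plausible reading of "keys that have been updated, if any, and the corresponding old values".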