hadoop-common-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Parts of a file as input
Date Tue, 27 Mar 2012 06:09:08 GMT
Franc,

With the given info, all we can say is that it is possible. We can't
say how, since we have no idea how your data, dimensions, etc. are
structured. Being a little more specific would help.

It is possible to select and pass only the right set of inputs per job,
and also to implement record readers that read only what is
specifically needed. This all depends on how your files are structured.
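
As a rough illustration of those two options (a sketch only, since the real file layout is unknown), the following plain Python mimics the idea outside of Hadoop. The `by_<dimension>` directory layout, the tab delimiter, and the names `select_inputs` and `read_fields` are all assumptions made for the example; none of this is Hadoop API. In an actual job, the first function corresponds to choosing which input paths are handed to the job, and the second to what a custom record reader would emit.

```python
from pathlib import PurePosixPath


def select_inputs(all_paths, dimension):
    """Keep only files stored under the (hypothetical) by_<dimension> directory."""
    wanted = f"by_{dimension}"
    return [p for p in all_paths if wanted in PurePosixPath(p).parts]


def read_fields(lines, wanted_columns, delimiter="\t"):
    """Yield only the requested column indices from each delimited record."""
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        yield tuple(fields[i] for i in wanted_columns)


if __name__ == "__main__":
    # Hypothetical file names, one directory per dimension.
    paths = ["/data/by_instrument/part-00000", "/data/by_date/part-00000"]
    print(select_inputs(paths, "date"))
    print(list(read_fields(["a\tb\tc\n"], [0, 2])))
```

Note that with row-oriented files the second approach still reads every byte of each record from disk and only discards the unwanted fields afterwards, so the saving depends heavily on how the files are laid out.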

Taking a wild guess, Apache Hive with its columnar storage (RCFile)
format may also be what you are looking for.
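
To show why a columnar layout helps here, below is a toy sketch of the general idea: rows are split into groups, and within each group the values are stored column by column, so a query that needs one column can skip the others. This is not RCFile's actual on-disk format or any Hive API; the names `to_row_groups` and `scan_column` and the in-memory lists are assumptions for illustration only.

```python
def to_row_groups(rows, group_size):
    """Split rows into groups and store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # zip(*chunk) transposes the rows of this group into columns.
        groups.append([list(col) for col in zip(*chunk)])
    return groups


def scan_column(groups, col_index):
    """Read a single column across all groups without touching the others."""
    for group in groups:
        yield from group[col_index]


if __name__ == "__main__":
    rows = [(1, "a"), (2, "b"), (3, "c")]
    groups = to_row_groups(rows, group_size=2)
    print(list(scan_column(groups, 1)))
```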

On Tue, Mar 27, 2012 at 11:32 AM, Franc Carter
<franc.carter@sirca.org.au> wrote:
> Hi,
>
> I'm very new to Hadoop and am working through how we may be able to apply
> it to our data set.
>
> One of the things that I am struggling with is understanding if it is
> possible to tell Hadoop that only parts of the input file will be
> needed for a specific job. The reason I believe I may need this is that we
> have two big dimensions in our data set. Queries may want only one of these
> dimensions, and while some unneeded reading is unavoidable, there are cases
> where reading the entire data set presents a very significant overhead.
>
> Or have I just misunderstood something ;-(
>
> thanks
>
> --
>
> *Franc Carter* | Systems architect | Sirca Ltd
>  <marc.zianideferranti@sirca.org.au>
>
> franc.carter@sirca.org.au | www.sirca.org.au
>
> Tel: +61 2 9236 9118
>
> Level 9, 80 Clarence St, Sydney NSW 2000
>
> PO Box H58, Australia Square, Sydney NSW 1215



-- 
Harsh J
