incubator-crunch-dev mailing list archives

From Dave Beech <d...@paraliatech.com>
Subject Re: From.formattedFile - questions
Date Tue, 18 Dec 2012 17:45:49 GMT
Thanks Josh - yes, I see what you mean: the ambiguity is a problem. For my
current use case I want to ignore the key (as with TextInputFormat), but I
can think of times when I'd want to ignore the value too. I'll use the
keys() and values() methods instead.
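
A minimal sketch of the ambiguity and of the keys()/values() projection, in
plain Java so it runs without any Crunch or Hadoop dependency. TableSketch,
textStyle() and avroStyle() are hypothetical stand-ins invented for
illustration; they are not Crunch's PTable or From APIs, and the sample
records are made up.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a key/value table; NOT Crunch's PTable.
final class TableSketch<K, V> {
    private final List<Map.Entry<K, V>> records = new ArrayList<>();

    TableSketch<K, V> add(K key, V value) {
        records.add(new SimpleEntry<>(key, value));
        return this;
    }

    // Plays the role of keys() in the thread: keep each key, drop the value.
    List<K> keys() {
        List<K> out = new ArrayList<>();
        for (Map.Entry<K, V> r : records) {
            out.add(r.getKey());
        }
        return out;
    }

    // Plays the role of values() in the thread: keep each value, drop the key.
    List<V> values() {
        List<V> out = new ArrayList<>();
        for (Map.Entry<K, V> r : records) {
            out.add(r.getValue());
        }
        return out;
    }

    // Text-style input: the key is a byte offset, the value is the line.
    // Here the key is the field a caller would normally discard.
    static TableSketch<Long, String> textStyle() {
        return new TableSketch<Long, String>()
                .add(0L, "first line")
                .add(11L, "second line");
    }

    // Avro-style input: the key carries the record, the value is a placeholder.
    // Here the value is the field a caller would normally discard.
    static TableSketch<String, Void> avroStyle() {
        return new TableSketch<String, Void>()
                .add("record-1", null)
                .add("record-2", null);
    }
}
```

With these stand-ins, TableSketch.textStyle().values() keeps the lines and
TableSketch.avroStyle().keys() keeps the records, so the caller states which
field to drop instead of the source guessing.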

Dave


On 18 December 2012 17:38, Josh Wills <jwills@cloudera.com> wrote:

> Hey Dave,
>
> Replies inlined.
>
>
> On Tue, Dec 18, 2012 at 9:27 AM, Dave Beech <dave@paraliatech.com> wrote:
>
> > Hi devs,
> >
> > I'm looking at the static factory methods on the From class, which produce
> > Source or TableSource objects to form input to a Crunch job.
> >
> > Couple of questions:
> > - Is there a reason why there are only TableSource versions of the
> > formattedFile methods which can take a custom input format? I'd find a
> > Source version of these which ignores the key quite useful. (I've already
> > knocked up a quick patch, but I just wanted to sound you out before I went
> > ahead and tidied it up properly or created a JIRA for it.)
> >
>
> I think it was the ambiguity about which of the two fields (key or value)
> should be ignored-- with SequenceFiles, it's usually the key that is
> irrelevant, but with Avro files, it's the value that is ignored. My feeling
> was that it's easy to convert to a PCollection<K> or PCollection<V> from
> PTable<K, V> via the keys() and values() methods on PTable, and of course
> you can create your own Sources whenever it's useful.
>
>
> >
> > - How come FileSourceImpl is abstract but FileTableSourceImpl is not? In my
> > patch mentioned above, I've had to remove abstract from FileSourceImpl, so
> > I'm keen to know if that would break anything.
> >
>
> That's probably just an artifact of an earlier revision-- I don't see where
> it would have to be abstract based on the current impl.
>
>
> >
> > Cheers,
> > Dave
> >
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
