incubator-crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Crunch Experiences
Date Mon, 18 Jun 2012 05:01:14 GMT
Oh, and in reply to the question (many threads ago) about
globs/directories for input paths: yes, they work just fine.
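
For instance, something like this runs fine for me (a minimal sketch;
MyApp and the paths are made up):

    Pipeline pipeline = new MRPipeline(MyApp.class);
    // Both a glob and a plain directory are accepted as input paths.
    PCollection<String> lines = pipeline.readTextFile("/data/lineitems*");
    PCollection<String> dir = pipeline.readTextFile("/data/input-dir");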

On Sun, Jun 17, 2012 at 9:59 PM, Josh Wills <jwills@cloudera.com> wrote:
> Hey Stefan,
>
> Reply inlined.
>
> On Sat, Jun 16, 2012 at 6:03 AM,  <acki.ch@gmail.com> wrote:
>> Hey Josh,
>>
>> @TupleWritables: Yes, I saw that. I didn't think too much about the
>> performance implications, though. Writing all the class info is
>> unnecessary because it is statically known, and an identifier would
>> help with that. But using reflection costs much more CPU time than
>> avoiding it, as shown by my little experiment, and an identifier will
>> not help with that.
>>
>> Kryo uses a neat trick for cutting down serialization size. By forcing the
>> user to register all classes in well defined sequence, it can use the index
>> for a class in that sequence to describe it. Not as good as describing a
>> whole schema, but generally applicable for serialization.
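>>
>> The registration looks roughly like this (this is Kryo's real API; the
>> two registered classes are just examples):
>>
>>     Kryo kryo = new Kryo();
>>     // Registration order determines the small integer id each class
>>     // gets, so writer and reader must register the same classes in
>>     // the same order.
>>     kryo.register(IntWritable.class);
>>     kryo.register(Text.class);
>>     // On the wire each object is then prefixed by its small id
>>     // instead of its fully qualified class name.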
>>
>> Using Avro internally sounds like a good idea. I don't know it too well
>> though. So you would give up the TupleWritables completely, replace them
>> with AvroWritables, and use the mentioned commit to use Writables within
>> Avro? What do you mean by "I think there would be some cool stuff we could
>> do if we could assume how all shuffle serializations worked, ..."?
>> Well, the reflection call would stay, but that can be solved independently
>> I guess (for example by having an optional factory argument in
>> BytesToWritableMapFn).
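>>
>> Something along these lines is what I mean (purely hypothetical, not
>> existing Crunch code; the interface and the constructor argument are
>> made up):
>>
>>     // Caller-supplied factory, so the MapFn can create instances
>>     // without a reflective newInstance() on a class looked up by name.
>>     public interface WritableFactory<W extends Writable> {
>>       W newInstance();
>>     }
>>
>>     // e.g., new BytesToWritableMapFn<Text>(new WritableFactory<Text>() {
>>     //   public Text newInstance() { return new Text(); }
>>     // });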
>
> Re: cool stuff, I suspect that having Avro for all of the intermediate
> data transfer would dramatically simplify the implementation of MSCR
> fusion, which I've been putting off b/c having an abstraction that
> would handle fusion for both Writables and Avro makes my head hurt.
> I'm going to start a thread on crunch-dev@incubator.apache.org
> advocating for it.
>
>>
>> What still bothers me is that the partitioner for the join, which only
>> has to ensure that the bit indicating whether a record is left or right
>> is ignored, still has to do far too much work. If it could just get the
>> serialized bytes for the key, it could compute the hashcode over the
>> byte array directly, simply ignoring the last byte. The deserialization
>> there is unnecessary and is actually, at least in my profiling, what
>> seemed to hurt the performance of the join so badly. Maybe some more
>> benchmarking is needed for this, though.
>
> That doesn't sound right-- isn't the Partitioner call done at the
> end of the map task, before the data is ever serialized?
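>
> For reference, all the join should need is a partitioner that ignores
> the tag when it hashes the deserialized key-- a sketch (the class name
> and type parameters here are illustrative, not our actual code):
>
>     public class TagIgnoringPartitioner
>         extends Partitioner<Pair<Text, IntWritable>, Text> {
>       @Override
>       public int getPartition(Pair<Text, IntWritable> key, Text value,
>                               int numPartitions) {
>         // Hash only the real key; the tag (0 = left, 1 = right) must
>         // not influence which reducer a record is sent to.
>         return (key.first().hashCode() & Integer.MAX_VALUE) % numPartitions;
>       }
>     }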
>
>>
>> Guess I should contact Alexy Khrabrov then.
>>
>> Cheers,
>> Stefan
>>
>> PS: Sorry, replied directly instead to the mailing list.
>>
>> On Friday, June 15, 2012 at 17:37:06 UTC+2, Josh Wills wrote:
>>>
>>> On Fri, Jun 15, 2012 at 2:24 AM,  <acki.ch@gmail.com> wrote:
>>> > Hey Josh,
>>> >
>>> > I actually first found pangool and from there concluded that Crunch
>>> > is worth a try. So you are saying that the TupleWritables are in
>>> > general quite slow, and the performance of Crunch is in that case
>>> > not comparable to pure Hadoop? If it really makes a difference we
>>> > should look into generating TupleWritable subclasses for our project
>>> > (if there will be another round of benchmarks).
>>>
>>> The impl of TupleWritable in Crunch is too conservative, in that it
>>> writes the names of the Writable classes it serialized in the tuple
>>> alongside the actual data, which leads to a pretty massive blowup in
>>> the amount of data that gets passed around. My initial
>>> thought in doing this was that I wanted to be sure that the Crunch
>>> output was always readable by anything-- e.g., you didn't have to use
>>> Crunch in order to read the data back, since all of the information on
>>> what the data contained was there. This is essentially what Avro gets
>>> you, although the Avro file format is smarter about just putting the
>>> schema at the head of the file and not copying it over and over again,
>>> which is why we generally recommend Avro + Crunch. The #s in the
>>> pangool benchmark were all based on Avro.
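>>>
>>> To make that concrete, this is roughly what the Avro file format does
>>> (a sketch using Avro's generic API; the record schema is just an
>>> example):
>>>
>>>     Schema schema = new Schema.Parser().parse(
>>>         "{\"type\":\"record\",\"name\":\"Entry\",\"fields\":["
>>>         + "{\"name\":\"key\",\"type\":\"string\"},"
>>>         + "{\"name\":\"count\",\"type\":\"long\"}]}");
>>>     DataFileWriter<GenericRecord> writer =
>>>         new DataFileWriter<GenericRecord>(
>>>             new GenericDatumWriter<GenericRecord>(schema));
>>>     // The schema is written once, into the file header...
>>>     writer.create(schema, new File("entries.avro"));
>>>     // ...and every append() after that writes only the data bytes.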
>>>
>>> The tuple-oriented frameworks handle this by having integer
>>> identifiers for the different data types that are supported, and that's
>>> certainly one way to improve the performance here. I've also been
>>> tossing around the idea of using Avro for everything internal to the
>>> framework (e.g., any data transferred during the shuffle stage), since
>>> I just added a way for AvroTypes to support arbitrary writables:
>>>
>>>
>>> https://github.com/cloudera/crunch/commit/224102ac4813fc0e124114026438a2e3884f858b
>>>
>>> I think there would be some cool stuff we could do if we could assume
>>> how all shuffle serializations worked, but I haven't benchmarked it
>>> yet or tossed the idea around with the other committers.
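>>>
>>> As a sketch of the integer-identifier approach (hypothetical code, not
>>> what TupleWritable does today; it assumes a 'fields' array and a
>>> 'registry' map from class to small int):
>>>
>>>     // Write a small registry index per field instead of its class name.
>>>     public void write(DataOutput out) throws IOException {
>>>       for (Writable field : fields) {
>>>         WritableUtils.writeVInt(out, registry.get(field.getClass()));
>>>         field.write(out);
>>>       }
>>>     }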
>>>
>>> >
>>> > You didn't comment on the SourceTargetHelper. Do glob patterns now
>>> > work fine as inputs? It just failed in 0.2.4 when I specified
>>> > lineitems* as input in local mode (no hdfs), while it worked fine in
>>> > Scoobi.
>>>
>>> Sorry I missed that-- I was under the impression it did work, but I'll
>>> take a look at it and report back.
>>>
>>> >
>>> > Right now I just need to finish my thesis, so currently I am not
>>> > planning to
>>> > commit anywhere.
>>>
>>> It never hurts to ask. I saw your threads on the Spark and Scoobi
>>> mailing lists and enjoyed them-- please keep us posted on what you're
>>> up to. Alexy Khrabrov at Klout was also interested in an abstraction
>>> layer for Scala MapReduce vs. Spark-- did you talk to him?
>>>
>>> >
>>> > Cheers,
>>> > Stefan
>>> >
>>> > On Thursday, June 14, 2012 at 16:42:43 UTC+2, Josh Wills wrote:
>>> >>
>>> >> Hey Stefan,
>>> >>
>>> >> Thanks for your email. Re: join performance, I agree with you that the
>>> >> current implementation that uses Writables is pretty terrible, and
>>> >> I've been thinking about good ways to do away with it for a while now.
>>> >> The good news is that the Avro-based implementation of PTypes is
>>> >> pretty fast, and that's what most people end up using in practice. For
>>> >> example, when the Pangool guys were benchmarking frameworks, they used
>>> >> the Avro PTypes, and Crunch usually runs pretty close to native MR:
>>> >> http://pangool.net/benchmark.html
>>> >>
>>> >> Re: the gist, I will take a look at it. The good news about the pull
>>> >> request is that we just submitted Crunch to the Apache Incubator and
>>> >> are in the process of moving over all of the infrastructure. It would
>>> >> be great to have your patches in when we're done-- we're always on the
>>> >> lookout for people who are interested in becoming committers.
>>> >>
>>> >> Best,
>>> >> Josh
>>> >>
>>> >> On Thu, Jun 14, 2012 at 6:53 AM,  <acki.ch@gmail.com> wrote:
>>> >> > Hi,
>>> >> >
>>> >> > I did some more investigations. You set the partitioner correctly,
>>> >> > but you do not set the comparator. But actually the comparator
>>> >> > might not be needed, because the value with 0 will always come
>>> >> > first by the default comparator? If you rely on that, then maybe
>>> >> > you should put a comment there, as it is not immediately obvious.
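>>> >> >
>>> >> > If you do decide to set it explicitly, it is just the standard
>>> >> > Hadoop wiring (the comparator class names here are made up, just
>>> >> > to show where they would be plugged in):
>>> >> >
>>> >> >     // Order by the full tagged key, group ignoring the tag.
>>> >> >     job.setSortComparatorClass(TaggedKeyComparator.class);
>>> >> >     job.setGroupingComparatorClass(TaggedKeyGroupingComparator.class);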
>>> >> >
>>> >> > The performance problem with joins (as shown by profiling) is that
>>> >> > Hadoop does not know about the PTypes. So to deserialize during
>>> >> > sorting, the general TupleWritable is used, which is very
>>> >> > inefficient as it uses reflection. I added a cache for the
>>> >> > Class.forName call in readFields and it improves performance a lot,
>>> >> > but it is not a definitive answer to the problem. Maybe you could
>>> >> > add a special writable for TaggedKeys, which knows that one of its
>>> >> > components is an Integer. To define the partitioning, the actual
>>> >> > content does not matter, just that the tagging value is ignored.
>>> >> > Scoobi even generates classes for TaggedKeys and TaggedValues; that
>>> >> > would most likely be even faster.
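>>> >> >
>>> >> > The cache itself is trivial, roughly this (a sketch of what I did,
>>> >> > not the exact patch):
>>> >> >
>>> >> >     private static final Map<String, Class<?>> CLASS_CACHE =
>>> >> >         new ConcurrentHashMap<String, Class<?>>();
>>> >> >
>>> >> >     private static Class<?> cachedForName(String name)
>>> >> >         throws ClassNotFoundException {
>>> >> >       Class<?> c = CLASS_CACHE.get(name);
>>> >> >       if (c == null) {
>>> >> >         c = Class.forName(name);
>>> >> >         CLASS_CACHE.put(name, c);  // benign race, same value wins
>>> >> >       }
>>> >> >       return c;
>>> >> >     }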
>>> >> >
>>> >> > Here are some numbers for a 1GB join on my laptop; the second
>>> >> > column has the total amount of seconds.
>>> >> >
>>> >> > My version of a join, which tags the values by adding a boolean to
>>> >> > them. Note that in most cases, where the size of one group does not
>>> >> > come close to the available RAM, this is totally fine. Implemented
>>> >> > in [1].
>>> >> > crunch_4    33.42     13    2.06    26.82     6526368    86%
>>> >> > Crunch's original join:
>>> >> > crunch_4    150.11    13    2.79    143.50    7924496    97%
>>> >> > Crunch's join with my caching changes in TupleWritable:
>>> >> > crunch_4    69.67     13    2.51    59.99     7965808    89%
>>> >> >
>>> >> > Too lazy to open a pull request, here is a gist of my changes for
>>> >> > the join:
>>> >> > https://gist.github.com/2930414
>>> >> > These are for 0.2.4; I hope they are still applicable to trunk.
>>> >> >
>>> >> > Regards,
>>> >> > Stefan Ackermann
>>> >> >
>>> >> > [1]
>>> >> >
>>> >> >
>>> >> > https://github.com/Stivo/Distributed/blob/master/crunch/src/main/scala/ch/epfl/distributed/utils/CrunchUtils.scala#L83
>>> >> >
>>> >> > On Thursday, June 14, 2012 at 12:24:34 UTC+2, ack...@gmail.com wrote:
>>> >> >>
>>> >> >> Hi,
>>> >> >>
>>> >> >> I am writing a Distributed Collections DSL with compiler
>>> >> >> optimizations as my master's thesis. I have added Crunch as a
>>> >> >> backend, since we were not happy with the performance we were
>>> >> >> getting from our other backend for Hadoop. And I must say, I am
>>> >> >> quite impressed by Crunch's performance.
>>> >> >> Here is some sample code our DSL generates, in case you are
>>> >> >> interested:
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> https://github.com/Stivo/Distributed/blob/bigdata2012/crunch/src/main/scala/generated/v4/
>>> >> >>
>>> >> >> We are impressed with the performance, except for the joins. In
>>> >> >> local benchmarks, joins performed 100x worse than my own join
>>> >> >> implementation (no joke, 100x). Also it seems like the current
>>> >> >> implementation has some bugs. You are setting the partitioner
>>> >> >> class to the comparator in
>>> >> >>
>>> >> >> https://github.com/cloudera/crunch/blob/master/src/main/java/com/cloudera/crunch/lib/Join.java#L144
>>> >> >> . Also you are not setting the partitioner class. Seems to me the
>>> >> >> code is all there, it's just not linked up correctly.
>>> >> >> For the performance, maybe you should have different versions of
>>> >> >> the comparator / partitioner for different key types instead of
>>> >> >> writing the key just to compare it. String comparison, for
>>> >> >> example, only checks characters until it finds a difference.
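>>> >> >>
>>> >> >> For Text keys, for instance, a comparator can work directly on
>>> >> >> the serialized bytes. A sketch (this is essentially what Hadoop's
>>> >> >> own Text.Comparator already does):
>>> >> >>
>>> >> >>     public int compare(byte[] b1, int s1, int l1,
>>> >> >>                        byte[] b2, int s2, int l2) {
>>> >> >>       // Skip the vint length prefix, then compare the raw utf-8
>>> >> >>       // bytes, stopping at the first difference.
>>> >> >>       int n1 = WritableUtils.decodeVIntSize(b1[s1]);
>>> >> >>       int n2 = WritableUtils.decodeVIntSize(b2[s2]);
>>> >> >>       return WritableComparator.compareBytes(b1, s1 + n1, l1 - n1,
>>> >> >>                                              b2, s2 + n2, l2 - n2);
>>> >> >>     }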
>>> >> >>
>>> >> >> I encountered lots of problems with SourceTargetHelper in 0.2.4.
>>> >> >> I know you changed it since then, but I think I also had troubles
>>> >> >> with the new version. Does it support using glob patterns or
>>> >> >> directories as input? In any case, it should not prevent the
>>> >> >> program from running imho. I spent quite a while just trying to
>>> >> >> work around that bug.
>>> >> >>
>>> >> >> Regards,
>>> >> >> Stefan Ackermann
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >> --
>>> >> Director of Data Science
>>> >> Cloudera
>>> >> Twitter: @josh_wills
>>>
>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera
>>> Twitter: @josh_wills
>
>
>
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills



-- 
Director of Data Science
Cloudera
Twitter: @josh_wills
