From dev-return-59912-archive-asf-public=cust-asf.ponee.io@pig.apache.org Tue Oct 2 19:24:05 2018
Date: Tue, 2 Oct 2018 17:24:00 +0000 (UTC)
From: "Satish Subhashrao Saley (JIRA)"
To: pig-dev@hadoop.apache.org
Reply-To: dev@pig.apache.org
Subject: [jira] [Commented] (PIG-5359) Reduce time spent in split serialization
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

    [ https://issues.apache.org/jira/browse/PIG-5359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16635847#comment-16635847 ]

Satish Subhashrao Saley commented on PIG-5359:
----------------------------------------------

Updated the patch on the review board.

> Reduce time spent in split serialization
> ----------------------------------------
>
>                 Key: PIG-5359
>                 URL: https://issues.apache.org/jira/browse/PIG-5359
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Satish Subhashrao Saley
>            Assignee: Satish Subhashrao Saley
>            Priority: Major
>
> 1. Unnecessary serialization of splits in Tez.
> In LoaderProcessor, Pig calls
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/LoaderProcessor.java#L172]
> {code:java}
> tezOp.getLoaderInfo().setInputSplitInfo(MRInputHelpers.generateInputSplitsToMem(conf, false, 0));
> {code}
> This ends up serializing the splits just to print a log line.
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/MRInputHelpers.java#L317]
> {code:java}
>   public static InputSplitInfoMem generateInputSplitsToMem(Configuration conf,
>       boolean groupSplits, boolean sortSplits, int targetTasks)
>       throws IOException, ClassNotFoundException, InterruptedException {
>     ....
>     ....
>     LOG.info("NumSplits: " + splitInfoMem.getNumTasks() + ", SerializedSize: "
>         + splitInfoMem.getSplitsProto().getSerializedSize());
>     return splitInfoMem;
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/InputSplitInfoMem.java#L106]
> {code:java}
>   public MRSplitsProto getSplitsProto() {
>     if (isNewSplit) {
>       try {
>         return createSplitsProto(newFormatSplits, new SerializationFactory(conf));
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/InputSplitInfoMem.java#L152-L170]
> {code:java}
>   private static MRSplitsProto createSplitsProto(
>       org.apache.hadoop.mapreduce.InputSplit[] newSplits,
>       SerializationFactory serializationFactory) throws IOException,
>       InterruptedException {
>     MRSplitsProto.Builder splitsBuilder = MRSplitsProto.newBuilder();
>     for (org.apache.hadoop.mapreduce.InputSplit newSplit : newSplits) {
>       splitsBuilder.addSplits(MRInputHelpers.createSplitProto(newSplit, serializationFactory));
>     }
>     return splitsBuilder.build();
>   }
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/MRInputHelpers.java#L221-L259]
> 2. In TezDagBuilder, if splitsSerializedSize > spillThreshold, the InputSplits already serialized into MRSplitsProto are not used by Pig; they are serialized again, directly to disk, via JobSplitWriter.createSplitFiles. The InputSplit serialization logic therefore runs twice, which is wasteful and expensive in cases like HCat.
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java#L946-L947]
> {code:java}
> MRSplitsProto splitsProto = inputSplitInfo.getSplitsProto();
> int splitsSerializedSize = splitsProto.getSerializedSize();
> {code}
> getSplitsProto() builds an MRSplitsProto, which consists of a list of MRSplitProto entries; each MRSplitProto holds the serialized bytes of one InputSplit. If splitsSerializedSize > spillThreshold, Pig writes the splits to disk via
> {code:java}
> if(splitsSerializedSize > spillThreshold) {
>     inputPayLoad.setBoolean(
>             org.apache.tez.mapreduce.hadoop.MRJobConfig.MR_TEZ_SPLITS_VIA_EVENTS,
>             false);
>     // Write splits to disk
>     Path inputSplitsDir = FileLocalizer.getTemporaryPath(pc);
>     log.info("Writing input splits to " + inputSplitsDir
>             + " for vertex " + vertex.getName()
>             + " as the serialized size in memory is "
>             + splitsSerializedSize + ". Configured "
>             + PigConfiguration.PIG_TEZ_INPUT_SPLITS_MEM_THRESHOLD
>             + " is " + spillThreshold);
>     inputSplitInfo = MRToTezHelper.writeInputSplitInfoToDisk(
>             (InputSplitInfoMem)inputSplitInfo, inputSplitsDir, payloadConf, fs);
> {code}
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java#L960]
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/util/MRToTezHelper.java#L302-L314]
> Solution:
> 1. Do not serialize the splits in LoaderProcessor.java.
> 2. In TezDagBuilder.java, serialize each input split once, keep a running total of the serialized sizes, and if the total exceeds spillThreshold, write the splits to disk reusing the already-serialized buffer of each split (see the sketch below).
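> A rough sketch of what the incremental approach in TezDagBuilder could look like (not the actual patch): MRSplitsProto, MRSplitProto, MRInputHelpers.createSplitProto and SerializationFactory are the Tez/Hadoop APIs quoted above, while conf, newSplits and spillThreshold stand in for whatever the surrounding TezDagBuilder code provides, and the spill-to-disk step is only indicated by comments.
> {code:java}
> MRSplitsProto.Builder splitsBuilder = MRSplitsProto.newBuilder();
> SerializationFactory serializationFactory = new SerializationFactory(conf);
> long serializedSizeSoFar = 0;
> for (org.apache.hadoop.mapreduce.InputSplit split : newSplits) {
>   // Serialize each split exactly once and keep the resulting proto around
>   MRSplitProto splitProto = MRInputHelpers.createSplitProto(split, serializationFactory);
>   splitsBuilder.addSplits(splitProto);
>   serializedSizeSoFar += splitProto.getSerializedSize();
> }
> if (serializedSizeSoFar > spillThreshold) {
>   // Write the splits to disk, reusing the MRSplitProto buffers built above
>   // instead of running the InputSplit serialization logic a second time.
> } else {
>   // Keep the in-memory path: send the splits via events as before
>   MRSplitsProto splitsProto = splitsBuilder.build();
> }
> {code}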
Configured " > + PigConfiguration.PIG_TEZ_INPUT_SPLITS_MEM_THRESHOLD > + " is " + spillThreshold); > inputSplitInfo =3D MRToTezHelper.writeInputSplitInfoToDisk( > (InputSplitInfoMem)inputSplitInfo, inputSplitsDir, payloadCon= f, fs); > {code} > [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hado= op/executionengine/tez/TezDagBuilder.java#L960] > [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/had= oop/executionengine/tez/util/MRToTezHelper.java#L302-L314] > Solution: > 1. Do not serialize the split in LoaderProcessor.java > 2. In TezDagBuilder.java, serialize each input split and keep adding its= size and if it exceeds spillThreshold, then write the splits to disk reusi= ng the serialized buffers for each split. > =C2=A0 > Thank you [~rohini] for identifying the issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005)