hive-issues mailing list archives

From "anishek (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-16904) during repl load for large number of partitions the metadata file can be huge and can lead to out of memory
Date Wed, 30 Aug 2017 23:57:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-16904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148234#comment-16148234
] 

anishek commented on HIVE-16904:
--------------------------------

In internal runs, 10000 partitions with one file each produced a metadata file of about
16 MB. Extrapolating to allow for additional properties, files, etc., say 20 MB per 10000
partitions, the file for 1 million partitions would be about 2 GB.

Adding roughly another 50% for Java object overhead, we should still need only about 3 GB
of RAM to process this file, which does not seem too large.
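
The extrapolation above can be checked with a few lines of arithmetic. This is only a sketch: the class and method names are made up for illustration, linear growth with partition count is assumed, and the 20 MB per 10000 partitions and 50% overhead figures are the rough estimates from this comment, not measured constants.

```java
public class MetadataSizeEstimate {
  // Rough figures taken from the comment above; they are estimates, not measurements.
  public static final long MB_PER_BATCH = 20;              // padded estimate per batch
  public static final long PARTITIONS_PER_BATCH = 10_000;  // batch size used in the runs

  // Estimated _metadata file size in MB, assuming size grows linearly with partitions.
  public static long estimateFileMb(long partitions) {
    return partitions * MB_PER_BATCH / PARTITIONS_PER_BATCH;
  }

  // Estimated heap in MB after adding a percentage for Java object overhead.
  public static long estimateHeapMb(long fileMb, int overheadPercent) {
    return fileMb + fileMb * overheadPercent / 100;
  }

  public static void main(String[] args) {
    long fileMb = estimateFileMb(1_000_000);   // 2000 MB, i.e. about 2 GB
    long heapMb = estimateHeapMb(fileMb, 50);  // 3000 MB, i.e. about 3 GB
    System.out.println("file MB: " + fileMb + ", heap MB: " + heapMb);
  }
}
```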

So I am parking this for now and will come back to it later if there is still an issue.

Sample code for streaming the partitions out of the metadata file, should it be needed:

{code}

import org.apache.commons.io.FileUtils;
import org.apache.hadoop.hive.metastore.api.Partition;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TJSONProtocol;
import org.codehaus.jackson.JsonFactory;
import org.codehaus.jackson.JsonNode;
import org.codehaus.jackson.JsonParser;
import org.codehaus.jackson.JsonToken;
import org.codehaus.jackson.map.MappingJsonFactory;
import org.codehaus.jackson.map.ObjectMapper;
import org.json.JSONObject;
import org.junit.Test;

import java.io.File;
import java.io.IOException;

import static org.junit.Assert.fail;

public class StreamingJsonTests {

  @Test
  public void testStreaming() throws IOException, TException {
    TDeserializer deserializer = new TDeserializer(new TJSONProtocol.Factory());
    ObjectMapper mapper = new ObjectMapper();
    JsonFactory factory = new MappingJsonFactory();
    printMemory("before reading file to parser");
    JsonParser parser =
        factory.createJsonParser(new File("_metadata"));
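    if (parser.nextToken() != JsonToken.START_OBJECT) {
      fail("cannot parse the file: expected a JSON object at the top level");
    }
    // Advance through the top-level fields until the "partitions" field name is reached.
    for (JsonToken jsonToken = parser.nextToken();
         jsonToken != null && jsonToken != JsonToken.END_OBJECT;
         jsonToken = parser.nextToken()) {
      if (jsonToken == JsonToken.FIELD_NAME
          && "partitions".equalsIgnoreCase(parser.getCurrentName())) {
        break;
      }
      // Skip nested objects/arrays so an inner END_OBJECT does not end the loop early.
      parser.skipChildren();
    }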
    int count = 0;
    printMemory("after finding out the partitions object location");
    if (parser.nextToken() == JsonToken.START_ARRAY) {
      while (parser.nextToken() != JsonToken.END_ARRAY) {
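        // Read exactly one array element. This assumes each element is a JSON string
        // holding a Thrift-JSON-serialized Partition, so asText() returns that payload.
        JsonNode jsonNode = mapper.readTree(parser);
        Partition partition = new Partition();
        deserializer.deserialize(partition, jsonNode.asText(), "UTF-8");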
        count++;
      }
      System.out.println("number of partitions :" + count);
    } else {
      fail("no partitions array token");
    }
    parser.close();
  }

  @Test
  public void testRegular() throws IOException {
    printMemory("before starting");
    JSONObject jsonObject = new JSONObject(
        FileUtils.readFileToString(new File("_metadata")));
    printMemory("after reading the file");
    jsonObject.toString();
  }

  private void printMemory(String msg) {
    Runtime runtime = Runtime.getRuntime();
    runtime.gc();
    long usedMemory = runtime.totalMemory() - runtime.freeMemory();
    System.out.println(msg + " KB used : " + usedMemory / 1024);
  }

}

{code}

An additional problem to look at is the overhead that bootstrap creates on the namenode: every
partition will have its own directory hierarchy (for tables with multiple partition columns)
to store the {{_files}}.
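
To get a feel for that namenode overhead, the following sketch counts the distinct directories implied by a set of partition paths. It is illustrative only: the class name is hypothetical, and it assumes the dump mirrors the usual {{col=value/col=value}} partition path layout with one {{_files}} entry per leaf partition, which this comment does not spell out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PartitionDirCount {
  // Counts the distinct directories needed to hold the given partition paths,
  // including every intermediate level (e.g. "year=2017" for "year=2017/month=08").
  public static int countDirectories(List<String> partitionPaths) {
    Set<String> dirs = new HashSet<>();
    for (String path : partitionPaths) {
      StringBuilder prefix = new StringBuilder();
      for (String level : path.split("/")) {
        if (prefix.length() > 0) {
          prefix.append('/');
        }
        prefix.append(level);
        dirs.add(prefix.toString());
      }
    }
    return dirs.size();
  }

  public static void main(String[] args) {
    // Hypothetical partitions of a table partitioned by (year, month).
    List<String> paths = Arrays.asList(
        "year=2017/month=07", "year=2017/month=08", "year=2016/month=12");
    int dirs = countDirectories(paths);
    // Assumption: one _files entry per leaf partition.
    int files = paths.size();
    System.out.println("directories: " + dirs + ", _files entries: " + files);
  }
}
```

With a million partitions the file count alone is a million namenode objects, before counting the intermediate directories.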

> during repl load for large number of partitions the metadata file can be huge and can
lead to out of memory 
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-16904
>                 URL: https://issues.apache.org/jira/browse/HIVE-16904
>             Project: Hive
>          Issue Type: Sub-task
>          Components: HiveServer2
>    Affects Versions: 3.0.0
>            Reporter: anishek
>            Assignee: anishek
>             Fix For: 3.0.0
>
>
> The metadata pertaining to a table and its partitions is stored in a single file. During
repl load all the data is loaded into memory in one shot and then the individual partitions are
processed. This can lead to a huge memory overhead, as the entire file is read into memory. Try
to deserialize the partition objects with some sort of streaming JSON deserializer.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
