From: Todd Lipcon
Date: Tue, 8 May 2018 09:09:55 -0700
Subject: Re: Column Compression and Encoding
To: user@kudu.apache.org

Hi Saeid,

We've tried to make the default compression/encoding a reasonable tradeoff of performance for most common workloads. A couple quick tips I've found from my experiments:

- high-cardinality strings won't be automatically compressed by dictionaries. So, if you have such a large string that might have repeated substrings (eg a set of URLs) then enabling LZ4 compression is a good idea.
- if you have strings with a lot of common prefixes, you might consider PREFIX_ENCODING
- for integer types, choose the smallest size that fits your intended range. eg don't use int64 for storing a customer's age. On disk it will compress to about the same size, but in memory it will use a lot more space with the larger type.

Perhaps others can jump in with further recommendations based on experience.

-Todd

On Mon, May 7, 2018 at 1:45 AM, Saeid Sattari <saeid.sattari@gmail.com> wrote:

> Hi all,
>
> Folks who have used the column compression and encoding in Kudu tables:
> can you share your experiences with the performance? What type of fields
> are worse/better (IO bottleneck vs query return time,..) to compress. We
> can collect a knowledge base regarding these subjects that users can use in
> the future. Thanks.
>
> Regards,

--
Todd Lipcon
Software Engineer, Cloudera
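
The tips about repetitive strings and integer widths can be tried out with a rough Python sketch. This is not Kudu code and does not call any Kudu API: zlib stands in for LZ4 (the standard library has no LZ4 binding), the `array` module stands in for a fixed-width column, and the sample data, the example.com URLs, and every function name below are made up for illustration. It only shows the general shape of the argument: widening an integer column multiplies its uncompressed size, while the compressed size grows by much less, and URL-like strings with repeated substrings shrink considerably under a general-purpose compressor.

```python
import random
import zlib
from array import array


def integer_footprint(values, typecode):
    """Return (uncompressed bytes, zlib-compressed bytes) for a fixed-width column.

    typecode 'b' is a 1-byte signed integer, 'q' an 8-byte signed integer.
    """
    raw = array(typecode, values).tobytes()
    return len(raw), len(zlib.compress(raw))


def url_footprint(urls):
    """Return (raw bytes, zlib-compressed bytes) for newline-joined URLs."""
    raw = "\n".join(urls).encode("utf-8")
    return len(raw), len(zlib.compress(raw))


def sample_ages(count=10_000, seed=0):
    """Hypothetical customer ages in the range 0..99."""
    rng = random.Random(seed)
    return [rng.randrange(0, 100) for _ in range(count)]


def sample_urls(count=2_000):
    """Hypothetical high-cardinality URLs that share long substrings."""
    return [
        f"https://example.com/products/category/{i % 50}/item/{i}"
        for i in range(count)
    ]


if __name__ == "__main__":
    ages = sample_ages()
    for name, code in (("int8", "b"), ("int64", "q")):
        raw_size, packed_size = integer_footprint(ages, code)
        print(f"{name}: {raw_size} bytes uncompressed, {packed_size} bytes compressed")

    raw_size, packed_size = url_footprint(sample_urls())
    print(f"urls: {raw_size} bytes raw, {packed_size} bytes compressed")
```

The exact numbers depend on the compressor and the data, so treat the output as a qualitative check rather than a prediction of what Kudu will store.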