Date: Fri, 13 Nov 2020 01:50:00 +0000 (UTC)
From: "ASF GitHub Bot (Jira)"
To: issues@kylin.apache.org
Reply-To: dev@kylin.apache.org
Subject: [jira] [Commented] (KYLIN-4810) TrieDictionary is not correctly build

    [
https://issues.apache.org/jira/browse/KYLIN-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231081#comment-17231081 ]

ASF GitHub Bot commented on KYLIN-4810:
---------------------------------------

hit-lacus commented on pull request #1477:
URL: https://github.com/apache/kylin/pull/1477#issuecomment-726455006

   @bigxiaochu Could you please help to review this patch?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

> TrieDictionary is not correctly build
> -------------------------------------
>
>                 Key: KYLIN-4810
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4810
>             Project: Kylin
>          Issue Type: Bug
>          Components: Job Engine
>    Affects Versions: v2.3.2
>            Reporter: ShengJun Zheng
>            Priority: Critical
>             Fix For: Future
>
>
> Hi, recently I hit a problem in our production environment: segments failed to merge because the TrieDictionaryForest was out of order.
> {code:java}
> java.lang.IllegalStateException: Invalid input data.
> Unordered data cannot be split into multi trees
>     at org.apache.kylin.dict.TrieDictionaryForestBuilder.addValue(TrieDictionaryForestBuilder.java:92)
>     at org.apache.kylin.dict.TrieDictionaryForestBuilder.addValue(TrieDictionaryForestBuilder.java:78)
>     at org.apache.kylin.dict.DictionaryGenerator$StringTrieDictForestBuilder.addValue(DictionaryGenerator.java:214)
>     at org.apache.kylin.dict.DictionaryGenerator.buildDictionary(DictionaryGenerator.java:81)
>     at org.apache.kylin.dict.DictionaryGenerator.buildDictionary(DictionaryGenerator.java:65)
>     at org.apache.kylin.dict.DictionaryGenerator.mergeDictionaries(DictionaryGenerator.java:106)
> {code}
> After some analysis, we found that when a dict-encoded column contains large values, iterating over a single TrieDictionaryTree returns unordered data.
>
> Digging into the source code, the root cause is as follows:
> # Kylin splits a TrieTree node into two parts when a single node's value is longer than 255 bytes.
> # These two parts are then added as values to build the TrieTree. In fact, the two split parts should not be added to the TrieTree as new values.
> # Step 2 causes the TrieDictionaryTree to build more leaf nodes, and the extra leaf nodes become 'end-values' of the dictionary tree.
> # This has no impact on the correctness of the dict tree itself, apart from adding some extra nodes.
> # But if the split falls inside a UTF-8 word, you get unordered data when iterating over the tree (this has something to do with Java's UTF-8 String serialize/deserialize implementation; please refer to JDK sun.nio.cs.UTF_8.class).
> How to reproduce?
Run test code : > {code:java} > TrieDictionaryForestBuilder builder =3D new TrieDictionaryForestBuilder(n= ew StringBytesConverter()); > String longUrl =3D "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx= xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx= xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx= xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=E4=BD=A0=E5=A5=BD~~~"; > builder.addValue(longUrl); > TrieDictionaryForest dict =3D builder.build(); > TrieDictionaryForestBuilder mergeBuild =3D new TrieDictionaryForestBuilde= r(new StringBytesConverter()); > for (int i =3D dict.getMinId(); i <=3D dict.getMaxId(); i++) { > =C2=A0=C2=A0=C2=A0 String str =3D dict.getValueFromId(i); > =C2=A0=C2=A0=C2=A0 System.out.println("add value into merge tree"); > =C2=A0=C2=A0=C2=A0 mergeBuild.addValue(str); > } > The log output of this test code is: > add value into merge tree > add value into merge tree > 16:59:36 [main] INFO org.apache.kylin.dict.TrieDictionaryForestBuilder.ad= dValue(TrieDictionaryForestBuilder.java:127) values not in ascending order,= previous 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx= xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx= xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx= xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xEF\xBF\xBD', current 'xxxxxxxxxxx= xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx= xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx= xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx= xxxxxxxxxxxxxxxxxx\xE4\xBD\xA0\xE5\xA5\xBD~~~' > {code} > We can see from the test code's output=EF=BC=9A > # We only add 1 value but the tire dictionary tree turn out to have 2 en= d vlaues > # Iterating over the TrieDictionary Tree got unordered data > We address this problem by > # classify values which is a whole column value, which 
> is a split value,
> # not marking a split value as the end-value of a TrieTree node.
> I wonder if there is anything wrong with this approach, thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
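
The ordering problem in the quoted description can be seen without Kylin at all. The sketch below is only an illustration under stated assumptions: the class Utf8SplitDemo and its helpers are hypothetical names, not Kylin code, and a 2-byte cut stands in for the 255-byte node limit. It cuts a UTF-8 value in the middle of a multi-byte character, round-trips that piece through a Java String, and compares the result with the full value byte by byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: none of these names come from Kylin.
final class Utf8SplitDemo {

    /** Lexicographic comparison that treats each byte as unsigned (0..255). */
    public static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Keeps the first prefixLen UTF-8 bytes of s, decodes them to a String, and encodes again. */
    public static byte[] roundTripPrefix(String s, int prefixLen) {
        byte[] prefix = Arrays.copyOf(s.getBytes(StandardCharsets.UTF_8), prefixLen);
        // The dangling lead byte is malformed input, so the decoder substitutes U+FFFD.
        String decoded = new String(prefix, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    /** Formats bytes the same way the log in the report does, e.g. \xE4\xBD\xA0. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "x" followed by the two CJK characters from the report and "~~~".
        String value = "x\u4F60\u597D~~~";
        byte[] full = value.getBytes(StandardCharsets.UTF_8);

        // Cut after 2 bytes: 'x' plus only the first byte (0xE4) of a 3-byte character.
        byte[] rawPrefix = Arrays.copyOf(full, 2);
        byte[] roundTripped = roundTripPrefix(value, 2);

        System.out.println("full value:        " + hex(full));
        System.out.println("raw prefix:        " + hex(rawPrefix));
        System.out.println("round-tripped:     " + hex(roundTripped));
        // As raw bytes the prefix sorts before the full value.
        System.out.println("raw prefix < full: " + (compareUnsigned(rawPrefix, full) < 0));
        // After the String round trip it sorts after the full value, because 0xEF > 0xE4.
        System.out.println("round trip > full: " + (compareUnsigned(roundTripped, full) > 0));
    }
}
```

This matches the log in the report, where the previous value ends in \xEF\xBF\xBD (the UTF-8 encoding of U+FFFD) and the current value continues with \xE4\xBD\xA0. Whether Kylin's iteration goes through exactly this String round trip is an assumption based on the description, not something checked against the Kylin source.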