Subject: Re: Storing a lot of strings in TDB store
To: users@jena.apache.org
From: Andy Seaborne
Date: Thu, 7 Mar 2019 13:05:29 +0000
Message-ID: <7fce9c1d-e0fa-b148-f19a-2c6e6efb5006@apache.org>

At the level of that description, they are much the same.

TDB2 differs in the actual inline encoding of literals (it keeps the datatype).

TDB2 B+Trees are "copy-on-write" (MVCC), and TDB2 has a different transaction mechanism, so arbitrarily large transactions are supported.

The TDB2 bulk loader is much faster (although it could be backported to TDB1; it is not fundamental to the TDB2 disk layout).

    Andy

On 06/03/2019 12:38, Siddhesh Rane wrote:
> It's for TDB 1, right? Is there a document for TDB 2? I couldn't find one.
>
> Regards
> Siddhesh
>
>
> On Fri, 22 Feb 2019, 8:48 pm Rob Vesse wrote:
>
>> It's here - http://jena.apache.org/documentation/tdb/architecture.html
>>
>> Rob
>>
>> On 22/02/2019, 04:03, "Ekaterina Danilova" wrote:
>>
>>     Thank you, it was exactly what I needed. It is still nice to hear what
>>     others think about my idea of data storage as resources and I think I
>>     will stick to that option, but TDB storage logic was quite unclear to
>>     me. Would be great if it was mentioned in official documentation since
>>     I couldn't find it.
>>     Thanks again for your help
>>
>>     On Tue, 19 Feb 2019 at 20:40, Rob Vesse wrote:
>>
>>     > Since I don't think anyone answered your specific original question
>>     >
>>     > TDB and TDB2 both use dictionary encoding (and in fact most RDF
>>     > stores use some variation on this). Basically they map each unique
>>     > RDF term (whether URI, string, blank node etc.) to a consistent
>>     > internal identifier and use this to refer to the term. Therefore
>>     > most data structures internally are implemented in terms of these
>>     > internal identifiers (which are typically very compact; TDB/TDB2 use
>>     > 64-bit identifiers) and the system only translates between the
>>     > internal identifier and the full RDF term when explicitly needed,
>>     > e.g. when presenting results.
>>     >
>>     > Rob
>>     >
>>     > On 15/02/2019, 06:03, "Ekaterina Danilova"
>>     > <katja.danilova94@gmail.com> wrote:
>>     >
>>     >     I would like to ask how TDB2 and Fuseki manage big amounts of
>>     >     string data (especially repeating data) and what the best
>>     >     practices are. Does it optimize it somehow?
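
The dictionary encoding Rob describes in the quoted thread can be illustrated with a small sketch. This is a toy in-memory model in Python, not TDB's code: the names NodeTable, TripleStore, encode, decode, lookup and find_by_object are all made up for illustration, and a real store keeps these structures on disk with fixed-width 64-bit ids. It only shows why a repeated string costs one dictionary entry however many triples use it.

```python
# Toy dictionary encoding. Hypothetical names throughout; this is NOT how
# TDB/TDB2 are implemented, only an illustration of the idea in the thread.


class NodeTable:
    """Maps each distinct RDF term to a small integer id and back."""

    def __init__(self):
        self._id_by_term = {}  # term -> id
        self._term_by_id = []  # id -> term (the list index is the id)

    def __len__(self):
        return len(self._term_by_id)

    def lookup(self, term):
        """Return the id of a known term, or None. Never allocates."""
        return self._id_by_term.get(term)

    def encode(self, term):
        """Return the id for term, allocating a new id on first sight."""
        node_id = self._id_by_term.get(term)
        if node_id is None:
            node_id = len(self._term_by_id)
            self._id_by_term[term] = node_id
            self._term_by_id.append(term)
        return node_id

    def decode(self, node_id):
        """Translate an id back to the full term."""
        return self._term_by_id[node_id]


class TripleStore:
    """Stores triples as tuples of ids; terms live only in the node table."""

    def __init__(self):
        self.nodes = NodeTable()
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add(tuple(self.nodes.encode(t) for t in (s, p, o)))

    def find_by_object(self, o):
        """Match on ids, and decode to full terms only for the results."""
        o_id = self.nodes.lookup(o)
        if o_id is None:
            return []  # unknown term, so nothing can match
        return sorted(
            tuple(self.nodes.decode(i) for i in t)
            for t in self.triples
            if t[2] == o_id
        )


if __name__ == "__main__":
    store = TripleStore()
    text = '"a long string that many resources share"'
    store.add("<http://example.org/a>", "<http://example.org/label>", text)
    store.add("<http://example.org/b>", "<http://example.org/label>", text)
    # Two triples, but the shared string and predicate are stored once each.
    print(len(store.triples), "triples,", len(store.nodes), "distinct terms")
```

In this model the second triple adds only one new term (the subject); the predicate and the long string are reused by id, which is the effect the original question was asking about.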