From: Armins Stepanjans
Date: Tue, 23 Jan 2018 03:27:13 +0000
Subject: Format of Wikipedia Index
To: java-user@lucene.apache.org

Hi,

I have a question regarding the format of the index created by DocMaker from EnWikiContentSource.

After creating the index from the dump of all Wikipedia articles (https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2), I'm having trouble understanding the format of the documents created, because when I get a document from the index, its only field is docid.

Is this an indicator of incorrect indexing? If not, how should I use the index to search for occurrences of a term within an article? (I was imagining doing a boolean query, with one sub-query being the article's name and the other the term I'm searching for within the article.)

Regards,
Armīns
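
To make the query idea above concrete, here is a minimal plain-Java sketch of a two-clause boolean query where both clauses are required. It is a toy in-memory inverted index, not Lucene code: the class name TitleAndTermSearch, the field names "title" and "body", and the sample articles are all assumptions made for illustration. In Lucene itself the corresponding construct would presumably be a BooleanQuery with two MUST clauses, against whatever field names the index actually uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical toy index; it only illustrates "title clause AND body-term clause".
public class TitleAndTermSearch {
    // field name -> term -> sorted ids of the articles containing that term
    private final Map<String, Map<String, TreeSet<Integer>>> postings = new HashMap<>();

    // Index one article: the title is kept as a single exact-match term,
    // the body is lowercased and split on anything that is not a letter or digit.
    public void add(int docid, String title, String body) {
        post("title", title, docid);
        for (String token : body.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                post("body", token, docid);
            }
        }
    }

    private void post(String field, String term, int docid) {
        postings.computeIfAbsent(field, f -> new HashMap<>())
                .computeIfAbsent(term, t -> new TreeSet<>())
                .add(docid);
    }

    // Returns the ids posted for (field, term), or an empty set if there are none.
    private TreeSet<Integer> lookup(String field, String term) {
        Map<String, TreeSet<Integer>> byTerm = postings.get(field);
        if (byTerm == null) {
            return new TreeSet<Integer>();
        }
        TreeSet<Integer> ids = byTerm.get(term);
        return ids == null ? new TreeSet<Integer>() : ids;
    }

    // Both sub-queries are required, so the result is the intersection
    // of the two posting sets.
    public List<Integer> search(String title, String term) {
        TreeSet<Integer> hits = new TreeSet<>(lookup("title", title));
        hits.retainAll(lookup("body", term.toLowerCase(Locale.ROOT)));
        return new ArrayList<>(hits);
    }

    // Made-up sample data, not taken from the Wikipedia dump.
    public static TitleAndTermSearch sample() {
        TitleAndTermSearch index = new TitleAndTermSearch();
        index.add(0, "Apache Lucene", "Lucene is a search library that builds an inverted index.");
        index.add(1, "Wikipedia", "Wikipedia is a free encyclopedia. Its dumps can be fed to an index.");
        index.add(2, "Apache Solr", "Solr is a search server built on Lucene.");
        return index;
    }

    public static void main(String[] args) {
        // Articles titled "Apache Lucene" whose body mentions "index".
        System.out.println(sample().search("Apache Lucene", "index")); // prints [0]
    }
}
```

The title clause narrows the search to one article and the body clause checks for the term, so an empty result means the term does not occur in that article (or the title did not match exactly).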