Subject: Re: Word / PDF document snippet rendering in search
To: solr-user@lucene.apache.org
From: Charlie Hull
Date: Fri, 2 Mar 2018 10:05:58 +0000

On 02/03/2018 00:15, T Wild wrote:
> I'm interested in building a software system which will connect to various
> document sources, extract the content from the documents contained within
> each source, and make the extracted content available to a search engine
> such as Solr. This search engine will serve as the back-end for a web-based
> search application.

This is basically an 'enterprise search' system. You use 'connectors' to
get text out of the source documents - in Solr applications we often use
Apache Tika to extract text from common formats like Office or PDF.
Apache ManifoldCF is another useful project for connecting to
repositories.

>
> I'm interested in rendering snippets of these documents in the search
> results for well-known types, such as Microsoft Word and PDF. How would one
> go about implementing document snippet rendering in search?

If you just want the snippets as text, you can use Solr highlighters,
which can provide contextual snippets (i.e. chunks of text around the
query matches).

>
> I'd be happy with serving up these snippets in any format, including as
> images. I just want to be able to give my users some kind of formatted
> preview of their results for well-known types.

If, however, you want to show bits of the original documents, that's
more difficult.
You'll need to store a reference to the original document in Solr and
use an external system to display it. Different doc types need
different systems: PDFs, for example, can be shown in various browser
plugins. Another approach is illustrated in this open source code we
wrote a while ago - it uses OpenOffice in 'headless' mode to provide
images of the source document:
https://github.com/flaxsearch/flaxcode/tree/master/flax_basic/libs/previewgen

Hope this helps!

Cheers

Charlie
>
> Thank you!
>

-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
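
For the text-only case mentioned above, here is a rough sketch of what a
highlighting request and its response handling could look like. It is
only an illustration under assumptions: the core name "documents", the
field name "content" and the helper function names are all hypothetical,
and the snippet count and fragment size are arbitrary. The hl, hl.fl,
hl.snippets and hl.fragsize parameters are standard Solr highlighting
parameters, and Solr returns the snippets in a "highlighting" section
keyed by document id.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical values for illustration: adjust to your own Solr setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/documents/select"
TEXT_FIELD = "content"  # assumed field holding the text extracted by Tika


def build_highlight_url(query, field=TEXT_FIELD, snippets=3, fragsize=150):
    """Build a Solr select URL that also asks for highlighted snippets."""
    params = {
        "q": query,
        "df": field,              # default search field
        "hl": "true",             # switch highlighting on
        "hl.fl": field,           # field(s) to take snippets from
        "hl.snippets": snippets,  # max snippets per field per document
        "hl.fragsize": fragsize,  # approximate snippet length in characters
        "wt": "json",
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


def extract_snippets(response, field=TEXT_FIELD):
    """Map each document id to its list of snippets for the given field."""
    highlighting = response.get("highlighting", {})
    return {doc_id: fields.get(field, [])
            for doc_id, fields in highlighting.items()}


def fetch_snippets(query):
    """Run the query against a live Solr instance (needs a running server)."""
    with urlopen(build_highlight_url(query)) as resp:
        return extract_snippets(json.load(resp))
```

Calling fetch_snippets("annual report") against a running Solr would
return a dict of document ids to text fragments, which the web front end
can render under each result. The field has to be stored (or otherwise
available to the highlighter) for snippets to come back.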