Message-ID: <4E279605.3090604@eecs.qmul.ac.uk>
Date: Thu, 21 Jul 2011 03:59:17 +0100
From: Raphael Osamede Omoregbee
To: general@lucene.apache.org
Subject: Re: Indexing with Lucene
References: <1311167841625-3185409.post@n3.nabble.com>

On 20/07/11 22:32, Simon Willnauer wrote:
> On Wed, Jul 20, 2011 at 3:17 PM, raphael812 wrote:
>> Hello everyone,
>>
>> I am quite new to lucene and i am using the book lucene in action to learn.
>> I need help in extracting the body content of a html page using tika. The
>> implementation from the book only extracts the html's metadata not the main
>> body content which i need. Is it possible to extract body content from htmls
>> and pdfs and how.
>> Thanks for you help.
> hey,
> this seems to be a tika / extraction specific question. you should
> try to ask this question on the tika list, I bet you get a quick
> response there!
>
> simon
>> Raphael
>>
>> --
>> View this message in context: http://lucene.472066.n3.nabble.com/Indexing-with-Lucene-tp3185409p3185409.html
>> Sent from the Lucene - General mailing list archive at Nabble.com.
>>

Hello all,

I tried searching through an index I created, but it gives me the following error in NetBeans 6.9:

Exception in thread "main" org.apache.lucene.index.CorruptIndexException: Unknown format version: -11
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:249)
    at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:73)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:677)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:202)
    at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:63)
    at Searcher.search(Searcher.java:66)
    at Searcher.main(Searcher.java:59)

The trouble is that I am able to search that same index from the command line. Does anyone have an idea why this is so? It was working some weeks ago in NetBeans, and now it throws this error.

Thanks for the help.
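
One possible way to narrow this down, under the unconfirmed assumption that NetBeans and the command line put different Lucene jars on the classpath, is to print where the Lucene classes are loaded from in each environment and compare. The sketch below is only an illustration: the class name WhichLucene and its two helper methods are made up for this example, and it uses nothing beyond the standard JDK, so it also runs when Lucene is absent.

```java
import java.security.CodeSource;
import java.security.ProtectionDomain;

// Hypothetical diagnostic helper, not part of Lucene or of the original Searcher code.
public class WhichLucene {

    // Returns the jar or directory a class was loaded from,
    // or a placeholder when the JVM does not report a location.
    public static String locationOf(Class<?> c) {
        ProtectionDomain pd = c.getProtectionDomain();
        CodeSource src = (pd == null) ? null : pd.getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return src.getLocation().toString();
    }

    // Describes where a class comes from and which version its jar manifest declares.
    // The version is null when the manifest carries no Implementation-Version entry.
    public static String describe(String className) {
        try {
            Class<?> c = Class.forName(className);
            Package p = c.getPackage();
            String version = (p == null) ? null : p.getImplementationVersion();
            return className + " from " + locationOf(c) + ", version " + version;
        } catch (ClassNotFoundException e) {
            return className + " not found on classpath";
        }
    }

    public static void main(String[] args) {
        // IndexReader appears in the stack trace above, so it is a reasonable class to probe.
        System.out.println(describe("org.apache.lucene.index.IndexReader"));
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
    }
}
```

Running this once from NetBeans and once from the command line, with the same classpath each one uses for Searcher, should show whether the two report different jar locations or versions. If they match, the assumption above does not hold and the cause lies elsewhere.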