Message-ID: <359a92830704051031u5133ad4ds3f3cce4a4634cd6f@mail.gmail.com>
Date: Thu, 5 Apr 2007 13:31:31 -0400
From: "Erick Erickson"
To: java-user@lucene.apache.org
Subject: Re: How does lucene handle content-type

Lucene has no built-in recognition of anything, content types included.
You have to parse the header yourself and index the relevant bits as you
need to.

There are projects *based* upon Lucene that do web crawls, which you might
want to look into; Nutch comes to mind.

Erick

On 4/5/07, Developer Developer wrote:
>
> I am using wget to download content from the web with the --save-headers
> option, which saves the HTTP header to the downloaded files.
> Does Lucene make use of the content type while indexing, or do I have to
> parse the header, determine the content type, and decide on the right set
> of actions myself?
>
> Thanks!
>
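
As a rough illustration of "parse the header and index the relevant bits", the Java sketch below splits the content of a file saved by wget with its headers into header fields and body, then picks a handling strategy from the Content-Type. The class name SavedResponse, the handler labels, and the short list of media types are hypothetical, chosen only for this example. Nothing here is part of Lucene, and the actual indexing step (handing the extracted text to Lucene) is left out.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical holder for a saved response: status line, headers, blank line, body. */
final class SavedResponse {
    final Map<String, String> headers = new LinkedHashMap<>(); // header names lower-cased
    final String body;

    private SavedResponse(String body) {
        this.body = body;
    }

    /** Splits the raw file content at the first blank line and parses the header lines. */
    static SavedResponse parse(String raw) {
        // Normalizes CRLF to LF throughout; acceptable for text bodies in this sketch.
        String normalized = raw.replace("\r\n", "\n");
        int split = normalized.indexOf("\n\n");
        String head = split < 0 ? normalized : normalized.substring(0, split);
        String body = split < 0 ? "" : normalized.substring(split + 2);

        SavedResponse response = new SavedResponse(body);
        String[] lines = head.split("\n");
        // lines[0] is the status line (e.g. "HTTP/1.1 200 OK"); header fields follow.
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                String name = lines[i].substring(0, colon).trim().toLowerCase(Locale.ROOT);
                response.headers.put(name, lines[i].substring(colon + 1).trim());
            }
        }
        return response;
    }

    /** Returns the media type without parameters, e.g. "text/html", or "" if absent. */
    String mediaType() {
        String value = headers.getOrDefault("content-type", "");
        int semicolon = value.indexOf(';');
        String type = semicolon < 0 ? value : value.substring(0, semicolon);
        return type.trim().toLowerCase(Locale.ROOT);
    }

    /** Hypothetical dispatch: how the body should be turned into indexable text. */
    String handler() {
        switch (mediaType()) {
            case "text/html":
                return "strip-markup";
            case "text/plain":
                return "index-as-is";
            default:
                return "skip";
        }
    }
}
```

Calling SavedResponse.parse on the file content and then handler() gives the application a place to decide what to do with each document before any text is passed on for indexing. The media type itself could also be stored alongside the text if searches need to filter on it.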