lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <>
Subject Re: How does lucene handle content-type
Date Thu, 05 Apr 2007 17:31:31 GMT
Lucene has no built-in recognition of anything. You have to parse
the header and index the relevant bits as you need to.

There are projects *based* upon lucene that do web crawls that
you might want to look into, Nutch comes to mind.


On 4/5/07, Developer Developer <> wrote:
> I am using WGET to download content from the www with ---save-header
> option.
> The save-header option saves the hppt header to the downloaded files.
> Does Lucene make use of content type  while indexing  or  I have to parse
> the header , determine the content-type and determine the right set of
> actions to do ?
> Thanks !

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message