lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Andrew C. Oliver" <acoli...@apache.org>
Subject Re: Proposal for Lucene
Date Sun, 24 Feb 2002 16:30:44 GMT
On Thu, 2002-02-07 at 16:39, Dmitry Serebrennikov wrote:
> I'd like to add my +1 to the proposal and my +1 to keeping the Lucene as 
> a library that can exist separately from the applications. Perhaps the 
> applications should be separate targets in the Lucene project (and build 
> process) or perhaps they can be separate projects. I think keeping them 
> together would be good because Lucene's APIs may need to evolve to 
> support these applications better and because this will help ensure that 
> changes to Lucene API are reflected in the applications as soon as they 
> are made and not with a lag that can come about if the applications are 
> treated as separate, dependent projects.
> 
> See below for some additional ideas for the crawler.
> 
> Mark Tucker wrote:
> 
> >I like what you included in your proposal and suggest doing all that (over time)
and taking the following into consideration:
> >
> >Indexers/Crawlers
> >
> >	General Settings
> >		SleeptimeBetweenCalls - can be used to avoid flooding a machine with too many requests
> >		IndexerTimeout - kill this crawler thread after long period of inactivity
> >		IncludeFilter - include only items matching filter
> >		ExcludeFilter - exclude items matching filter (can be used with IncludeFilter)
> >
> I'm working on a crawler right now actually, but it is a derivative of 
> WebSPHINX. The original WebSPHINX has not changed since a very long time 
> ago, but it is licensed under LGPL at the moment. Perhaps we can get 
> permission from the copyright holders to transfer it to APL (or do we 
> even need to?). I made a number of bug fixes to it, added support for 
> cookies (rudimentary) and support for HTTP redirects. One thing that I 
> like in WebSPHINX is that it has a forgiving HTML parser that can deal 
> with many kinds of broken HTML. Also, it has a very interesting 
> framework for analyzing parsed content, but this goes beyound the 
> requirements for use with Lucene.
> 

I'm pretty sure they'd have to make it APL for us to collaborate
significantly.

> I use the crawler with Lucene, but there is a layer of application 
> classes between the two, so the kind of integration that has been 
> proposed here has not yet been done. Anyway, I found that in addition to 
> the Include and Exclude filters, it is helpful to be able to say that 
> you want some page "expanded" (i.e. parsed and links followed), but not 
> "indexed" (i.e. added to Lucene's index). And vice versa, it seems 
> useful to index a page but not expand it, somethimes. Also, filters can 
> be evaluated on links before they are followed, and then the second time 
> on final URLs of pages retrieved. Normally the two are the same, but 
> HTTP redirects can force the final URL to be something very different 
> from the original link.
> 

Ahh... that does make sense to me...  I've added this.  I had to read
this like 3 or 4 times..  Please look over the changes I made and make
sure I explained it properly..  (It could be my little brain just took a
few times to grasp it ;-) ).

> Perhaps one way to represent these conditions is to have the following 
> "language" instead of include and exclude filters:
> 
> "include:" regex
> "exclude:" regex
> "noindex": regex
> "noexpand": regex
> 
> The first two work as the include/exclude, but for things that pass 
> these two, the others add handling properties that are used in 
> processing the link and the page. Disclaimer: I'm experimenting with 
> this now and these ideas are only about two days old, so please take 
> them as such. Since we got into the discussion, I figured I'd put them 
> on the table.
> 
> >
> >		MaxItems - stops indexing after x items
> >		MaxMegs - stops indexing after x MB of data
> >
> >	File System Indexer
> >		URLReplacePrefix - can crawl c:\ but expose URL as http://mysever/docs/
> >
> Question: does this information really belong in the index? Perhaps the 
> root path should be specified, and the documents tagged with a relative 
> path to that path, but I think that, maybe, the URL to prefix the 
> document paths with should be given once per entire index and be easy to 
> change.
> 

Yes it must be in the index.  This replace context is already in the
abstractcrawler.

> >
> >		
> >	Web Indexer
> >		HTTPUser
> >		HTTPPassword
> >		HTTPUserAgent
> >		ProxyServer
> >		ProxyUser
> >		ProxyPassword
> >		HTTPSCertificate
> >		HTTPSPrivateKey
> >
> Apache Commons has HTTPClient package that has some similar concepts and 
> even implements them to some degree. I found it a bit rough still and 
> dependent on JDK 1.3, but it can be fixed easier than a new one written 
> I believe. It uses a notion of an HttpState, which is a state container 
> for an HTTP user agent, containing things like authentication 
> credentials and cookies. HTTPS support is easy to add with JSSE (which 
> is the approach taken by the HttpClient from the Commons).
> 

I actually had HttpClient in mind (have only looked at the description)
the whole time I typed this..  We can use whatever, but it makes sense
to use this if its available.  Such specific details don't belong in
this particular proposal (we're answering "What" not "How") but once we
get a proposal we like we can look at that in the implementation plan.  

> >
> >
> >	Other Possible Indexers
> >		Microsoft Exchange 5.5/2000
> >		Lotus Notes
> >		Newsgroup (NNTP)
> >		Documentum
> >		ODBC/OLEDB
> >		XML - index single XML that represents multiple documents
> >
> One idea that might prove useful is to add a "DocumentFetcher" in 
> addition to the DocumentIndexer. The two would go hand in hand, and 
> document entries created in Lucene by a particular Indexer can be 
> understood by a corresponding Fetcher. The Fetcher would then 
> encapsulate retrieval of source documents or creating useful pointers to 
> them (like URLs).
> 

I like that...  I'm just trying to figure out "How" to do that
(design-wise)..  How do we seperate the concerns of the retrieval from
the link crawling etc?  Could you perhaps patch the proposal with a
design.? 

> Another idea is to split the document storage and "envelope" from its 
> content. The content is subject to a MIME type and can be handed to a 
> parser, passed to a document factory, mapped to fields, etc. However, 
> the logic of retrieving a PDF file from a Lotus Notes database (and 
> creating a URL to point back to it), is different than getting the same 
> PDF file from the file system. The same parser and a document factory 
> can still be used though.
> 

Right..  I'm not sure we should do this at first...maybe for a later
iteration.  Thats a lot to bite off in one chew.  I want to match and
slightly exceed htDig at first (not a competitive thing, its just what I
use currently)..  Nail the 80% first and worry about the 20% later so
that we minimize up front complexity (iterative programming, etc etc) 

-Andy

> >
> >
> >Document Factory		
> >	General
> >		The minimum properties for each document should be:
> >			URL
> >			Title
> >			Abstract
> >			Full Text
> >			Score
> >
> >	HTML
> >		Support for META tags including Dublic Core syntax
> >
> >	Other Possible Document Factories
> >		Office Docs - DOC, XLS, PPT
> >		PDF
> >		
> >
> >Thanks for the great proposal.
> >
> Yes! Absolutely! Great proposal!
> 
> --Dmitry
> 
> 
> 
> --
> To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
> 
-- 
http://www.superlinksoftware.com
http://jakarta.apache.org - port of Excel/Word/OLE 2 Compound Document 
                            format to java
http://developer.java.sun.com/developer/bugParade/bugs/4487555.html 
			- fix java generics!
The avalanche has already started. It is too late for the pebbles to
vote.
-Ambassador Kosh


--
To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>


Mime
View raw message