lucene-general mailing list archives

From Andrzej Bialecki
Subject Re: Which one is better - Lucene OR Google Search Appliance
Date Fri, 28 Nov 2008 10:58:10 GMT
André Warnier wrote:

> Same thing.  In a collection that size, to allow meaningful searches one 
> would need to store positional information to allow proximity searches. 
>   I thus question the 35% figure.

However, Lucene is very good at compressing this data, especially if we 
consider un-stored fields, i.e. fields that are only indexed. Try it and 
measure it, and then question ;)
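
To give a feel for why positional data can compress well, here is a minimal Python sketch of the general technique: store the gap between successive positions rather than the positions themselves, and write each gap as a variable-length integer (7 payload bits per byte, high bit meaning "more bytes follow"). This is only an illustration, not Lucene's code; the function names and the sample positions are made up for the example.

```python
def vbyte_encode(n):
    """Encode a non-negative int in as few bytes as possible (7 bits per byte)."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, high bit set: more bytes follow
        n >>= 7
    out.append(n)                      # last byte, high bit clear
    return bytes(out)


def encode_positions(positions):
    """Delta-encode an ascending list of term positions, then VByte each gap."""
    out = bytearray()
    prev = 0
    for p in positions:
        out += vbyte_encode(p - prev)  # small gaps need only one byte
        prev = p
    return bytes(out)


def decode_positions(data):
    """Inverse of encode_positions."""
    positions, prev, value, shift = [], 0, 0, 0
    for b in data:
        value |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7                 # continuation byte: keep accumulating
        else:
            prev += value              # gap complete: add it to the running position
            positions.append(prev)
            value, shift = 0, 0
    return positions


# Hypothetical positions of one term within one document.
positions = [3, 17, 42, 130, 131, 900, 5000]
encoded = encode_positions(positions)
print(len(encoded), "bytes for", len(positions), "positions")
```

With these sample values the seven positions take 9 bytes, against 28 bytes if each were stored as a fixed 32-bit integer. Real numbers depend entirely on your documents, which is why measuring on a sample is the only reliable answer.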

> A practical tip now :
> I have not done this in a while, but as I recall installing Lucene on a 
> PC is a really easy thing to do. It comes with a trial indexing "robot", 
> which you can just point to a directory containing documents, and it 
> will go off and index them for you.
> So create such a directory, put a sample of your documents in it, let it 
> run and look at the results.  The whole thing will take you one hour at 
> the most, and it will provide you with a very first estimate of what 
> your issues really are, and give you some real bases to start thinking.

This is not good advice. You need to convert the PDFs to plain text 
first: Lucene as such indexes only plain text, and its demo application 
can parse at most plain text and simple HTML. It won't be able to parse 
PDFs; instead, it will fill the index with garbage terms.
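
Before pointing any indexer at a sample directory, it can help to separate the files that are PDFs from those that are not. Below is a minimal Python sketch, assuming a simple heuristic: a PDF file starts with the bytes "%PDF-". The function names are hypothetical, and the conversion step itself (with whatever PDF text extractor you choose) is deliberately left out.

```python
import os


def looks_like_pdf(path):
    """Heuristic: PDF files begin with the magic bytes '%PDF-'."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


def split_by_type(directory):
    """Walk a directory and return (other_files, pdf_files).

    pdf_files must be converted to plain text before indexing;
    other_files still need checking that they really are text or simple HTML.
    """
    other_files, pdf_files = [], []
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            path = os.path.join(root, name)
            if looks_like_pdf(path):
                pdf_files.append(path)
            else:
                other_files.append(path)
    return other_files, pdf_files
```

Calling split_by_type("sample_docs") on a (hypothetical) sample directory returns the two lists; only the converted text of the second list should go anywhere near the index.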

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
