lucene-general mailing list archives

From: Andrzej Bialecki
Subject: Re: Which one is better - Lucene OR Google Search Appliance
Date: Fri, 28 Nov 2008 12:29:00 GMT
André Warnier wrote:
> Andrzej Bialecki wrote:
>> André Warnier wrote:
>>> Same thing.  In a collection that size, to allow meaningful searches 
>>> one would need to store positional information to allow proximity 
>>> searches.   I thus question the 35% figure.
>> However, Lucene is very good at compressing this data, especially if 
>> we consider un-stored fields, i.e. fields that are only indexed. Try 
>> it and measure it, and then question ;)
> All right, I will retry it.  I guess one should not keep relying on older 
> measurements.
> My original figure of 300% was based, I admit, on a relatively 
> comprehensive indexing of documents, with distinct search indexes per 
> "field", proximity, no stemming, no exclusion of stopwords, etc.
> Basically what one needs to search professionally in published technical 
> literature e.g.

Well, but now we are talking about much more than a single field with 
posting information. It's always possible to add a lot of other metadata 
per document, but that says nothing about the raw vs. indexed size — it 
just says you put a lot of additional data in there ;)

In order to compare the raw text vs. index size you need to create an 
index with a single field that indexes this plain text (with full 
position info, which is the default in Lucene). If you do this, you will 
find that the 35% figure is more or less accurate.
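The compression at work here comes largely from storing position *gaps* as variable-length integers, which is the idea behind Lucene's VInt encoding. A minimal, self-contained sketch of that idea (not Lucene's actual codec, just an illustration of why position data ends up far smaller than raw ints):

```java
import java.io.ByteArrayOutputStream;

public class PositionCompression {
    // Lucene-style variable-length int: low 7 bits per byte, high bit
    // set means another byte follows. Small values take a single byte.
    static void vint(int n, ByteArrayOutputStream out) {
        while ((n & ~0x7F) != 0) {
            out.write((n & 0x7F) | 0x80);
            n >>>= 7;
        }
        out.write(n);
    }

    public static void main(String[] args) {
        // Term positions within one document, ascending.
        int[] positions = {5, 13, 14, 120, 121, 250};
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int p : positions) {
            vint(p - prev, out);  // encode the gap, not the absolute position
            prev = p;
        }
        // Six positions fit in 7 bytes instead of 24 raw int bytes.
        System.out.println(out.size() + " bytes vs " + positions.length * 4);
    }
}
```

Since positions are ascending, the gaps stay small and almost all of them fit in one byte each; Lucene's real postings formats layer more on top (skip data, block encoding), but this is the core reason a fully positional index can still come in around 35% of the raw text.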

>> This is not good advice. You need to convert the PDFs to plain text 
>> first - Lucene as such indexes only plain text, and its demo 
>> application can parse at most plain text and simple HTML. It won't be 
>> able to parse PDFs, instead it will create a lot of garbage terms.
> I did not remember that, sorry.
> But I suppose there must exist some kind of "plugin" to process PDFs, no ?
> Basically I'm trying to help the OP run a test on a limited set 
> of his documents (which are PDFs of so far unknown content, but 
> supposedly text), with a minimum of installation complications, and 
> allowing him to visualise the results.
> This in order for him to get a feel for what indexing his whole 
> collection would entail.
> So, what would be the correct recommendation in this case ?

Get a copy of Nutch, which includes a PDF parser (PDFBox, accurate but 
slow), and run it on this collection using file:/// URLs.
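As an aside, the "garbage terms" point above is easy to see for yourself: if you tokenize raw PDF bytes as though they were text, the only word-like tokens you get are PDF format keywords, not document content. A toy illustration (the `rawPdf` string is a hypothetical snippet, not output of any real parser):

```java
import java.util.Arrays;

public class GarbageTerms {
    // Crude "is this a real word?" test: at least two chars, letters only.
    static boolean looksLikeWord(String t) {
        return t.length() >= 2 && t.chars().allMatch(Character::isLetter);
    }

    static long wordCount(String text) {
        return Arrays.stream(text.split("\\s+"))
                     .filter(GarbageTerms::looksLikeWord)
                     .count();
    }

    public static void main(String[] args) {
        // What raw PDF bytes look like when decoded naively as text:
        // format keywords plus compressed-stream noise.
        String rawPdf = "%PDF-1.4 4 0 obj << /Filter /FlateDecode >> "
                      + "stream x\u009c\u00abW2P ... endstream";
        // What a PDF parser such as PDFBox would extract from the same page.
        String extracted = "searching large collections of technical documents";

        System.out.println("raw:       " + wordCount(rawPdf) + " word-like tokens");
        System.out.println("extracted: " + wordCount(extracted) + " word-like tokens");
    }
}
```

The few word-like tokens that survive from the raw bytes (`obj`, `stream`, `endstream`) are PDF syntax, so indexing unconverted PDFs fills the index with exactly this kind of junk — which is why text extraction has to come first.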

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
Contact: info at sigram dot com
