lucene-java-user mailing list archives

From Dawn Zoë Raison <d...@digitorial.co.uk>
Subject Analysers for newspaper pages...
Date Mon, 28 Nov 2011 19:09:52 GMT
Hi folks,

I'm researching the best options to use for analysing/storing newspaper 
pages in our online archive, and wondered if anyone has any good hints 
or tips on good practice for this type of media?

I'm currently thinking along the lines of using a customised 
StandardAnalyzer (no stop words + extra date token detection) wrapped 
with a Shingle filter and finally a Stopword filter. The thinking is 
that the stop words only get dropped as single terms, after the shingles 
have been built, so this should reduce the impact of stop words but 
still allow "to be or not to be" searches...
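
As a rough illustration of that chain, here is a minimal sketch in plain Java. It does not use Lucene's API: the class name ShingleSketch, the helper methods and the small stop list are all made up for the example, and the shingles are limited to two words. It only shows why removing stop words after shingling keeps phrases such as "to be" searchable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration only; this is NOT Lucene's API.
final class ShingleSketch {
    // Hypothetical stop list, chosen just for this example.
    static final Set<String> STOP_WORDS =
            Set.of("to", "be", "or", "not", "the", "a", "of");

    // Step 1: lower-case and split on anything that is not a letter or digit.
    // Stop words are deliberately kept at this stage.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 2: emit every single token plus the two-word shingle starting at it.
    static List<String> shingle(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(tokens.get(i));
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    // Step 3: drop terms that exactly match a stop word.
    // Shingles survive because "to be" is not equal to "to" or "be".
    static List<String> removeStopWords(List<String> terms) {
        List<String> out = new ArrayList<>();
        for (String t : terms) {
            if (!STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    static List<String> analyze(String text) {
        return removeStopWords(shingle(tokenize(text)));
    }

    public static void main(String[] args) {
        // prints [to be, be or, or not, not to, to be]
        System.out.println(analyze("To be or not to be"));
    }
}
```

With this ordering a phrase made entirely of stop words still produces index terms, while stand-alone stop words never reach the index. The cost is roughly one extra term per token, which is worth bearing in mind for the larger pages.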

A future aim is to add a synonym filter at search time.

We currently have ~2.5 million pages, and some of the older broadsheet 
pages can have a very large number of tokens each.
We currently index using the SimpleAnalyzer, a hangover from the 
previous developers that I hope to remedy :-).

-- 

Rgds.
*Dawn Raison*
Technical Director, Digitorial Ltd.


