lucene-general mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Populating a custom Solr field with text extracted from document
Date Thu, 24 Nov 2011 01:19:31 GMT

: I am a new Solr user, and would like to create a new custom field that is
: then populated with text extracted from each document when I crawl my file
: system.

what are you using to do the crawling?

Typically people feed solr structured data -- there are some things in 
Solr (like the ExtractingRequestHandler) that help you pull structure out 
of unstructured or semi-structured files, and there are things like DIH 
that can help you pull data from structured (or semi-structured) sources, 
but those aren't end-all-be-all solutions to all problems -- they aim to 
meet the 80/20 rule of simple common cases.

If you have special requirements about parsing special files...

: text text text... Received : 04 Jan 2002 17:31:40 ...text text text

...you'll need to write your own special code for parsing those files to 
extract the structure you want.

where/how you use your custom code depends on your use cases -- maybe you 
write a custom extractor for Tika and then use ExtractingRequestHandler, 
maybe you write a custom EntityProcessor and then use DataImportHandler, 
or maybe you just parse the files in the client language of your choice and 
POST the results to Solr over HTTP ... it all depends on your use case and 
what you are comfortable with.
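
To make the third option concrete, here is a minimal sketch in Python of 
parsing on the client and POSTing to Solr. Everything specific in it is an 
assumption for illustration, not something from the original question: the 
field names "text" and "received_dt" are hypothetical and would have to 
exist in your schema, the timestamp is treated as UTC, and the update URL 
and JSON format depend on your Solr version and core name.

```python
import json
import re
import urllib.request
from datetime import datetime

# Matches text like "Received : 04 Jan 2002 17:31:40" (pattern is an assumption
# based on the single sample line in the question).
RECEIVED_RE = re.compile(
    r"Received\s*:\s*(\d{1,2} [A-Za-z]{3} \d{4} \d{2}:\d{2}:\d{2})"
)


def extract_received(text):
    """Return the Received timestamp in Solr's date format, or None if absent."""
    match = RECEIVED_RE.search(text)
    if match is None:
        return None
    parsed = datetime.strptime(match.group(1), "%d %b %Y %H:%M:%S")
    # Assumption: the timestamp is UTC; adjust if your files say otherwise.
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


def build_doc(doc_id, text):
    """Build one Solr document; 'text' and 'received_dt' are hypothetical fields."""
    doc = {"id": doc_id, "text": text}
    received = extract_received(text)
    if received is not None:
        doc["received_dt"] = received
    return doc


def post_docs(docs, update_url):
    """POST a list of documents as JSON to a Solr update URL (URL is an assumption)."""
    body = json.dumps(docs).encode("utf-8")
    request = urllib.request.Request(
        update_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Your crawler would call build_doc() for each file it reads and hand the 
resulting list to post_docs(), with whatever update URL your install 
exposes. The parsing is kept separate from the HTTP call so you can check 
the extraction against sample files without a running Solr.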

BTW: Since you definitely seem to be interested in using Solr, you should 
consider sending subsequent questions to the solr-user@lucene mailing list 
(general@lucene is generally for discussions about the overall Lucene 
project, and/or questions when people really have no idea what they want 
to use)

-Hoss
