lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Solr Cell Question
Date Fri, 06 Sep 2013 13:54:13 GMT
It's always frustrating when someone replies with "Why not do it
a completely different way?".  But I will anyway :).

There's no requirement at all that you send things to Solr to make
Solr Cell (aka Tika) do its tricks. Since you're already in SolrJ
anyway, why not just parse on the client? This has the advantage
of letting you offload the Tika processing from Solr, and that
processing can be quite expensive. You can use the same Tika jars
that come with Solr or download whichever version you want from the
Tika project. That way, you can exercise much better control over
what's done.

Here's a skeletal program with indexing from a DB mixed in, but
it shouldn't be hard at all to pull the DB parts out.

http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
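
For the narrower case in the question (index only the body text), a
minimal sketch of the client-side approach might look like the code
below. It is not taken from the linked article. It assumes the Tika
and SolrJ jars are on the classpath; the class name BodyOnlyIndexer
and the field names "id" and "content" are made up for illustration
and would need to match your schema.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper: parse on the client, index only the body text.
final class BodyOnlyIndexer {

    // Runs Tika locally and returns just the text of the document body.
    static String extractBody(InputStream in)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        // -1 disables BodyContentHandler's default write limit.
        BodyContentHandler body = new BodyContentHandler(-1);
        // Tika fills this with whatever metadata it detects; it is
        // deliberately never copied into the Solr document.
        Metadata metadata = new Metadata();
        parser.parse(in, body, metadata, new ParseContext());
        return body.toString().trim();
    }

    // Builds a document holding only an id and the extracted body.
    // "id" and "content" are assumed field names, not from the thread.
    static SolrInputDocument buildDocument(String id, InputStream in)
            throws IOException, SAXException, TikaException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("content", extractBody(in));
        return doc;
    }
}
```

The resulting document can then be sent with the SolrJ client you
already have, e.g. solrServer.add(doc) followed by a commit, instead
of posting the raw file to /update/extract.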

FWIW,
Erick


On Thu, Sep 5, 2013 at 5:28 PM, Jamie Johnson <jej2003@gmail.com> wrote:

> Is it possible to configure Solr Cell to only extract and store the body of
> a document when indexing?  I'm currently doing the following, which I
> thought would work:
>
> ModifiableSolrParams params = new ModifiableSolrParams();
> params.set("defaultField", "content");
> params.set("xpath", "/xhtml:html/xhtml:body/descendant::node()");
>
> ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
> up.setParams(params);
>
> FileStream f = new FileStream(new File(".."));
> up.addContentStream(f);
>
> up.setAction(ACTION.COMMIT, true, true);
> solrServer.request(up);
>
>
> But the result of content is as follows
>
> <arr name="content_mvtxt">
> <str/>
> <str>null</str>
> <str>ISO-8859-1</str>
> <str>text/plain; charset=ISO-8859-1</str>
> <str>Just a little test</str>
> </arr>
>
>
> What I had hoped for was just
>
> <arr name="content_mvtxt">
> <str>Just a little test</str>
> </arr>
>
