lucene-solr-user mailing list archives

From ZiYuan <ziyu...@gmail.com>
Subject Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context
Date Tue, 20 Jun 2017 11:29:16 GMT
Dear Erick and Timothy,

I also took a look at the Python clients (say, SolrClient and pysolr)
because Python is my main programming language. My impression is that
1. they send HTTP requests to the server according to the server APIs, and 2.
they are not official and thus possibly not up to date. Does SolrJ talk to
the server via HTTP or in some other, more native way? Is the main benefit of
SolrJ over other clients that it ships officially with Solr? Thank you.

Best regards,
Ziyuan
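
To make the first impression concrete, here is a minimal sketch of the kind of
HTTP request a Python client would send for a highlighted search. It only
builds the URL with the standard library and does not contact a server. The
host, the core name "pdfs", and a "content" field that is both indexed and
stored are assumptions for illustration; q, hl, hl.fl and wt are standard Solr
request parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_highlight_url(base_url, core, query, hl_field="content"):
    """Build a Solr /select URL that asks for highlighting on one field.

    base_url, core and hl_field are placeholders; the highlighted field
    must be stored="true" in the schema for snippets to come back.
    """
    params = {
        "q": query,         # the query string, e.g. a quoted phrase
        "hl": "true",       # switch highlighting on
        "hl.fl": hl_field,  # field(s) to generate snippets from
        "wt": "json",       # response format
    }
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# A quoted phrase keeps both words in the same field; unquoted,
# the second word would fall through to the default field (df).
url = build_highlight_url(
    "http://localhost:8983/solr", "pdfs", 'content:"Trevor Hastie"'
)
params = parse_qs(urlsplit(url).query)
print(url)
print(params["q"])
```

Sending this URL with any HTTP library is roughly what the unofficial clients
wrap, so a missing client feature can usually be worked around by adding the
request parameter directly.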

On Jun 19, 2017 18:43, "ZiYuan" <ziyuang@gmail.com> wrote:

> Dear Erick and Timothy,
>
> yes I will parse from the client for all the benefits. I am just trying to
> figure out what is going on by indexing one or two PDF files first. Thank
> you both.
>
> Best regards,
> Ziyuan
>
> On Mon, Jun 19, 2017 at 6:17 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
>
>> bq: Hope that there is no side effect of not mapping the PDF
>>
>> Well, yes it will have that side effect. You can cure that with a
>> copyField directive from content to _text_.
>>
>> But do really consider running this as a SolrJ program on the client.
>> Tim knows in far more painful detail than I do what kinds of problems
>> there are when parsing all the different formats so I'd _really_
>> follow his advice.
>>
>> Tika pretty much has an impossible job. "Here, try to parse all these
>> different formats, implemented by different vendors with different
>> versions that more or less follow a spec which really isn't a spec in
>> many cases just recommendations using packages that may or may not be
>> actively maintained. And by the way, we'll try to handle that 1G
>> document that someone sends us, but don't blame us if we hit an
>> OOM.....". When Tika is run on the same box as Solr any problems in
>> that entire chain can adversely affect your search.
>>
>> Not to mention that Tika has to do some heavy lifting, using CPU
>> cycles that are unavailable for Solr.
>>
>> Extracting Request Handler is a fine way to get started, but for
>> production seriously consider a separate client.
>>
>> Best,
>> Erick
>>
>> On Mon, Jun 19, 2017 at 6:24 AM, ZiYuan <ziyuang@gmail.com> wrote:
>> > Hi Erick,
>> >
>> > Now it is clear. I have to update the request handler of /update/extract/
>> > from
>> > "defaults":{"fmap.content":"_text_"}
>> > to
>> > "defaults":{"fmap.content":"content"}
>> > to fill the field.
>> >
>> > Hope that there is no side effect of not mapping the PDF content to _text_.
>> > Thank you for the hint.
>> >
>> > Best regards,
>> > Ziyuan
>> >
>> > On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher <erik.hatcher@gmail.com>
>> > wrote:
>> >
>> >> Ziyuan -
>> >>
>> >> You may be interested in the example/files that ships with Solr too. It’s
>> >> got schema and config and even UI for file indexing and searching. Check
>> >> out README.txt under example/files in your Solr install.
>> >>
>> >>         Erik
>> >>
>> >> > On Jun 19, 2017, at 6:52 AM, ZiYuan <ziyuang@gmail.com> wrote:
>> >> >
>> >> > Hi Erick,
>> >> >
>> >> > thanks very much for the explanations! Clarification for question 2: more
>> >> > specifically I cannot see the field content in the returned JSON, with the
>> >> > same definitions as in the post
>> >> > <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
>> >> > :
>> >> >
>> >> > <field name="content" type="text_general" indexed="false" stored="true"/>
>> >> > <field name="text" type="text_general" multiValued="true" indexed="true"
>> >> > stored="false"/>
>> >> > <copyField source="content" dest="text"/>
>> >> >
>> >> > Is it so that Tika does not fill these two fields automatically and I have
>> >> > to write some client code to fill them?
>> >> >
>> >> > Best regards,
>> >> > Ziyuan
>> >> >
>> >> >
>> >> > On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson <erickerickson@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> 1> Yes, you can use your single definition. The author identifies the
>> >> >> "text" field as a catch-all. Somewhere in the schema there'll be a
>> >> >> copyField directive copying (perhaps) many different fields to the
>> >> >> "text" field. That permits simple searches against a single field
>> >> >> rather than, say, using edismax to search across multiple separate
>> >> >> fields.
>> >> >>
>> >> >> 2> The link you referenced is for Data Import Handler, which is much
>> >> >> different than just posting files to Solr. See
>> >> >> ExtractingRequestHandler:
>> >> >> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika.
>> >> >> There are ways to map meta-data fields from the doc into specific
>> >> >> fields matching your schema. Be a little careful here. There is no
>> >> >> standard across different types of docs as to what meta-data field is
>> >> >> included. PDF might have a "last_edited" field. Word might have a
>> >> >> "last_modified" field where the two mean the same thing. Here's a link
>> >> >> to a SolrJ program that'll dump all the fields:
>> >> >> https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can easily
>> >> >> hack out the DB bits.
>> >> >>
>> >> >> BTW, once you get more familiar with processing, I strongly recommend
>> >> >> you do the document processing on the client; the reasons are outlined
>> >> >> in that article.
>> >> >>
>> >> >> bq: even I define the fields as he said I cannot see them in the
>> >> >> search results as keys in JSON
>> >> >> Are the fields set as stored="true"? They must be to be returned in
>> >> >> requests (skipping the docValues discussion here).
>> >> >>
>> >> >> 3> Yes, the text field is a concatenation of all the other ones.
>> >> >> Because it has stored=false, you can only search it, you cannot
>> >> >> highlight or view. Fields you highlight must have stored=true BTW.
>> >> >>
>> >> >> Whether or not you can highlight "Trevor Hastie" depends on a lot of
>> >> >> things, most particularly whether that text is ever actually in a
>> >> >> field in your index. There's no guarantee that the name
>> >> >> of the file is indexed in a searchable/highlightable way.
>> >> >>
>> >> >> And the query q=id:Trevor Hastie won't do what you think. It'll be parsed
>> >> >> as
>> >> >> id:Trevor _text_:Hastie
>> >> >> _text_ is the default field, look for a "df" parameter in your request
>> >> >> handler in solrconfig.xml (usually "/select" or "/query").
>> >> >>
>> >> >> On Sat, Jun 17, 2017 at 3:04 PM, ZiYuan <ziyuang@gmail.com> wrote:
>> >> >>> Hi,
>> >> >>>
>> >> >>> I am new to Solr and I need to implement a full-text search of some PDF
>> >> >>> files. The indexing part works out of the box by using bin/post. I can see
>> >> >>> search results in the admin UI given some queries, though without the
>> >> >>> matched texts and the context.
>> >> >>>
>> >> >>> Now I am reading this post
>> >> >>> <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
>> >> >>> for the highlighting part. It is for an older version of Solr when managed
>> >> >>> schema was not available. Before fully understanding what it is doing, I have
>> >> >>> some questions:
>> >> >>>
>> >> >>> 1. He defined two fields:
>> >> >>>
>> >> >>> <field name="content" type="text_general" indexed="false" stored="true"
>> >> >>> multiValued="false"/>
>> >> >>> <field name="text" type="text_general" indexed="true" stored="false"
>> >> >>> multiValued="true"/>
>> >> >>>
>> >> >>> But why are two fields needed? Can I define a field
>> >> >>>
>> >> >>> <field name="content" type="text_general" indexed="true" stored="true"
>> >> >>> multiValued="true"/>
>> >> >>>
>> >> >>> to capture the full text?
>> >> >>>
>> >> >>> 2. How are the fields filled? I don't see relevant information in
>> >> >>> TikaEntityProcessor's documentation
>> >> >>> <https://lucene.apache.org/solr/6_6_0/solr-dataimporthandler-extras/org/apache/solr/handler/dataimport/TikaEntityProcessor.html#fields.inherited.from.class.org.apache.solr.handler.dataimport.EntityProcessorBase>.
>> >> >>> The current text extractor should already be Tika (I can see
>> >> >>>
>> >> >>> "x_parsed_by":
>> >> >>> ["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]
>> >> >>>
>> >> >>> in the returned JSON of some query). But even if I define the fields as he
>> >> >>> said, I cannot see them in the search results as keys in JSON.
>> >> >>>
>> >> >>> 3. The _text_ field seems to be a concatenation of other fields; does it
>> >> >>> contain the full text? It does not seem to be accessible by default, though.
>> >> >>>
>> >> >>> To be brief, using The Elements of Statistical Learning
>> >> >>> <http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf>
>> >> >>> as an example, how can I highlight the relevant texts for the query "SVM"?
>> >> >>> And if I change the file name to "The Elements of Statistical Learning -
>> >> >>> Trevor Hastie.pdf" and post it, how can I highlight "Trevor Hastie" for the
>> >> >>> query "id:Trevor Hastie"?
>> >> >>>
>> >> >>> Thank you.
>> >> >>>
>> >> >>> Best regards,
>> >> >>> Ziyuan
>> >> >>
>> >>
>> >>
>>
>
>
