lucene-solr-user mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Specialized Solr Application
Date Fri, 20 Apr 2018 17:28:02 GMT
>    1) the toughest pdfs to identify are those that are partly
>    searchable (text) and partly not (image-based text).  However, I've
>    found that such documents tend to exist in clusters.
Agreed.  We should do something better in Tika to identify image-only pages on a page-by-page
basis and then ship pages with very little text to Tesseract.  We don't currently do this.

>    3) I have indexed other repositories and noticed some silent
>    failures (mostly for large .doc documents).  Wish there was some way
>    to log these errors so it would be obvious what documents have been
>    excluded.
Agreed on the Solr side.  You can run `java -jar tika-app.jar -J -t -i <input_dir> -o
<output_dir>` and then tika-eval on the <output_dir> to count exceptions, even
exceptions in embedded documents, which are now silently ignored. ☹
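To make the exception counting concrete, here is a rough sketch of scanning a directory of `-J` JSON extracts for parse failures.  The `X-TIKA:EXCEPTION` key prefix is from memory of Tika's output and should be verified against your version:

```python
import json
from pathlib import Path

def count_exceptions(extract_dir):
    """Count extracted docs (including embedded docs) that recorded a parse exception.

    Assumes Tika's -J output: each .json file is a JSON array whose first
    element is the container document's metadata and whose remaining elements
    are embedded documents.  Any "X-TIKA:EXCEPTION*" key is treated as a
    failure marker (key name is an assumption -- check your Tika version).
    """
    failed = 0
    total = 0
    for path in Path(extract_dir).glob("*.json"):
        docs = json.loads(path.read_text(encoding="utf-8"))
        for metadata in docs:
            total += 1
            if any(k.startswith("X-TIKA:EXCEPTION") for k in metadata):
                failed += 1
    return failed, total
```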

>    4) I still don't understand the use of tika-eval - is that an
>    application that you run against a collection or what?
Currently, it is set up to run against a directory of extracts (text+metadata extracted from
pdfs/word/etc).  It will give you info about # of exceptions, lang id, and some other statistics
that can help you get a sense of how well content extraction worked.  It wouldn't take much
to add an adapter that runs against Solr and computes the same content statistics.
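The kind of statistics tika-eval reports can be sketched in a few lines.  This is illustrative only, not tika-eval's actual code; `common_words` stands in for its per-language top-20k word lists:

```python
import re

def extract_stats(text, common_words):
    """Rough per-document statistics in the spirit of tika-eval's profiler.

    The exact metrics and tokenization tika-eval uses differ; this just
    shows the idea of profiling extracted text for quality signals.
    """
    tokens = re.findall(r"\w+", text.lower())
    alphabetic = [t for t in tokens if t.isalpha()]
    common = [t for t in alphabetic if t in common_words]
    return {
        "extracted_length": len(text),          # very short => likely image-only
        "num_tokens": len(tokens),
        "num_alphabetic_tokens": len(alphabetic),
        "num_common_words": len(common),        # near zero => likely junk/OCR noise
    }
```

A document with a tiny extracted length or almost no common words is a good candidate for OCR or for the "silent failure" pile.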

>    5) I've seen reference to tika-server - but I have no idea on how
>    that tool might be usefully applied.
We have to harden it, but the benefit is that you isolate the Tika process in its own JVM
so that it can't harm Solr.  By harden, I mean we need to spawn a child process and set a
parent process that will kill and restart it on OOM or permanent hang.  We don't have that yet.
Tika very rarely runs into serious, show-stopping problems (kill -9 just might solve your
problem).  If you only have a few tens of thousands of docs, you aren't likely to run into
these problems.  If you're processing a few million, especially noisy things that come off the
internet, you're more likely to run into these kinds of problems.
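The parent-watchdog idea above can be sketched in a few lines of Python.  This is a toy illustration of the pattern, not anything Tika ships; the parameters are made up:

```python
import subprocess

def run_with_watchdog(cmd, timeout_seconds, max_restarts=2):
    """Run a command in a child process; kill and retry if it hangs.

    The point of the pattern: the parent never blocks forever on a wedged
    child, so one bad document can't take down the indexing process.
    """
    for attempt in range(max_restarts + 1):
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        try:
            out, _ = proc.communicate(timeout=timeout_seconds)
            return out
        except subprocess.TimeoutExpired:
            proc.kill()   # the "kill -9" of last resort
            proc.wait()
    return None  # permanently hung input: log it and move on
```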

>    6) Adobe Acrobat Pro apparently has a batch mode suitable for
>    flagging unsearchable (that is, image-based) pdf files and fixing them.
 Great.  If you have commercial tools available, use them.  IMHO, we have a ways to go on
our OCR integration with PDFs.

>    7) Another problem I've encountered is documents that are themselves
>    a composite of other documents (like an email thread).  The problem
>    is that a hit on such a document doesn't tell you much about the
>    true relevance of each contained document.  You have to do a
>    laborious manual search to figure it out.

Agreed.  Concordance search can be useful for making sense of large documents: <self_promotion>
https://github.com/mitre/rhapsode </self_promotion>.  The other thing that can be useful
for handling genuine attachments (pdfs inside of email) is to treat the embedded docs as their
own standalone/child docs (see the GitHub link above and SOLR-7229).
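A minimal sketch of that parent/child shape, using Solr's `_childDocuments_` JSON convention for block-join indexing (the field names here are made up for illustration):

```python
def to_solr_block(container_id, container_text, attachments):
    """Build one Solr JSON doc with embedded files as child documents.

    A hit can then be attributed to the container or to a specific
    attachment, instead of to one undifferentiated blob of text.
    """
    return {
        "id": container_id,
        "doc_type": "container",
        "content": container_text,
        "_childDocuments_": [
            {
                "id": f"{container_id}.{i}",
                "doc_type": "attachment",
                "content": text,
            }
            for i, text in enumerate(attachments)
        ],
    }
```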


>    8) Is there a way to return the size of a matching document (which,
>    I think, would help identify non-searchable/image documents)?
Not that I'm aware of, but that's one of the stats calculated by tika-eval.  Length of extracted
string, number of tokens, number of alphabetic tokens, number of "common words" (I took top
20k most common words from Wikipedia dumps per lang)...and others.

Cheers,

            Tim