lucene-solr-dev mailing list archives

From "Trey Grainger (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1837) Reconstruct a Document (stored fields, indexed fields, payloads)
Date Sun, 21 Mar 2010 15:32:27 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847936#action_12847936 ]

Trey Grainger commented on SOLR-1837:
-------------------------------------

Re: bugs in Luke that result in missing terms - I recently fixed one such bug, and indeed
it was located in the DocReconstructor - if you are aware of others then please report them
using the Luke issue tracker.

I just pulled down the most recent Luke code, and it does look like that recent fix was made
to cover the bug I saw.  Unfortunately, the fix results in a null reference for me on my index.
I'll open an issue, as it looks like all that's needed is an extra null check.

Re: Document reconstruction is a very IO-intensive operation, so I would advise against using
it on a production system, and also it produces inexact results (because analysis is usually
a lossy operation).

I hear you about it being IO-intensive.  There are also other admin tools in Solr which do similarly
intensive operations (the schema browser, for example, which generates a list of all fields
and a distribution of terms within those fields).  The intent of the tool is one-off debugging,
not any kind of automated querying, but I'll try to do some tests to see to what degree this
tool is affecting our current production systems (I have not seen any noticeable effect thus
far).
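
A toy illustration of why this kind of reconstruction tends to be expensive: an inverted index
maps terms to documents, so recovering one document's terms means walking the term dictionary.
The dictionary-based index and the function name below are hypothetical stand-ins, not Luke's or
Solr's actual code.

```python
def reconstruct(inverted_index, doc_id):
    """Rebuild a position -> terms view of one document from a toy inverted index.

    inverted_index is assumed to map term -> {doc_id: [positions]}.
    Returns the reconstructed view and the number of terms that had to be visited.
    """
    by_position = {}
    terms_visited = 0
    for term, postings in inverted_index.items():
        # Every term in the index is visited, even if the document lacks it.
        terms_visited += 1
        for position in postings.get(doc_id, []):
            by_position.setdefault(position, []).append(term)
    return dict(sorted(by_position.items())), terms_visited


# Hypothetical three-term index covering two documents (ids 7 and 9).
index = {
    "wifi": {7: [2]},
    "hotspot": {7: [3], 9: [1]},
    "router": {9: [2]},
}
doc, visited = reconstruct(index, 7)
```

In this sketch the work grows with the number of distinct terms in the index, not with the size
of the document being inspected.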

Also, regarding the process being lossy: in this case, that is kind of the point of the tool
(in my use) - to see what has actually been put into the index vs. what was in the document
sent to the engine.  For example, if I index a field with the text "Wi-fi hotspots are a life-saver"
with payloads on parts of speech, as well as stemming, I want to be able to see something like:
"wi [1] / fi [1] | wifi [1] / hotspot [1] / are [2] / a [3] / life [1] / saver [1] | lifesaver
[1]"

With no payloads, this would simply be
"wi / fi | wifi / hotspots | hotspot / are / a / life / saver | lifesaver"

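The notation in the two examples above can be sketched in a few lines of Python.  This is only
an illustration, assuming that "/" separates successive token positions and "|" separates tokens
sharing a position, which is one reading of the examples.  The function name and the
(position, term, payload) tuples are hypothetical and are not taken from the patch.

```python
from itertools import groupby


def render_field(tokens):
    """Render (position, term, payload) tuples in the 'a / b | c' notation.

    payload may be None, in which case only the term is printed.
    """
    # sorted() is stable, so tokens sharing a position keep their input order.
    ordered = sorted(tokens, key=lambda t: t[0])
    groups = []
    for _, group in groupby(ordered, key=lambda t: t[0]):
        rendered = [
            term if payload is None else f"{term} [{payload}]"
            for _, term, payload in group
        ]
        groups.append(" | ".join(rendered))
    return " / ".join(groups)


# Hypothetical positions for the no-payload example.
tokens = [
    (1, "wi", None), (2, "fi", None), (2, "wifi", None),
    (3, "hotspots", None), (3, "hotspot", None),
    (4, "are", None), (5, "a", None), (6, "life", None),
    (7, "saver", None), (7, "lifesaver", None),
]
print(render_field(tokens))
```

Passing a payload instead of None, e.g. (1, "wi", "1"), renders that token as "wi [1]".
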
So I had initially named the tool the Solr Document Reconstructor, after the name you gave
to the tool in Luke.  Based on your comments, I think it might be less confusing for me to
call it something like "Document Inspector", since it is not truly reconstructing the original
document.

I'll try to get what I have pushed up today so you can check it out if you want.  Thanks for
your great work on that tool!

> Reconstruct a Document (stored fields, indexed fields, payloads)
> ----------------------------------------------------------------
>
>                 Key: SOLR-1837
>                 URL: https://issues.apache.org/jira/browse/SOLR-1837
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis, web gui
>    Affects Versions: 1.5
>         Environment: All
>            Reporter: Trey Grainger
>            Priority: Minor
>             Fix For: 1.5
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> One Solr feature I've been sorely in need of is the ability to inspect an index for any
particular document.  While the analysis page is good when you have specific content and a
specific field/type you want to test the analysis process for, once a document is indexed
it is not currently possible to easily see what is actually sitting in the index.
> One can use the Lucene Index Browser (Luke), but this has several limitations (gui only,
doesn't understand solr schema, doesn't display many non-text fields in human readable format,
doesn't show payloads, some bugs lead to missing terms, exposes features dangerous to use
in a production Solr environment, slow or difficult to check from a remote location, etc.).
 The document reconstruction feature of Luke provides the base for what can become a much
more powerful tool when coupled with Solr's understanding of a schema, however.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

