oodt-dev mailing list archives

From "Mattmann, Chris A (388J)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: Catalog queries
Date Fri, 23 Sep 2011 18:17:53 GMT
Hi Tom,

Thanks. Comments below:

On Sep 22, 2011, at 2:30 PM, Thomas Bennett wrote:

> Hi,
> 
> I have a few questions about building queries for filemgr Lucene catalogs, and I was thinking someone may be able to help me.
> 
> I've ingested some files into the catalog and am now using the command line tools (and aliases - thanks Cameron!) to query it.
> 
> I'm not too familiar with writing SQL queries, but I've been able to achieve the following types of queries:
> 
> bin$ ./query_tool --url http://localhost:9000 --sql -query "SELECT Observer,Description,Duration,ExperimentID FROM KatFile WHERE Observer='jasper'" --sortBy Duration
> 
> Which returns:
> .....
> jasper,a9909ae6-822b-11e0-a7a1-0060dd4721d8,Target track,637.841571569
> jasper,47c3a4da-822a-11e0-a7a1-0060dd4721d8,Target track,565.859450817
> jasper,777b0f34-8224-11e0-a7a1-0060dd4721d8,Target track,80.9798858166
> 
> 
> bin$ ./query_tool --url http://localhost:9000 --lucene -query 'Observer:sharmila'
> 
> Which returns:
> .......
> ba9b292e-e506-11e0-ad74-9f1c5e7f0611
> b93dbc0d-e506-11e0-ad74-9f1c5e7f0611
> b7e530ec-e506-11e0-ad74-9f1c5e7f0611
> b66ff60b-e506-11e0-ad74-9f1c5e7f0611
> afc6556a-e506-11e0-ad74-9f1c5e7f0611
> 
> 
> Questions:
> 	• The SQL query does what I expect ;-) but with one problem: in what order will I receive the data? I can't figure out an automatic way to find out which column corresponds to which metadata field.

Good question! It looks like it just prints the metadata fields in an arbitrary order, as opposed to the order you requested them in the SELECT clause. This is probably not a great thing to do, so can you file an issue and we can take a look at it?

> 	• Is full SQL query syntax supported?

Nope, it's just a small subset. You can see what's supported here:

http://oodt.apache.org/components/maven/apidocs/org/apache/oodt/cas/filemgr/util/SqlParser.html

Improvements welcome! :)
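
For example, combining conditions should look something like the following (a sketch only; I'm reusing the KatFile fields from your example, and I'd double-check the exact grammar against the SqlParser javadoc above before relying on it):

bin$ ./query_tool --url http://localhost:9000 --sql -query "SELECT Observer,Duration FROM KatFile WHERE Observer='jasper' AND Description='Target track'" --sortBy Duration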

> 	• The Lucene query returns the productID. Is there a class I can use that will return something similar to the SQL query? (Although I should look at the code and find this out for myself - asking is free :-)

Heh, great question, but the answer is no. We didn't really standardize the output from these tools. I originally developed the QueryTool (which understood Lucene to begin with), and later Brian Foster added his SQL syntax to it, along with its associated response format.

Maybe we should open up an issue (and associated wiki page) on standardizing on the output.
Feel free to propose something and I'll be happy to join in (hopefully others will too).

> 	• I've not yet tested any more complex SQL and Lucene queries - I was just wondering if there was any useful info out there that would show me some more funky example queries. So far I've found the Lucene tutorial and an SQL quick ref. I'll tie this into the OODT Filemgr User Guide once I've figured these things out.

+1, that's the best place to start. We also only support a limited subset of the Lucene syntax; see the following class:

http://oodt.apache.org/components/maven/apidocs/org/apache/oodt/cas/filemgr/tools/CASAnalyzer.html
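
For example, a compound Lucene query through the query_tool would look something like this (a sketch only; I'm reusing the field names from your examples above, and I'd verify the operators you need against that class before relying on them):

bin$ ./query_tool --url http://localhost:9000 --lucene -query 'Observer:jasper AND Description:track'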

> 	• I see the version of Lucene being used is quite old (2.0.0, while the latest version is 2.9.1). Is there any reason why OODT is using this old version?

I would *love* to upgrade to 2.9.1 or 2.9.4.

Upgrading to 3.0 will break APIs for us, because Lucene changed to the ScoreCollector method for getting hits back, I believe in the 3.x series; however, we should be forward compatible with e.g. 2.9.4:

http://repo1.maven.org/maven2/org/apache/lucene/lucene-core/2.9.4/
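
If someone wants to take a crack at it, the bump would roughly be the following in the filemgr pom (or wherever the Lucene version is managed; this is an untested sketch, not something I've verified builds):

<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>2.9.4</version>
</dependency>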

> 	• Should I be spending the effort to use a different catalog (i.e., an SQL database), or are other OODT implementations using Lucene?
> Thanks in advance for any help.

Great question.

Most folks use Lucene to begin with, because it requires no external database or service; it just works out of the box. It also has a number of other advantages:

* Easy unit testing against your index
* You can copy around FM index directories and share them between machines
* You can test locally on your laptop by copying the FM index off of a server onto your laptop, and then spinning up a local FM from there. The file refs won't exist, but you can play around with the catalog and most other things work (a quick sketch of this follows after the list).
* You can open up the FM index in Luke (http://getopt.org/luke/) and then browse and query the index using the full Lucene syntax.
* It's fairly scalable (up to 10s of millions of products). You can scale beyond that, but you have to get into index partitioning, backups, etc. Also, time queries at that scale suffer from token explosion (e.g., doing a range query for 2001-01-01T00:00:00.000Z to 2003-01-01T00:00:00.000Z will explode), mainly due to the SerDe format for storing CAS metadata and product information that we used in the LuceneCatalog. This can be improved to scale beyond a few million products, but no one has invested the effort into that yet; people typically just use a SQL RDBMS, and the DataSourceCatalog, at that point.
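
As a quick sketch of the laptop trick from the list above (the property name is from my memory of the stock filemgr.properties, so double-check yours; the paths are made up):

scp -r myserver:/usr/local/filemgr/catalog /tmp/fm-index

then in your local FM's etc/filemgr.properties point the Lucene catalog at the copy:

filemgr.catalog.factory=org.apache.oodt.cas.filemgr.catalog.LuceneCatalogFactory
org.apache.oodt.cas.filemgr.catalog.lucene.idxPath=/tmp/fm-index

and start the local FM from its bin/ directory as usual. The product file refs will point at paths that don't exist locally, but catalog queries will work fine.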

To move your existing index to the DataSourceCatalog, there's a tool in FM that I wrote called ExpImpCatalog. You can find it here: http://s.apache.org/Xuq

To use the tool in an existing FM deployment, do the following:

1. Stand up a new FM that you are going to configure with your DataSourceCatalog. 
  - change the port to 9010
  - if your existing FM is in e.g., /usr/local/filemgr, put this new one in /usr/local/filemgr2
  - configure it with the DataSourceCatalog (a sketch of the relevant properties follows below)
  - set up your DB and bake its connection parameters into the FM config
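
As a rough sketch of what the catalog bits of step 1 look like in the new FM's etc/filemgr.properties (the property names are from my memory of the stock file, so check yours; the JDBC values are placeholders for your own DB):

filemgr.catalog.factory=org.apache.oodt.cas.filemgr.catalog.DataSourceCatalogFactory
org.apache.oodt.cas.filemgr.catalog.datasource.jdbc.url=jdbc:mysql://localhost/filemgr_cat
org.apache.oodt.cas.filemgr.catalog.datasource.jdbc.user=fmuser
org.apache.oodt.cas.filemgr.catalog.datasource.jdbc.pass=fmpass
org.apache.oodt.cas.filemgr.catalog.datasource.jdbc.driver=com.mysql.jdbc.Driver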

2. Go into /usr/local/filemgr/bin (your existing, Lucene-based FM)
    - run java -Djava.ext.dirs=../lib org.apache.oodt.cas.filemgr.tools.ExpImpCatalog; you should see:

]$ java -Djava.ext.dirs=../lib org.apache.oodt.cas.filemgr.tools.ExpImpCatalog
ExpImpCatalog [options] 
--source <url>
--dest <url>
 --unique
[--types <comma separate list of product type names>]
[--sourceCatProps <file> --destCatProps <file>]

This tool works as follows: you give it either a combination of --source and --dest, OR a combination of --sourceCatProps and --destCatProps.

In the case of simply --source and --dest, it will import all of the source catalog into the dest catalog via XML-RPC, talking to your source FM URL and your dest FM URL. In the case of --sourceCatProps and --destCatProps, it will do the same thing, except it won't use XML-RPC as the transport layer; it will simply instantiate a copy of the source Catalog interface object and the dest Catalog interface object (in a single JVM), and import one product and its met at a time from source to dest. I made the props-based portion of the tool to avoid transferring large met and product objects over XML-RPC, and to keep them within a single JVM.

The --unique parameter tells the tool not to import a source product ID into the dest catalog if that product ID already exists there. The --types parameter specifies a comma separated list of Product Types to export from the source catalog into the dest catalog. If --types is omitted, all product types are assumed.
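
For example, putting steps 1 and 2 together with the URLs from above, a run would look something like this (illustrative only; adjust the URLs to your deployment):

]$ cd /usr/local/filemgr/bin
]$ java -Djava.ext.dirs=../lib org.apache.oodt.cas.filemgr.tools.ExpImpCatalog \
     --source http://localhost:9000 \
     --dest http://localhost:9010 \
     --unique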

So, there is an easy way to migrate from an existing Lucene-index FM catalog into any other Catalog fronted by the FM. Another thing people sometimes do, if they have the source data and the ingestion pipeline, is to just blow away the Lucene (or whatever) Catalog and re-ingest using the Crawler/FM/Curation pipeline into e.g. a new DataSourceCatalog that they configure their existing FM to use.

Hope that helps explain things. These would probably make good javadocs, plus Wiki pages for these tools and the migration process :)

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

