Hello All,
I have been given the enviable job of upgrading existing faceted taxonomy indexes from Lucene 3.6
to 5.3.
To make sure that I have everything in working order, I have written a little “smoke test”
program. Facets retrieved in version 3 should be retrievable in version 5, or our upgrade
has failed.
Unfortunately, I can’t seem to put together a quick program to validate my data once it
is upgraded to version 5. Can someone tell me where I have gone off the rails?
In this email, I include:
1. The 3.6.2 validation code … (establishes what should be seen after the upgrade runs)
1.1. mvn dependencies
1.2. source code
1.3. output
2. The Lucene upgrade shell script
3. The 5.3.1 validation code (that generates nulls and isn’t quite right)
3.1. mvn dependencies
3.2. source code
4. The URL for the compressed tar file of the index data, stored in Dropbox.
Here are the key maven dependencies that I used for the 3.6 source:
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
<version>3.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-facet</artifactId>
<version>3.6.2</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-highlighter</artifactId>
<version>3.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queries</artifactId>
<version>3.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
<version>3.6.0</version>
</dependency>
Here is the code to retrieve facet data from the version 3.6 index (this does work against
Lucene 3.6):
public class FacetRunner {
    public static void main(final String[] args) throws Exception {
        File indexDirFile = new File("/Users/scott/projects/prototypes/lucene-3-and-5/lucene3/data/doc-index/lucene");
        Directory indexDir = new SimpleFSDirectory(indexDirFile);
        IndexReader indexReader = IndexReader.open(indexDir);
        Searcher searcher = new IndexSearcher(indexReader);
        File taxonomyIndexDirFile = new File("/Users/scott/projects/prototypes/lucene-3-and-5/lucene3/data/facets");
        Directory taxonomyIndexDir = new SimpleFSDirectory(taxonomyIndexDirFile);
        TaxonomyReader taxo = new DirectoryTaxonomyReader(taxonomyIndexDir);
        Term aTerm = new Term("$facets", "$fulltree$"); // new Term("text", "clarissa");
        Query q = new TermQuery(aTerm);
        TopScoreDocCollector tdc = TopScoreDocCollector.create(10, true);
        FacetSearchParams facetSearchParams = new FacetSearchParams();
        facetSearchParams.addFacetRequest(new CountFacetRequest(
                new CategoryPath("brs_recipient_domain"), 10));
        FacetsCollector facetsCollector = new FacetsCollector(facetSearchParams, indexReader, taxo);
        searcher.search(q, MultiCollector.wrap(tdc, facetsCollector));
        List<FacetResult> res = facetsCollector.getFacetResults();
        for (FacetResult facetResult : res) {
            System.out.println(facetResult.toString());
        }
    }
}
Output looks like:
Request: brs_recipient_domain nRes=10 nLbl=10
Num valid Descendants (up to specified depth): 486
Facet Result Node with 10 sub result nodes.
Name: brs_recipient_domain
Value: 2896.0
Residue: 1497.0
Subresult #0
Facet Result Node with 0 sub result nodes.
Name: brs_recipient_domain/enron.com
Value: 1979.0
Residue: 0.0
Subresult #1
Facet Result Node with 0 sub result nodes.
Name: brs_recipient_domain/aol.com
Value: 124.0
Residue: 0.0
Subresult #2
Facet Result Node with 0 sub result nodes.
Name: brs_recipient_domain/bracepatt.com
Value: 84.0
Residue: 0.0
Subresult #3
Facet Result Node with 0 sub result nodes.
Name: brs_recipient_domain/txu.com
Value: 63.0
Residue: 0.0
Subresult #4
Facet Result Node with 0 sub result nodes.
Name: brs_recipient_domain/hotmail.com
Value: 46.0
Residue: 0.0
Subresult #5
Facet Result Node with 0 sub result nodes.
Name: brs_recipient_domain/teneo-test.com
Value: 42.0
Residue: 0.0
Subresult #6
Facet Result Node with 0 sub result nodes.
Name: brs_recipient_domain/yahoo.com
Value: 41.0
Residue: 0.0
Subresult #7
Facet Result Node with 0 sub result nodes.
Name: brs_recipient_domain/dttus.com
Value: 34.0
Residue: 0.0
Subresult #8
Facet Result Node with 0 sub result nodes.
Name: brs_recipient_domain/velaw.com
Value: 30.0
Residue: 0.0
Subresult #9
Facet Result Node with 0 sub result nodes.
Name: brs_recipient_domain/netzero.net
Value: 28.0
Residue: 0.0
Process finished with exit code 0
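For comparison after the upgrade, the printed output above could be reduced to a label-to-count map. The following is only a minimal sketch in plain Java. The class name FacetBaselineParser and its parse method are hypothetical helpers, not part of Lucene, and the sketch assumes the "Name:" / "Value:" line layout shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Lucene): turns the 3.6 FacetResult.toString()
// text shown above into a label -> count map for later comparison.
class FacetBaselineParser {

    static Map<String, Integer> parse(String output) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String currentName = null;
        for (String rawLine : output.split("\\R")) {
            String line = rawLine.trim();
            if (line.startsWith("Name: ")) {
                currentName = line.substring("Name: ".length());
            } else if (line.startsWith("Value: ") && currentName != null) {
                int slash = currentName.indexOf('/');
                // Only child nodes ("dimension/label") are kept; the root node has no slash.
                if (slash >= 0) {
                    double value = Double.parseDouble(line.substring("Value: ".length()));
                    counts.put(currentName.substring(slash + 1), (int) value);
                }
                currentName = null;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // A shortened copy of the 3.6 output above.
        String sample = String.join("\n",
            "Name: brs_recipient_domain",
            "Value: 2896.0",
            "Residue: 1497.0",
            "Subresult #0",
            "Name: brs_recipient_domain/enron.com",
            "Value: 1979.0",
            "Residue: 0.0",
            "Subresult #1",
            "Name: brs_recipient_domain/aol.com",
            "Value: 124.0",
            "Residue: 0.0");
        System.out.println(parse(sample)); // prints {enron.com=1979, aol.com=124}
    }
}
```

Keeping the baseline as a map means it can be checked against whatever the 5.3.1 program prints, without depending on the 3.6 classes.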
To upgrade the indexes, I have written a shell script that runs IndexUpgrader twice on both the
facet index and the document index: first with the 4.10.4 core jar to bring them to version 4,
then with the 5.3.1 jars to bring them to version 5.
#!/bin/sh
export JARS_HOME=/users/scott/projects/prototypes/lucene-3-and-5/jars
echo "===>>>>>migrating lucene data from 3 to 4<<<<<========="
echo
export LUCENE_4_PATH=$JARS_HOME/lucene-core-4.10.4.jar
date "+DATE: %Y-%m-%d%nTIME: %H:%M:%S"
echo "upgrading facets taxonomy indices from 3 to 4 with command time java -cp $LUCENE_4_PATH org.apache.lucene.index.IndexUpgrader facets"
time java -cp $LUCENE_4_PATH org.apache.lucene.index.IndexUpgrader facets
echo
echo "upgrading document indices from 3 to 4 with command time java -cp $LUCENE_4_PATH org.apache.lucene.index.IndexUpgrader doc-index/lucene"
time java -cp $LUCENE_4_PATH org.apache.lucene.index.IndexUpgrader doc-index/lucene
echo
echo "===>>>>>migrating lucene data from 4 to 5<<<<<========="
echo
export LUCENE_5_PATH=$JARS_HOME/lucene-backward-codecs-5.3.1.jar:$JARS_HOME/lucene-core-5.3.1.jar
echo "upgrading facets taxonomy indices from 4 to 5 with command time java -cp $LUCENE_5_PATH org.apache.lucene.index.IndexUpgrader facets"
time java -cp $LUCENE_5_PATH org.apache.lucene.index.IndexUpgrader facets
echo
echo "upgrading document indices from 4 to 5 with command time java -cp $LUCENE_5_PATH org.apache.lucene.index.IndexUpgrader doc-index/lucene"
time java -cp $LUCENE_5_PATH org.apache.lucene.index.IndexUpgrader doc-index/lucene
echo
echo "done upgrading from lucene 3 to lucene 5"
date "+DATE: %Y-%m-%d%nTIME: %H:%M:%S"
No errors occur.
At this point, my index files look like Lucene version 5 files.
Now I want to validate my indexes and pull similar (if not the same) data from the upgraded
indexes.
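The check described here (every facet seen in version 3 should come back from version 5 with the same count) might be sketched as follows in plain Java. The class FacetCountDiff and its diff method are hypothetical helpers, not part of Lucene. How the "actual" map gets filled from the upgraded index is left open; an empty map stands in for a run that returns nothing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Lucene): reports baseline facets that are
// missing or have a different count after the upgrade.
class FacetCountDiff {

    static List<String> diff(Map<String, Integer> expected, Map<String, Integer> actual) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : expected.entrySet()) {
            Integer got = actual.get(entry.getKey());
            if (got == null) {
                problems.add("missing: " + entry.getKey());
            } else if (!got.equals(entry.getValue())) {
                problems.add("count changed: " + entry.getKey()
                        + " expected " + entry.getValue() + " but got " + got);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Baseline values copied from the 3.6 output earlier in this email.
        Map<String, Integer> expected = new LinkedHashMap<>();
        expected.put("enron.com", 1979);
        expected.put("aol.com", 124);

        // Placeholder for counts read from the upgraded index; left empty here
        // to mimic a run that returns nothing.
        Map<String, Integer> actual = new LinkedHashMap<>();

        List<String> problems = diff(expected, actual);
        problems.forEach(System.out::println);
        System.out.println(problems.isEmpty() ? "smoke test passed" : "smoke test FAILED");
    }
}
```

The sketch only checks the baseline entries, so extra facets in the upgraded index are ignored. That matches the "similar, if not the same" goal, but a stricter check could also walk the actual map.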
Here are the Maven dependencies for the 5.3.1 source:
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-facet</artifactId>
<version>5.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
<version>5.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-highlighter</artifactId>
<version>5.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queries</artifactId>
<version>5.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
<version>5.3.1</version>
</dependency>
Here is my 5.3.1 program. It returns nulls. What am I doing wrong?
public static void main(final String[] args) throws Exception {
    File indexDirFile = new File("/Users/scott/projects/prototypes/lucene-3-and-5/lucene5/data/doc-index/lucene");
    Path indexDirFilePath = indexDirFile.toPath();
    Directory indexDir = new SimpleFSDirectory(indexDirFilePath);
    IndexReader indexReader = DirectoryReader.open(indexDir);
    IndexSearcher searcher = new IndexSearcher(indexReader);
    File taxonomyIndexDirFile = new File("/Users/scott/projects/prototypes/lucene-3-and-5/lucene5/data/facets");
    Path taxonomyIndexDirFilePath = taxonomyIndexDirFile.toPath();
    Directory taxonomyIndexDir = new SimpleFSDirectory(taxonomyIndexDirFilePath);
    TaxonomyReader taxo = new DirectoryTaxonomyReader(taxonomyIndexDir);
    Term aTerm = new Term("$facets", "$fulltree$");
    Query q = new TermQuery(aTerm);
    FacetsCollector facetsCollector = new FacetsCollector();
    //searcher.search(q, MultiCollector.wrap(tdc, facetsCollector));
    //FacetsCollector.search(searcher, new MatchAllDocsQuery(),10,facetsCollector);
    FacetsCollector.search(searcher, q, 10, facetsCollector);
    FacetsConfig config = new FacetsConfig();
    //config.set
    Facets facets = new FastTaxonomyFacetCounts(taxo, config, facetsCollector);
    FacetResult result = facets.getTopChildren(10, "brs_recipient_domain");
    for (LabelAndValue labelValue : result.labelValues) {
        System.out.println(String.format("%s (%s)", labelValue.label, labelValue.value));
    }
}
Here is the url to a gzipped tar that contains the index (not yet upgraded): https://www.dropbox.com/s/qbr7ogwgekatrdf/faceted_lucene_data.tar.gz?dl=0
Thanks for your help.
Scott