From: Damian Birchler
To: java-user@lucene.apache.org
Subject: Overriding DefaultSimilarity to not consider tf/idf and friends
Date: Mon, 5 Nov 2012 10:26:52 +0000

Hi everyone

 

We are using Lucene to search for possible duplicates in an address database. We create an index with a document for each person in the database. Each document has a field with one term for the first name, a field with one term for the last name, and so on. I think in this setting it doesn't make sense to let term frequency, inverse document frequency and friends influence the document score (or does it?). For this reason I'm thinking of overriding DefaultSimilarity to not take tf/idf into account when scoring.

 

Do you think that's a reasonable thing to do? If so, how should I proceed? I'm looking for implementation details here: should I, e.g., override the method that calculates the term frequency to just return a constant (although, off the top of my head, I wouldn't know what a sensible constant would be)?
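To make the question concrete, here is a minimal, library-free sketch of what "returning a constant" would mean. It does not use the Lucene API: the class `FlatScoringSketch` and all of its method names are hypothetical, and the two "classic" methods only mirror the formulas documented for DefaultSimilarity (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1))). The per-term score is simplified to tf * idf^2; the real formula has further factors (norms, boosts, coord, query normalization) that are ignored here. Because the factors are multiplied, 1 is assumed to be the neutral constant.

```java
public class FlatScoringSketch {

    /** Term-frequency factor as documented for DefaultSimilarity: sqrt(freq). */
    public static double classicTf(double freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency as documented for DefaultSimilarity. */
    public static double classicIdf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Flat replacement: any match counts the same, no match counts as 0. */
    public static double flatTf(double freq) {
        return freq > 0 ? 1.0 : 0.0;
    }

    /** Flat replacement: rare and common terms are weighted identically. */
    public static double flatIdf(long docFreq, long numDocs) {
        return 1.0;
    }

    /** Simplified per-term score (assumption: norms, boosts and coord left out). */
    public static double termScore(double tf, double idf) {
        return tf * idf * idf;
    }

    public static void main(String[] args) {
        long numDocs = 100_000;       // hypothetical index size
        long rareDocFreq = 3;         // e.g. an uncommon last name
        long commonDocFreq = 5_000;   // e.g. a very common first name

        // With the classic factors a rare term outweighs a common one.
        System.out.println("classic, rare term:   "
                + termScore(classicTf(1), classicIdf(rareDocFreq, numDocs)));
        System.out.println("classic, common term: "
                + termScore(classicTf(1), classicIdf(commonDocFreq, numDocs)));

        // With the flat factors both terms contribute the same amount.
        System.out.println("flat, rare term:      "
                + termScore(flatTf(1), flatIdf(rareDocFreq, numDocs)));
        System.out.println("flat, common term:    "
                + termScore(flatTf(1), flatIdf(commonDocFreq, numDocs)));
    }
}
```

With the flat variants, the simplified score of a document is just the number of query terms it matches, so each matching field counts equally. Whether that is desirable for duplicate detection is exactly the open question: a match on a rare last name would no longer count for more than a match on a common first name.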

 

Thanks a lot,

Damian

 
