Date: Thu, 3 Sep 2009 17:56:24 -0500
Subject: MoreLikeThis Extension for documents that have tags
From: "Thomas D'Silva"
To: java-dev
Reply-To: java-dev@lucene.apache.org

Hi,

I would like to contribute a class, based on the MoreLikeThis class in contrib/queries, that generates a query from the tags associated with a document. The class assumes that documents are tagged with a set of tags, which are stored in the index in a separate Field. It determines the top document terms associated with a given tag using the information gain metric. When generating a MoreLikeThis query for a document, the tags associated with that document are used to determine the terms in the query. This class is useful for finding documents similar to a document that does not have many relevant terms of its own but was tagged.

I have attached the class and a test class and would appreciate any feedback.

Thank you,
Thomas

[Attachment: MoreLikeThisUsingTags.zip, containing MoreLikeThisUsingTags.java and TestMoreLikeThisUsingTag.java]