incubator-rat-dev mailing list archives

From Robert Burrell Donkin <robertburrelldon...@blueyonder.co.uk>
Subject Re: apache-rat-pd
Date Tue, 16 Jun 2009 21:51:52 GMT
Robert Burrell Donkin wrote:
> Marija Šljivović wrote:
>> Hi!
>> I am working on a copy & paste (plagiarism) detector.
> 
> cool
> 
>> You can see information about the project and reports of my progress at
>> these locations:
>> http://wiki.apache.org/general/MarijaSljivovic/SoC2009ApacheRatProposal
>> https://issues.apache.org/jira/browse/RAT-45
>> or get the source code and binary distributions at:
>> http://code.google.com/p/apache-rat-pd/
>> I am now thinking of writing some heuristic misspelling checkers. These
>> algorithms will be able to spot misspelled words in source code.
>> The affected part of the code will then be sent to a code search
>> engine (Google Code Search, for example) to check whether it can find
>> similar misspellings in public code bases.
>> In that way we can assess the possibility that the code is plagiarised.
>> I am now searching for an open source library which can be used for this
>> task. I found one, jazzy ( http://jazzy.sourceforge.net/ ), and I think
>> that it is good for this purpose.
> 
> probably best to make the API pluggable (jazzy is LGPL but this is good
> advice in any case)
> 
>> Any suggestions for another solution that is better than jazzy?
> 
> i'm not sure whether it would be better, but an alternative approach
> would be to use a semi-structured text analysis tool, for example UIMA
> (http://incubator.apache.org/uima/) or lucene

for lucene, start by looking at
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/spellchecker/
and then create a custom dictionary by tokenising a large number of
source files
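
as a rough sketch of the two ideas above (a pluggable API, and a dictionary
built by tokenising source files), something like the following could be a
starting point. it uses only the JDK rather than the lucene spellchecker or
jazzy. the names MisspellingChecker, SourceDictionaryChecker, tokenise, learn
and findUnknownWords are made up for illustration, and the camelCase
splitting and the three-letter minimum are assumptions, not anything RAT
prescribes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical plug-in point: a jazzy- or lucene-backed checker
// could implement the same interface later.
interface MisspellingChecker {
    // Returns the distinct words in the source text that the checker does not know.
    List<String> findUnknownWords(String source);
}

// Hypothetical checker whose dictionary is learned from existing source files.
final class SourceDictionaryChecker implements MisspellingChecker {
    // Runs of ASCII letters; digits, punctuation and underscores act as separators.
    private static final Pattern LETTERS = Pattern.compile("[A-Za-z]+");
    // Zero-width split points inside camelCase identifiers:
    // lower|Upper (userName) and Upper|Upper-lower (XMLFile -> XML, File).
    private static final Pattern CAMEL =
            Pattern.compile("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");
    // Assumption: very short fragments (i, v, id) are noise rather than words.
    private static final int MIN_LENGTH = 3;

    private final Set<String> dictionary = new TreeSet<>();

    // Breaks source text into lower-cased word fragments.
    static List<String> tokenise(String source) {
        List<String> words = new ArrayList<>();
        Matcher m = LETTERS.matcher(source);
        while (m.find()) {
            for (String part : CAMEL.split(m.group())) {
                if (part.length() >= MIN_LENGTH) {
                    words.add(part.toLowerCase(Locale.ROOT));
                }
            }
        }
        return words;
    }

    // Adds every word fragment found in the given source text to the dictionary.
    void learn(String source) {
        dictionary.addAll(tokenise(source));
    }

    @Override
    public List<String> findUnknownWords(String source) {
        List<String> unknown = new ArrayList<>();
        for (String word : tokenise(source)) {
            if (!dictionary.contains(word) && !unknown.contains(word)) {
                unknown.add(word);
            }
        }
        return unknown;
    }
}
```

after calling learn() on a large number of source files, findUnknownWords()
on a new file gives candidate misspellings that could then be sent to a code
search engine. this sketch only flags unknown words; it does not rank
suggestions the way a real spellchecker would.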

- robert

