SCDetector: Software Functional Clone Detection Based on Semantic Tokens Analysis

Abstract

Code clone detection aims to find code fragments with similar functionality, and it has become increasingly important in software engineering. Many approaches have been proposed for detecting code clones. Among them, token-based methods are the most scalable, but they cannot handle semantic clones because they do not consider program semantics. To address this issue, researchers apply program analysis to distill program semantics into a graph representation and detect clones by matching the graphs. However, such approaches suffer from low scalability, since graph matching is typically time-consuming. In this paper, we propose SCDetector, which combines the scalability of token-based methods with the accuracy of graph-based methods for software functional clone detection. Given the source code of a function, we first extract its control flow graph by static analysis. Instead of performing traditional heavyweight graph matching, we treat the graph as a social network and apply social-network centrality analysis to compute the centrality of each basic block. We then assign that centrality to each token in the basic block and sum the centrality of the same token across different basic blocks. In this way, a graph is turned into a set of tokens annotated with graph details (i.e., centrality), which we call semantic tokens. Finally, these semantic tokens are fed into a Siamese neural network to train a model, which is then used to detect code clones. We evaluate SCDetector on two large datasets of functionally similar code. Experimental results indicate that our system is superior to state-of-the-art methods and that, in detecting semantic clones, the time cost of SCDetector is more than 14 times lower than that of the state-of-the-art approach.
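The semantic-token step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which centrality measure is used, so degree centrality is assumed here, and the function names (`degree_centrality`, `semantic_tokens`), the dictionary-based graph format, and the choice to count every occurrence of a token within a block are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def degree_centrality(cfg):
    """Normalized degree centrality (in-degree + out-degree) / (n - 1).

    cfg maps a basic-block id to the list of its successor block ids.
    Degree centrality is an assumption; the abstract names no measure.
    """
    nodes = set(cfg)
    for successors in cfg.values():
        nodes.update(successors)
    degree = {node: 0 for node in nodes}
    for src, successors in cfg.items():
        for dst in successors:
            degree[src] += 1
            degree[dst] += 1
    denom = max(len(nodes) - 1, 1)  # avoid division by zero for one block
    return {node: d / denom for node, d in degree.items()}


def semantic_tokens(cfg, block_tokens):
    """Sum, per token, the centrality of every block the token occurs in.

    block_tokens maps a basic-block id to the list of tokens in that block.
    Each occurrence contributes the block's centrality (an assumption).
    """
    centrality = degree_centrality(cfg)
    weights = defaultdict(float)
    for block, tokens in block_tokens.items():
        for token in tokens:
            weights[token] += centrality.get(block, 0.0)
    return dict(weights)


# Hypothetical four-block control flow graph and its per-block tokens.
example_cfg = {0: [1], 1: [2, 3], 2: [3], 3: []}
example_tokens = {
    0: ["int", "x"],
    1: ["if", "x"],
    2: ["x", "return"],
    3: ["return"],
}
weights = semantic_tokens(example_cfg, example_tokens)
for token, weight in sorted(weights.items()):
    print(f"{token}: {weight:.3f}")
```

The resulting token-to-weight mapping is the kind of input that would then be passed to the Siamese network. Because it is computed in a single pass over the graph, no pairwise graph matching is needed.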

Publication
In the 35th IEEE/ACM International Conference on Automated Software Engineering.
Date