NMTSloth: Understanding and Testing Efficiency Degradation of Neural Machine Translation Systems


Neural Machine Translation (NMT) systems have received much recent attention due to their human-level accuracy. While existing works mostly focus on either improving accuracy or testing accuracy robustness, the computation efficiency of NMT systems, which is of paramount importance given the often vast translation demands and real-time requirements, has received surprisingly little attention. In this paper, we make the first attempt at understanding and testing the computation efficiency robustness of state-of-the-art NMT systems. By analyzing the working mechanism and implementation of 1455 publicly accessible NMT systems, we observe a fundamental property that could be manipulated in an adversarial manner to significantly reduce computation efficiency. Specifically, the computation efficiency of NMT systems is determined by the output length rather than the input length, and the output length depends on two factors: a pre-configured threshold on the maximum number of decoding iterations, which is often set pessimistically large, and an end-of-sentence (EOS) token generated at runtime. Our key motivation is to generate test inputs that delay the generation of EOS long enough that the NMT system has to iterate until it reaches the pre-configured threshold. We present NMTSloth, a gradient-guided technique that searches for minimal and unnoticeable perturbations at the character level, token level, and structure level; these perturbations delay the appearance of EOS sufficiently to force the inputs to reach the naturally unreachable threshold. To demonstrate the effectiveness of NMTSloth, we conduct a systematic evaluation on three publicly available NMT systems: Google T5, AllenAI WMT14, and Helsinki-NLP translators.
Experimental results show that NMTSloth can increase NMT systems’ response latency and energy consumption by 85% to 3153% and 86% to 3052%, respectively, by perturbing just one to three tokens in an input sentence.
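The dependence of decoding cost on the EOS token and the iteration threshold can be illustrated with a minimal sketch. This is not the NMTSloth implementation or any real NMT system: `greedy_decode`, `toy_model`, the `<trigger>` token, and the threshold value of 200 are all hypothetical, chosen only to show that the number of decoding iterations is set by when EOS appears, and falls back to the threshold when it never does.

```python
from typing import Callable, List

EOS = "<eos>"  # end-of-sentence marker (hypothetical token string)


def greedy_decode(
    next_token: Callable[[List[str], List[str]], str],
    source: List[str],
    max_length: int = 200,  # pre-configured iteration threshold (assumed value)
) -> List[str]:
    """Autoregressive loop: one model call per output token.

    The loop stops on EOS or after max_length iterations, whichever
    comes first, so cost grows with output length, not input length.
    """
    output: List[str] = []
    for _ in range(max_length):
        token = next_token(source, output)
        output.append(token)
        if token == EOS:
            break
    return output


def toy_model(source: List[str], output: List[str]) -> str:
    """Hypothetical stand-in for an NMT decoder step.

    Emits one token per source token and then EOS, unless the source
    contains "<trigger>", in which case EOS is never produced.
    """
    if "<trigger>" in source:
        return "tok"
    if len(output) < len(source):
        return "tok"
    return EOS


benign = greedy_decode(toy_model, ["a", "b", "c"])
perturbed = greedy_decode(toy_model, ["a", "<trigger>", "c"])
print(len(benign), len(perturbed))
```

Both inputs have three tokens, yet the perturbed one runs the decoder until the threshold is exhausted. In a real system the "trigger" would be whatever small perturbation the search finds; the toy model merely hard-codes that effect.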

In the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering.