Application of Seq2Seq Models on Code Correction

We apply various seq2seq models to programming language correction tasks on the Juliet Test Suites for C/C++ and Java from the Software Assurance Reference Dataset, achieving repair rates of 75% (C/C++) and 56% (Java). We introduce a pyramid encoder into these seq2seq models, which significantly improves computational and memory efficiency while achieving repair rates similar to those of the non-pyramid counterparts. Using transfer learning from models pretrained on the Juliet Test Suite, we successfully carry out an error type classification task on the ITC benchmark examples (only 685 code instances), pointing out a novel way of processing small programming language datasets.


Overview
This manuscript illustrates our preprocessing method for the Juliet Test Suite and provides a rough estimate of the computational overhead of the different methods.
In general, the preprocessing involves the following steps:
• Identify and extract functions from the code files
• Delete comments
• Replace function names
• Delete unnecessary white spaces and newlines
• Parse the string into a list of words
In the following parts we give an example to illustrate each step in more detail; a sketch of the full pipeline is shown below. We present only one example, for the Juliet C/C++ Test Suite; the Java Test Suite is preprocessed in a similar fashion.
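As a concrete illustration, the following is a minimal Python sketch that applies these steps to one already-extracted function string. The helper name preprocess_instance, the regular expressions, and the single-function renaming are illustrative assumptions, not our exact implementation; the sink/helper renaming is sketched separately in the next section.

```python
import re

def preprocess_instance(source: str) -> list:
    """Preprocess one extracted function string (illustrative sketch)."""
    # Delete comments: block comments first, then line comments.
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", " ", source)

    # Function name replacement: rename the first defined function to
    # `main` (assumes `source` starts at the extracted definition).
    source = re.sub(r"\b\w+(?=\s*\()", "main", source, count=1)

    # Delete unnecessary white spaces and newlines.
    source = re.sub(r"\s+", " ", source).strip()

    # Parse the string into a list of words: keep identifiers and
    # numbers whole, and emit each punctuation character as a token.
    return re.findall(r"\w+|[^\w\s]", source)
```

For example, preprocess_instance("void f(int x) { return; }") yields ['void', 'main', '(', 'int', 'x', ')', '{', 'return', ';', '}'].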

An Example Code Instance of the Juliet Test Suite for C/C++
Following is a simple example from the case CWE390_Error_Without_Action__calloc_01.c, after we have extracted the function part of the code. If there are multiple .c files in one case, we concatenate their extracted functions into a single instance. First we remove all comments, then we replace the function name with main. We replace all function names with main so that the model cannot get any information about the flaw from the function name.
Sometimes an instance contains multiple functions, such as a sink function or a helper function; we keep those functions but change their names to simply sink, helper, etc., as sketched below.
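The renaming step can be sketched as follows. It assumes, purely for illustration, that Juliet function names begin with the CWE label (e.g. CWE390_Error_Without_Action__calloc_01_bad) and that sink and helper functions carry "sink" or "helper" in their names; the helper name rename_functions is ours, not part of the test suite tooling.

```python
import re

def rename_functions(source: str) -> str:
    """Give Juliet functions neutral names (illustrative sketch)."""
    # Collect every Juliet-style identifier (assumed to start with a
    # CWE label), then rewrite each occurrence with a neutral name so
    # the model cannot read the flaw type off the function name.
    for name in set(re.findall(r"\bCWE\d+\w*", source)):
        lowered = name.lower()
        if "sink" in lowered:
            neutral = "sink"
        elif "helper" in lowered:
            neutral = "helper"
        else:
            neutral = "main"
        source = re.sub(r"\b%s\b" % re.escape(name), neutral, source)
    return source
```

In practice, an instance with several distinct sink or helper functions would need distinct neutral names (e.g. sink1, sink2) to avoid collisions; the sketch above keeps the single-name scheme described in the text.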

Power and Resource Overhead Estimation
Here we provide a table with estimates of the time and power consumption for all models we trained. The power figures are overestimates that may exceed the real values: power draw varies during a run, so it is hard to estimate exactly. For all methods listed, we used one GeForce GTX 1080 Ti graphics card and one CPU for more than 95% of the training time. According to the tech specs provided by NVIDIA, the maximum GPU power consumption is 250 W and the required system power is 600 W. We therefore charge 1 kWh of power consumption for each hour of running as our estimate. The time estimate for each model is its average training time per fold in our four-fold cross validation.
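As a worked example of how a table entry translates into an energy figure (the 30-hour per-fold training time here is hypothetical, chosen only to illustrate the arithmetic):

```latex
E_{\text{total}} \approx 1\,\mathrm{kW} \times 30\,\mathrm{h/fold} \times 4\ \text{folds} = 120\,\mathrm{kWh}
```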