Empirical Software Metrics for Benchmarking of Verification Tools

Yulia Demyanova, Thomas Pani, Helmut Veith, and Florian Zuleger

Vienna University of Technology, Vienna, Austria
thomas.pani@tuwien.ac.at

Abstract. In this paper we study empirical metrics for software source code, which can predict the performance of verification tools on specific types of software. Our metrics comprise variable usage patterns, loop patterns, and indicators of control-flow complexity, and are extracted by simple data-flow analyses. We demonstrate that our metrics are powerful enough to devise a machine-learning based portfolio solver for software verification. We show that this portfolio solver would be the (hypothetical) overall winner of both the 2014 and 2015 International Competition on Software Verification (SV-COMP). This gives strong empirical evidence for the predictive power of our metrics and demonstrates the viability of portfolio solvers for software verification.

1 Introduction

The success and gradual improvement of software verification tools over the last two decades is a multidisciplinary effort: modern software verifiers combine methods from a variety of overlapping fields of research, including model checking, static analysis, shape analysis, SAT solving, SMT solving, abstract interpretation, termination analysis, and pointer analysis. All of these techniques have their individual strengths, and a modern software verification tool needs to pick and choose how to combine them into a strong, stable and versatile tool. The trade-offs are based on both technical and pragmatic aspects: many tools are either optimized for specific application areas (e.g. device drivers), or towards the in-depth development of a technique for a restricted program model (e.g. termination for integer programs). Recent projects like CPA [10] and FrankenBit [20] have explicitly chosen an eclectic approach which enables them to combine different methods more easily.

There is growing awareness in the research community that the benchmarks in most research papers are only useful as proofs of concept for the individual contribution, but make comparison with other tools difficult: benchmarks are often manually selected, handcrafted, or chosen a posteriori to support a certain technical insight. Oftentimes, neither the tools nor the benchmarks are available to other researchers. The annual International Competition on Software Verification (SV-COMP, held since 2012) [2,3,8,9] is the most ambitious attempt to remedy this situation. Now based on more than 5,500 C source files, SV-COMP has the most diverse and comprehensive collection of benchmarks available, and is a natural starting point for a more systematic study of tool performance.

In this paper, we demonstrate that the competition results can be explained by intuitive metrics on the source code. In fact, the metrics are strong enough to enable us to construct a portfolio solver which would (hypothetically) win SV-COMP 2014 [2] and 2015 [3]. Here, a portfolio solver is a software verification tool which uses heuristic preprocessing to select one of the existing tools [19,24,32].

Table 1. Sources of complexity for 4 tools participating in SV-COMP'15, marked with +/-/N/A when supported / not supported / no information is available.
Extracted from competition reports [7] and tool papers [14,17].

Source of complexity | CBMC | Predator | CPAchecker | SMACK | Corresponding feature
Unbounded loops | - | N/A | N/A | - | m_SB, m_ST, m_simple, m_hard
Pointers | + | + | + | + | PTR
Arrays | + | - | N/A | + | ARRAY_INDEX
Dynamic data structures | N/A | + | N/A | + | PTR_STRUCT_REC
Non-static pointer offsets | - | + | N/A | N/A | OFFSET
Non-static size of heap-allocated memory | + | + | N/A | N/A | ALLOC_SIZE
Pointers to functions | + | N/A | N/A | N/A | m_fpcalls, m_fpargs
Bit operations | + | - | + | - | BITVECTOR
Integer variables | + | + | + | + | SCALAR_INT
Recursion | - | - | - | + | m_reccalls
Multi-threading | + | - | - | - | THREAD_DESCR
External functions | + | - | N/A | N/A | INPUT
Structure fields | + | + | N/A | + | STRUCT_FIELD
Big CFG (> 100 KLOC) | + | N/A | N/A | + | m_cfgblocks, m_maxindeg

Of course it is pointless to let a portfolio solver compete in the regular competition (except, maybe, in a separate future track), but for anybody who just wants to verify software, it provides useful insights. Portfolio solvers have been successful (and controversial) in combinatorially cleaner domains such as SAT solving [25,33,37], quantified boolean satisfiability (QSAT) [30,31,34], answer set programming (ASP) [18,27], and various constraint satisfaction problems (CSP) [19,26,28]. In contrast to software verification, in these areas the constituent tools are usually assumed to be correct. As an approach to software verification, portfolio solving brings interesting advantages: (1) a portfolio solver optimally uses the available resources, (2) it can avoid incorrect results of partially unsound tools, (3) machine learning in combination with portfolio solving allows us to select between multiple versions of the same tool with different runtime parameters, and (4) the portfolio solver gives good insight into the state of the art in software verification.

To choose the software metrics, we consider the zoo of techniques discussed above along with their target domains, our intuition as programmers, as well as the tool developer reports in their competition contributions. Table 1 summarizes these reports for the tools CBMC, Predator, CPAchecker and SMACK: the first column gives the obstacles the tools' authors identified, columns 2-5 show whether the feature is supported by the respective tool, and the last column references the corresponding metrics, which we introduce in Sect. 2.

The obtained metrics are naturally understood in three dimensions that we motivate informally first:

1. Program Variables. Does the program deal with machine or unbounded integers? Are the ints used as indices, bit-masks or in arithmetic? Dynamic data structures? Arrays? Interval analysis or predicate abstraction?
2. Program Loops. Reducible loops or goto programs? For-loops or ranking functions? Widening, loop acceleration, termination analysis, or loop unrolling?
3. Control Flow. Recursion? Function pointers? Multithreading? Simulink or complex branching?

Our hypothesis is that precise metrics along these dimensions allow us to predict tool performance. The challenge lies in identifying metrics which are predictive enough to understand the relationship between tools and benchmarks, but also simple enough to be used in a preprocessing and classification step. Sections 2.1, 2.2 and 2.3 describe metrics which correspond to the three dimensions sketched above and are based on simple data-flow analyses. Our algorithm for the portfolio is based on machine learning (ML) using support vector machines (SVMs) [12,15] over the metrics defined above.
Figure 1 depicts our experimental results on SV-COMP'15: our tool TV is the overall winner and outperforms all other tools; Sect. 4 contains a detailed discussion.

A machine-learning based method for selecting model checkers was previously introduced in [35]. Similar to our work, the authors use SVM classification with weights (cf. Sect. 3.1). Our approach is novel in the following ways:

- First, the results in [35] are not reproducible, because (1) the benchmark is not publicly available, (2) the verification properties are not described, and (3) the weighting function, in our experience crucial for good predictions, is not documented.
- Second, we use a larger set of verification tools (22 tools vs. 3). Our benchmark is not restricted to device drivers and is 10 times larger (49 MLOC vs. 4 MLOC in [35]).
- Third, in contrast to the structural metrics of [35], our metrics are computed using data-flow analysis. Based on the tool designer reports (Table 1) we believe that they have superior predictive power. A precise comparison is difficult due to the non-reproducibility of [35].

Fig. 1. Decisiveness-reliability plot for SV-COMP'15. The horizontal axis gives the percentage of correct answers c, the vertical axis the percentage of incorrect answers i. Dashed lines connect points of equal decisiveness c + i. The Overall SV-COMP score is given (if available) in parentheses. Some tools are not shown for clarity and comprehensibility.

While portfolio solvers are important, we also think that the software metrics we define in this work are interesting in their own right. Our results show that the categories in SV-COMP have characteristic metrics. Thus, the metrics can be used to (1) characterize benchmarks that are not publicly available, (2) understand large benchmarks without manual inspection, and (3) understand the presence of language constructs in benchmarks.

Summarizing, in this paper we make the following contributions:

- We define software metrics along the three dimensions of program variables, program loops and control flow in order to capture the difficulty of program analysis tasks (Sect. 2).
- We develop a machine-learning based portfolio solver for software verification that learns the best-performing tool from a training set (Sect. 3).
- We experimentally demonstrate the predictive power of our software metrics in conjunction with our portfolio solver on the software verification competitions SV-COMP'14 and SV-COMP'15 (Sect. 4).

2 Source Code Metrics for Software Verification

We introduce program features along the three dimensions of program variables, program loops and control flow, and describe how to derive corresponding metrics. The subsequent sections demonstrate their predictive power: in Sect. 3 we describe a portfolio solver for software verification based on our metrics, and in Sect. 4 we experimentally demonstrate the portfolio's success, thus attesting the descriptive and predictive power of our metrics and the portfolio.

2.1 Variable Role Based Metrics

The first set of features that we introduce are variable roles. Intuitively, a variable role is a pattern of how a variable is used in a program.
    int n = 0, y = x;
    while (x) {
        n++;
        x = x & (x-1);
    }

    (a) bitvector, counter, linear

    int fd = open(path, flags);
    int c, val = 0;
    while (read(fd, &c, 1) > 0 && isdigit(c)) {
        val = 10*val + c-'0';
    }

    (b) character, file descriptor

Fig. 2. Different usage patterns of integer variables.

Example 1. Consider the C program in Fig. 2a, which computes the number of non-zero bits of the variable x. In every loop iteration, a non-zero bit of x is set to zero and the counter n is incremented. For a human reading the program, the initialization n=0 and the statement n++ in the loop body signal that n is a counter, and the statement x = x & (x-1) indicates that x is a bit vector.

Example 2. Consider the program in Fig. 2b, which reads a decimal number from a text file and stores its numeric representation in the variable val. The statement fd=open(path, flags) indicates that variable fd stores a file descriptor, and the expression isdigit(c) indicates that c is a character, because the function isdigit() checks whether its parameter is a decimal digit character.

Criteria for Choosing Roles. We implemented 27 variable roles and give their informal definitions in Table 2. Our choice of roles is inspired by standard concepts used by programmers. In order to create the list of roles, we inspected the source code of the cBench benchmark [1] and came up with a minimal set of roles such that every variable is assigned at least one role.

Roles as Features for Selecting a Verification Tool. The developer reports in SV-COMP'15 [7] give evidence of the relevance of variable roles for selecting verification tools. Most often the authors mention language constructs which, depending on whether they are fully, partially, or not modeled by a tool, constitute its strong or weak points. We give examples of such constructs in Table 1 and relate them to variable roles. A preliminary experiment in [16], where we successfully used variable roles to predict categories in SV-COMP'13, gives further evidence for our claim.

Definition of Roles. We define roles using data-flow analysis, an efficient fixed-point algorithm popular in optimizing compilers [6]. Our current definition of roles is control-flow insensitive, and the result of the analysis is a set of variables Res_R which are assigned role R. We give the formal definition of variable roles in [16].

Example 3. We describe the process of computing roles on the example of role LINEAR for the code in Fig. 2a. Initially, the algorithm assigns to Res_LINEAR the set of all variables {x, y, n}. Then it computes the greatest fixed point in three iterations. In iteration 1, variable x is removed, because it is assigned the expression x & (x-1), resulting in Res_LINEAR = {y, n}. In iteration 2, variable y is removed, because it is assigned variable x, resulting in Res_LINEAR = {n}. In iteration 3, Res_LINEAR does not change, and the result of the analysis is Res_LINEAR = {n}.

Definition 1 (Variable Role Based Metrics). For a given benchmark file f, we compute the mapping Res_R from variable roles to the program variables of f. We derive role metrics m_R that represent the relative occurrence of each variable role R: m_R = |Res_R| / |Vars|, where R ∈ Roles.
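To make the fixed-point computation of Example 3 and the metric of Definition 1 concrete, the following self-contained C sketch replays the analysis for the three variables of Fig. 2a. It is only an illustration: the program representation (a fixed table of assignments) and all helper names are hypothetical, whereas our actual analysis runs on the program representation produced by the compiler front end.

    #include <stdio.h>

    /* Variables of Fig. 2a: x, y, n (hypothetical encoding). */
    enum { X, Y, N, NVARS };
    static const char *name[NVARS] = { "x", "y", "n" };

    /* One assignment "target = rhs": is the rhs a linear expression over
     * program variables, and which variables does it mention?           */
    struct assign {
        int target;
        int rhs_linear;          /* 1 if the rhs is a linear combination  */
        int rhs_vars[NVARS];     /* rhs_vars[v] = 1 iff v occurs in rhs   */
    };

    /* n = 0; y = x; n++ (i.e. n = n + 1); x = x & (x - 1); */
    static const struct assign prog[] = {
        { N, 1, {0, 0, 0} },
        { Y, 1, {1, 0, 0} },
        { N, 1, {0, 0, 1} },
        { X, 0, {1, 0, 0} },     /* the bitwise & makes the rhs non-linear */
    };

    int main(void) {
        int linear[NVARS] = { 1, 1, 1 };   /* start with all variables     */
        int changed = 1;
        /* Greatest fixed point: drop a variable if some assignment to it
         * is non-linear or mentions a variable that was already dropped.  */
        while (changed) {
            changed = 0;
            for (size_t i = 0; i < sizeof prog / sizeof prog[0]; i++) {
                const struct assign *a = &prog[i];
                if (!linear[a->target]) continue;
                int ok = a->rhs_linear;
                for (int v = 0; v < NVARS; v++)
                    if (a->rhs_vars[v] && !linear[v]) ok = 0;
                if (!ok) { linear[a->target] = 0; changed = 1; }
            }
        }
        int count = 0;
        for (int v = 0; v < NVARS; v++)
            if (linear[v]) { printf("LINEAR: %s\n", name[v]); count++; }
        /* Definition 1: m_LINEAR = |Res_LINEAR| / |Vars| */
        printf("m_LINEAR = %.2f\n", (double)count / NVARS);
        return 0;
    }

Running the sketch first removes x (non-linear right-hand side) and then y (assigned the non-LINEAR variable x), leaving Res_LINEAR = {n} and m_LINEAR = 1/3, as in Example 3.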
Table 2. List of variable roles with informal definitions. The type struct_type stands for a C structure, any_type for an arbitrary C type.

C type | Role name | Informal definition
int | ARRAY_INDEX | Occurs in an array subscript expression
int | ALLOC_SIZE | Passed to a standard memory allocation function
int | BITVECTOR | Used in a bitwise operation or assigned the result of a bitwise operation or a BITVECTOR variable
int | BOOL | Assigned and compared only to 0, 1, the result of a boolean operation or a BOOL variable
int | BRANCH_COND | Used in the condition of an if statement
int | CHAR | Used in a library function which manipulates characters, or assigned a character literal
int | CONST_ASSIGN | Assigned only literals or CONST_ASSIGN variables
int | COUNTER | Changed only in increment/decrement statements
int | FILE_DESCR | Passed to a library function which manipulates files
int | INPUT | Assigned the result of an external function or passed to it as a parameter by reference
int | LINEAR | Assigned only linear combinations of LINEAR variables
int | LOOP_BOUND | Used in a loop condition in a comparison operation, where it is compared to a LOOP_ITERATOR variable
int | LOOP_ITERATOR | Occurs in a loop condition, assigned in the loop body
int | MODE | Not used in comparison operations other than == and !=; assigned and compared to constant values only
int | OFFSET | Added to or subtracted from a pointer
int | SCALAR_INT | Scalar integer variable
int | SYNT_CONST | Not assigned in the program (a global or an unused variable, or a formal parameter of a global function)
int | THREAD_DESCR | Passed to a function of the pthread library
int | USED_IN_ARITHM | Used in addition/subtraction/multiplication/division
float | SCALAR_FLOAT | Scalar float variable
int*, float* | PTR_SCALAR | Pointer to a scalar value
struct_type* | PTR_STRUCT | Pointer to a structure
struct_type* | PTR_STRUCT_PTR | Pointer to a structure which has a pointer field
struct_type* | PTR_STRUCT_REC | Pointer to a recursively defined structure
struct_type* | PTR_COMPL_STRUCT | Pointer to a recursively defined structure with more than one pointer field, e.g. doubly linked lists
any_type* | HEAP_PTR | Assigned the result of a memory allocation
any_type* | PTR | Pointer to any value

2.2 Loop Pattern Based Metrics

The second set of program features we introduce is a classification of loops. The capability of Turing-complete imperative languages to express unbounded iteration entails hard and in general undecidable problems for any non-trivial program analysis. On the other hand, in many cases iteration takes trivial forms, for example in loops enumerating a bounded range (counting). In [29] we introduce a family of loop patterns that capture such differences. The ability to reason about bounds or termination of loops allows a verification tool to discharge the (un)reachability of assertions after the loop, or to compute unrolling factors and soundness limits in the case of bounded model checking. Thus we expect our loop patterns to be useful program features for constructing our portfolio.

Criteria for Choosing Loop Patterns. We start with a termination procedure for a restricted set of bounded loops. This loop pattern is inspired by basic (bounded) FOR-loops, a frequently used programming pattern. It allows us to implement an efficient termination procedure using syntactic pattern matching and data-flow analysis. Additionally, this loop class lends itself to deriving both a stronger notion of boundedness and weaker notions (heuristics) of termination. We give an informal description of these patterns in Table 3; for details cf. [29].

Usefulness of Loop Patterns. In [29] we give evidence that these loop patterns are a common engineering pattern that allows us to describe loops in a variety of benchmarks, that they indeed capture classes of different empirical hardness, and that the hardness increases as informally described in Table 3.
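Purely as an illustration, the loops in the following C fragment show how concrete code might fall into the classes of Table 3. The classifications in the comments reflect our informal reading of the patterns from [29], not output of the actual tool, and all names and bounds are made up.

    /* Illustrative only: comments give the class we would expect the
     * patterns of [29] to assign.                                        */
    void loop_pattern_examples(int n, int a[], unsigned x) {
        int sum = 0;

        /* Syntactically bounded (and hence also syntactically terminating):
         * the trip count is the constant 100, whatever the caller does
         * (assumes a has at least 100 elements).                          */
        for (int i = 0; i < 100; i++)
            a[i] = 0;

        /* Syntactically terminating: control that enters the loop always
         * leaves it, since i strictly increases towards n; whether the
         * loop is also bounded depends on what the outer control flow
         * says about n.                                                   */
        for (int i = 0; i < n; i++)
            sum += a[i];

        /* Hard: no syntactic argument yields a bound or a termination
         * proof (termination of this iteration is the open Collatz
         * problem).                                                       */
        while (x != 1)
            x = (x % 2 == 0) ? x / 2 : 3 * x + 1;

        (void)sum;
    }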
Definition 2 (Loop Pattern Based Metrics). For a given benchmark file f, we compute L^SB, L^ST, L^simple, L^hard, and the set of all loops Loops. We derive loop metrics m_P that represent the relative occurrence of each loop pattern P: m_P = |L^P| / |Loops|, where P ∈ {ST, SB, simple, hard}.

Table 3. List of loop patterns with informal descriptions.

Loop pattern | Empirical hardness | Informal definition
Syntactically bounded loops L^SB | Easy | The number of executions of the loop body is bounded (considers outer control flow)
Syntactically terminating loops L^ST | Intermediate | The loop terminates whenever control flow enters it (disregards outer control flow)
Simple loops L^simple | Advanced | A heuristic derived from syntactically terminating loops by weakening the termination criteria; a good heuristic for termination
Hard loops L^hard | Hard | Any loop that is not classified as simple

2.3 Control Flow Based Metrics

Complex control flow poses another challenge for program analysis. To measure its presence, we introduce five additional metrics: for control-flow complexity, we count (a) the number of basic blocks in the control flow graph (CFG), m_cfgblocks, and (b) the maximum indegree of any basic block in the CFG, m_maxindeg. To represent the use of function pointers, we measure (a) the ratio of call expressions taking a function pointer as an argument, m_fpcalls, and (b) the ratio of function call arguments that have a function pointer type, m_fpargs. Finally, to describe the use of recursion, we measure the number of direct recursive function calls, m_reccalls.
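As a small illustration of the first two control-flow metrics, the sketch below computes m_cfgblocks and m_maxindeg for a CFG given as an edge list. The encoding and the fixed example graph are hypothetical stand-ins for the CFG that a compiler front end would provide.

    #include <stdio.h>

    /* A CFG edge u -> v between basic blocks numbered 0 .. nblocks-1. */
    struct edge { int from, to; };

    /* m_cfgblocks is the number of basic blocks; m_maxindeg is the
     * largest number of incoming edges of any block.                  */
    static void cfg_metrics(int nblocks, const struct edge *e, int nedges,
                            int *m_cfgblocks, int *m_maxindeg) {
        int indeg[256] = { 0 };          /* enough for this illustration */
        for (int i = 0; i < nedges; i++)
            indeg[e[i].to]++;
        int max = 0;
        for (int b = 0; b < nblocks; b++)
            if (indeg[b] > max) max = indeg[b];
        *m_cfgblocks = nblocks;
        *m_maxindeg  = max;
    }

    int main(void) {
        /* Diamond-shaped CFG whose join block 3 has indegree 2. */
        const struct edge e[] = { {0,1}, {0,2}, {1,3}, {2,3} };
        int blocks, maxindeg;
        cfg_metrics(4, e, 4, &blocks, &maxindeg);
        printf("m_cfgblocks = %d, m_maxindeg = %d\n", blocks, maxindeg);
        return 0;
    }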
3 A Portfolio Solver for Software Verification

3.1 Preliminaries on Machine Learning

In this section we introduce standard terminology from the machine learning (ML) community, as can for example be found in [11].

Data Representation. A feature vector is a vector of real numbers x ∈ ℝ^n. A labeling function L : X → Y maps a set of feature vectors X ⊆ ℝ^n to a set Y ⊆ ℝ, whose elements are called labels.

Supervised Machine Learning. In supervised ML problems, the labeling function L is given as input. Regression is a supervised ML problem where the labels are real numbers, Y ⊆ ℝ. In classification, in contrast, the labels belong to a finite set of integers Y ⊂ ℤ. Binary classification considers two classes, Y = {1, -1}, and a problem with more than two classes is called multi-class classification. Given a set of feature vectors X, a labeling function L and an error measure function Err : ℝ^s × ℝ^s → ℝ, where s = |X|, a supervised ML algorithm searches for a function M : ℝ^n → Y in some function space such that the value Err(L(X), M(X)) is minimal.

Support Vector Machine. A support vector machine (SVM) [12,15] is a supervised ML algorithm, parametrized by a kernel function K(x_i, x_j) = φ(x_i)^T φ(x_j), that finds a hyperplane w^T φ(x_i) - b = 0 separating the data with different labels. In the case of binary classification,

    M(x) = sign( Σ_{i=1..s} w_i L(x_i) φ(x_i)^T φ(x) )   and   Err = (1/2) w^T w + C Σ_{i=1..s} ξ_i,    (1)

where sign(n) = 1 if n > 0 and sign(n) = -1 otherwise, x_i ∈ X is a feature vector, C > 0 is the penalty parameter of the error term, φ is implicitly given through the kernel function K, and w, b and ξ are existentially quantified parameters of the optimization problem

    minimize Err subject to L(x_i)(w^T φ(x_i) + b) ≥ 1 - ξ_i and ξ_i ≥ 0,    (2)

with ξ_i measuring the degree of misclassification of the point x_i. The kernel function and C ∈ ℝ are parameters of the SVM. An example of a non-linear kernel function is the radial basis function (RBF): K(x_i, x_j) = exp(-γ ||x_i - x_j||²), γ > 0.

Probabilistic Classification. Probabilistic classification is a generalization of the classification algorithm which searches for a function M_p : ℝ^n → (Y → [0,1]) mapping a feature vector to a class probability distribution, i.e. a function P : Y → [0,1] from the set of classes Y to the unit interval. There is a standard algorithm for estimating class probabilities for SVMs [36].

Creating and Evaluating a Model. The function M is called a model, the set X used for creating the model is called the training set, and the set X' used for evaluating the model is called the test set. To avoid an overly optimistic evaluation of the model, it is common to require that the training and test sets are disjoint: X ∩ X' = ∅. A model which produces accurate results with respect to the error measure for the training set, but results in a high error for previously unseen feature vectors x ∉ X, is said to overfit.

Data Imbalances. A labeling function L is said to be imbalanced when it exhibits an unequal distribution between its classes, i.e. ∃ y_i, y_j ∈ Y. Num(y_i)/Num(y_j) ≈ 100, where Num(y) = |{x ∈ X | L(x) = y}|, that is, imbalances of the order of 100:1 and higher. Data imbalances significantly compromise the performance of most standard learning algorithms [21]. A common solution to the imbalanced data problem is to use a weighting function W : X → ℝ [23]. SVM with weights is a generalization of SVM where Err = (1/2) w^T w + C Σ_{i=1..s} W(x_i) ξ_i. W is usually chosen empirically.

An orthogonal way of dealing with data imbalances is the reduction of a multi-class classification problem to multiple binary classification problems: one-vs-all classification creates one model M_i per class i, with the labeling function L_i(x) = 1 if L(x) = i and L_i(x) = -1 otherwise, and the predicted value is calculated as M(x) = choose({i | M_i(x) = 1}), where a suitable operator choose is used to pick a single class from the set of predicted classes.
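To ground these preliminaries, the following fragment sketches how a single binary classifier with an RBF kernel, class weights and probability estimates could be set up through LIBSVM's C interface [13]. It is a minimal sketch, not our implementation: the training data, parameter values and weights are placeholders, and stock LIBSVM only exposes per-class weights (the nr_weight, weight_label and weight fields of svm_parameter), so a per-task weighting function such as the one defined in Sect. 3.3 requires LIBSVM's weighted-instance variant or a comparable workaround.

    #include <stdio.h>
    #include "svm.h"                     /* LIBSVM's C interface */

    /* Four toy training tasks with three metric features each.  Label 1
     * stands for "tool answers correctly", label 3 for "tool answers
     * incorrectly"; all concrete numbers are placeholders.              */
    static struct svm_node x0[] = { {1, 0.10}, {2, 0.00}, {3, 0.0}, {-1, 0} };
    static struct svm_node x1[] = { {1, 0.33}, {2, 0.50}, {3, 0.0}, {-1, 0} };
    static struct svm_node x2[] = { {1, 0.25}, {2, 0.40}, {3, 0.1}, {-1, 0} };
    static struct svm_node x3[] = { {1, 0.05}, {2, 0.00}, {3, 0.9}, {-1, 0} };
    static struct svm_node *xs[] = { x0, x1, x2, x3 };
    static double ys[] = { 1, 1, 1, 3 };

    int main(void) {
        struct svm_problem prob = { 4, ys, xs };

        struct svm_parameter param = { 0 };
        param.svm_type    = C_SVC;
        param.kernel_type = RBF;       /* K(xi,xj) = exp(-gamma*||xi-xj||^2) */
        param.gamma       = 0.5;       /* in practice found by grid search   */
        param.C           = 10;
        param.cache_size  = 100;
        param.eps         = 1e-3;
        param.probability = 1;         /* enable class probability estimates */

        /* Per-class weights to counter the imbalance between labels 1 and 3. */
        int    wlabel[] = { 1, 3 };
        double wvalue[] = { 1.0, 25.0 };
        param.nr_weight    = 2;
        param.weight_label = wlabel;
        param.weight       = wvalue;

        const char *err = svm_check_parameter(&prob, &param);
        if (err) { fprintf(stderr, "%s\n", err); return 1; }
        struct svm_model *model = svm_train(&prob, &param);

        /* Class probability estimate for an unseen task. */
        struct svm_node query[] = { {1, 0.20}, {2, 0.30}, {3, 0.2}, {-1, 0} };
        double p[2];
        double label = svm_predict_probability(model, query, p);
        printf("predicted label %g (probabilities %g, %g)\n", label, p[0], p[1]);

        svm_free_and_destroy_model(&model);
        return 0;
    }

Our portfolio trains one such weighted classifier, together with probability estimates in the sense of [36], per verification tool over the three labels introduced in Sect. 3.3; the two-label setup above is only for brevity.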
3.2 The Competition on Software Verification SV-COMP

Setup. A verification task in SV-COMP is given as a C source file f and a verification property p. The property is either a label reachability check or a memory safety check (comprising checks for freedom of unsafe deallocations, unsafe pointer dereferences, and memory leaks). The expected answer ExpAns is provided for each task by the designers of the benchmark. The verification tasks are partitioned into categories, manually grouped by characteristic features such as usage of bit vectors, concurrent programs, Linux device drivers, etc.

Scoring. The competition assigns a score to each tool's result on a verification task v. The category score of a tool is defined as the sum of the scores for the individual tasks in the category. In addition, medals are awarded to the three best tools in each category. The Overall SV-COMP score considers all verification tasks, with each constituent category score normalized by the number of tasks in it.

3.3 Tool Selection as a Machine Learning Problem

In this section, we first describe the setup of our portfolio solver TV, and then define the notion of the best-performing tool t_best predicted by TV.

Definitions. A verification task v = (f, p, type) is a triple of a source file f, the property p and the property type type (e.g. reachability or memory safety). The function ExpAns : Tasks → {true, false} maps a verification task v ∈ Tasks to true if the property p holds for f and to false otherwise. We identify each verification tool by a unique natural number t ∈ ℕ. The result of a run of a tool t on a verification task v is a pair (ans_{t,v}, runtime_{t,v}), where ans_{t,v} ∈ {true, false, unknown} is the tool's answer whether the property holds, and runtime_{t,v} ∈ ℝ is the runtime of the tool in seconds. The expected answer for a task v is the boolean value ExpAns(v).

Machine Learning Data. We compute feature vectors from the metrics and the results of the competition as follows: for a verification task v we define the feature vector

    x(v) = (m_ARRAY_INDEX(v), ..., m_PTR(v), m_ST(v), ..., m_hard(v), m_cfgblocks(v), ..., m_reccalls(v), type(v)),

where the m_i(v) are our metrics from Sect. 2 and type(v) ∈ {0, 1} encodes whether the property is reachability or memory safety.

The portfolio solver predicts a tool identifier t ∈ {1, ..., n}, which is a multi-class classification problem. We use a generalization of one-vs-all classification to solve the problem. We define the labeling function L_t(v) for tool t and task v as follows:

    L_t(v) = 1  if ans_{t,v} = ExpAns(v)
             2  if ans_{t,v} = unknown
             3  if ans_{t,v} ≠ unknown and ans_{t,v} ≠ ExpAns(v)

where we treat opted-out categories as if the tool had answered unknown for all of the category's verification tasks.

Formulation of the Machine Learning Problem. Given |Tools| classification problems for a task v, the portfolio algorithm chooses a tool t_best as follows:

    t_best(v) = choose(TCorr(v))   if TCorr(v) ≠ ∅
                choose(TUnk(v))    if TCorr(v) = ∅ and TUnk(v) ≠ ∅
                t_winner           if TCorr(v) = ∅ and TUnk(v) = ∅

where TCorr(v) = {t ∈ Tools | M_t(v) = 1}, TUnk(v) = {t ∈ Tools | M_t(v) = 2}, and t_winner is the winner of the competition, e.g. CPAchecker in SV-COMP'15. We now describe two alternative ways of implementing the operator choose.

1. "Success/Fail + Time": TV^SuccFailTime. We formulate |Tools| additional regression problems, where the predicted value is the runtime runtime_{t,v} of the tool. We define choose(T) = argmin_{t ∈ T} runtime_{t,v}.
2. "Success/Fail + Probability": TV^SuccFailProb. We define the operator choose(T) = argmax_{t ∈ T} P_{t,v}, where P_{t,v} is the class probability estimate.

In Table 4 we compare the two choose operators for category Overall in the setup of SV-COMP'14 according to three criteria: the score, the percentage of correctly and incorrectly answered tasks, and the place in the competition. Configuration TV^SuccFailProb yields a higher score and more correct answers with less runtime. We believe this is because the tool runtimes vary over five orders of magnitude (from tenths of a second to 15 min), which causes high error rates in the predicted runtime. We therefore use configuration TV^SuccFailProb and in the following refer to it as TV.

Table 4. Comparison of the two formulations of TV.

Setting | Correct / Incorrect / Unknown answers, % | Score | Runtime, s | Place
TV^SuccFailTime | 92 / 3 / 6 | 1384 | 279859 | 1
TV^SuccFailProb | 93 / 1 / 5 | 1494 | 132688 | 1
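To make the selection rule above concrete, the following self-contained C sketch implements t_best for the TV^SuccFailProb configuration: it prefers tools predicted to answer correctly (label 1), falls back to tools predicted to answer unknown (label 2), and otherwise returns the competition winner. The arrays of predicted labels and probability estimates are placeholder inputs that a real implementation would obtain from the per-tool models.

    #include <stdio.h>

    #define NTOOLS 4
    #define WINNER 0                    /* e.g. CPAchecker in SV-COMP'15 */

    /* Predicted label per tool: 1 = correct, 2 = unknown, 3 = incorrect. */
    static int pick_tool(const int label[NTOOLS], const double prob[NTOOLS]) {
        int best = -1;
        /* choose(TCorr): highest probability among tools predicted correct */
        for (int t = 0; t < NTOOLS; t++)
            if (label[t] == 1 && (best < 0 || prob[t] > prob[best])) best = t;
        if (best >= 0) return best;
        /* choose(TUnk): highest probability among tools predicted unknown  */
        for (int t = 0; t < NTOOLS; t++)
            if (label[t] == 2 && (best < 0 || prob[t] > prob[best])) best = t;
        if (best >= 0) return best;
        return WINNER;                  /* TCorr(v) and TUnk(v) are empty   */
    }

    int main(void) {
        /* Placeholder predictions for one verification task. */
        int    label[NTOOLS] = { 2, 1, 1, 3 };
        double prob[NTOOLS]  = { 0.7, 0.6, 0.9, 0.8 };
        printf("t_best = tool %d\n", pick_tool(label, prob));  /* tool 2 */
        return 0;
    }

In TV^SuccFailProb the prob[] values are the per-tool class probability estimates P_{t,v}; for TV^SuccFailTime one would instead minimize the predicted runtimes.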
The Weighting Function. We analyzed the results of SV-COMP'14 and observed that the labeling function in the formulation of TV^SuccFailProb is highly imbalanced: the label which corresponds to incorrect answers, L_t(v) = 3, occurs in less than 4% of the cases for every tool. We therefore use SVM with weights, in accordance with standard practice in machine learning. We note that we use the same weighting function for our experiments in the setup of SV-COMP'15 without any changes.

Given a task v and a tool t, we calculate the weighting function W as follows:

    W(v, t) = Potential(v) * Criticality(v) * Performance(t, Cat(v)) * Speed(t, Cat(v)),

where

- Potential(v) = score_max(v) - score_min(v) is the difference of the maximal and minimal possible scores for task v. For example, in the setup of SV-COMP'14, if v is safe, then score_max(v) = 2 and score_min(v) = -8;
- Criticality(v) = 1 / |{t ∈ Tools | ans_{t,v} = ExpAns(v)}| is inversely proportional (up to a constant factor) to the probability of randomly choosing a tool which gives the expected answer;
- Performance(t, c) = cat_score(t, c) / cat_score(t_c^best, c) is the ratio of the scores of tool t and the best tool t_c^best in category c, where, given the score score_{t,v} of tool t for task v, t_c^best = argmax_{t_i ∈ Tools} cat_score(t_i, c) and cat_score(t, c) = Σ_{v ∈ Tasks, Cat(v)=c} score_{t,v};
- Speed(t, c) = ln(rel_time(t, c)) / ln(rel_time(t_c^fst, c)) is the relative difference of the orders of magnitude of the fractions of the total runtime spent by tool t and by the fastest tool t_c^fst in category c, respectively, where rel_time(t, c) = cat_time(t, c) / Σ_{t_i ∈ Tools} cat_time(t_i, c), t_c^fst = argmin_{t_i ∈ Tools} cat_time(t_i, c), and cat_time(t, c) = Σ_{v ∈ Tasks, Cat(v)=c} runtime_{t,v}.

Implementation of TV. Finally, we discuss the details of the implementation of TV. We use the SVM algorithm with the RBF kernel and weights as implemented in the LIBSVM library [13]. To find optimal parameters for the ML algorithm with respect to the error measure function, we perform an exhaustive grid search, as described in [22].
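For illustration, the sketch below computes W(v, t) for one task and one tool from precomputed ingredients, following the four factors defined above. All inputs (scores, counts, runtimes) are placeholder values; in our setting they are derived from the SV-COMP results of the training year.

    #include <math.h>
    #include <stdio.h>

    /* Inputs for one task v and one tool t (placeholder values).            */
    struct task_info {
        double score_max, score_min;   /* maximal/minimal possible score     */
        int    n_tools_correct;        /* tools giving the expected answer   */
    };
    struct tool_cat_info {             /* statistics of t in category Cat(v) */
        double cat_score, best_cat_score;    /* of t and of t_c^best         */
        double rel_time, fastest_rel_time;   /* of t and of t_c^fst          */
    };

    static double weight(const struct task_info *v, const struct tool_cat_info *t) {
        double potential   = v->score_max - v->score_min;
        double criticality = 1.0 / v->n_tools_correct;
        double performance = t->cat_score / t->best_cat_score;
        double speed       = log(t->rel_time) / log(t->fastest_rel_time);
        return potential * criticality * performance * speed;
    }

    int main(void) {
        /* A safe task in the SV-COMP'14 setup: score_max = 2, score_min = -8. */
        struct task_info     v = { 2, -8, 5 };
        struct tool_cat_info t = { 800, 1000, 0.10, 0.02 };
        printf("W(v,t) = %.3f\n", weight(&v, &t));
        return 0;
    }

The resulting weight is attached to the corresponding training instance of tool t before the weighted SVM is trained.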
4 Experimental Results

4.1 SV-COMP 2014 vs. 2015

As described in Sect. 3.2, SV-COMP provides two metrics for comparing tools: score and medal counts. As the scoring policy has recently changed (the penalties for incorrect answers were increased) after a close jury vote [4], we are interested in how stable the scores are under different scoring policies. The following table gives the three top-scoring tools in Overall and their scores in SV-COMP'14 and '15, as well as the top scorers of SV-COMP'14 if the 2015 scoring policy had been applied, and vice versa:

Competition | Scoring | 1st place (score) | 2nd place (score) | 3rd place (score)
SV-COMP'14 | Original | CBMC (3,501) | CPAchecker (2,987) | LLBMC (1,843)
SV-COMP'14 | Like '15 | CPAchecker (3,035) | CBMC (2,515) | LLBMC (2,004)
SV-COMP'15 | Original | CPAchecker (4,889) | Ult. Automizer (2,301) | CBMC (1,731)
SV-COMP'15 | Like '14 | CPAchecker (5,537) | SMACK (4,120) | CBMC (3,481)

Discussion. Clearly, the scoring policy has a major impact on the competition results: if the '15 policy is applied to SV-COMP'14, the first and second placed tools switch ranks. For SV-COMP'15, applying the previous year's policy has an even stronger effect: Ultimate Automizer loses its silver medal to SMACK, a tool originally not among the top three, and CBMC almost doubles its points. Given that the SV-COMP score and thus also the medal counts are rather volatile, we introduce decisiveness-reliability plots (DR-plots) in the next section to complement our interpretation of the competition results.

4.2 Decisiveness-Reliability Plots

To better understand the competition results, we create scatter plots where each data point v = (c, i) represents a tool that gives c% correct answers and i% incorrect answers. Figures 1 and 3 show such plots based on the verification tasks in SV-COMP'14 and '15. Each data point marked by an unfilled circle represents one competing tool. The rectilinear distance c + i from the origin gives a tool's decisiveness, i.e. the farther from the origin, the fewer times the tool reports "unknown". The angle enclosed by the horizontal axis and v gives a tool's (un)reliability, i.e. the wider the angle, the more often the tool gives incorrect answers. Thus, we call such plots decisiveness-reliability plots (DR-plots).

Discussion. Figures 1 and 3 show DR-plots for the verification tasks in SV-COMP'14 and '15. For 2014, all tools perform quite well with respect to soundness: none of them gives more than 4% incorrect answers. CPAchecker, ESBMC and CBMC are highly decisive tools, with more than 83% correct answers. In 2015 (Fig. 1) the number of verification tasks more than doubled, and there is more variety in the results: we see that very reliable tools (BLAST, SMACK, and CPAchecker) are limited in decisiveness, as they report "unknown" in more than 40% of the cases. The bounded model checkers CBMC and ESBMC are more decisive, at the cost of giving up to 10% incorrect answers. We also give the Overall SV-COMP scores (where applicable) in parentheses. Clearly, tools close together in the DR-plot do not necessarily have similar scores, because of the different score weights prescribed by the SV-COMP scoring policy.

Fig. 3. Decisiveness-reliability plot for SV-COMP'14: correct and incorrect answers by the tools, with the Overall SV-COMP score in parentheses. Some tools are not shown for clarity and comprehensibility.

Referring back to Figs. 1 and 3, we also show the theoretic strategies T_cat and T_vbs, marked by a square: given a verification task v, T_cat selects the tool winning the corresponding competition category Cat(v). T_vbs is the virtual best solver (VBS) and selects the best-performing tool per verification task. Both strategies illustrate that combining tools can yield an almost perfect solver, with more than 95% correct and 0% incorrect answers. (Note that these figures may give an overly optimistic picture: after all, the benchmarks are supplied by the competition participants.) The results for T_vbs compared to T_cat indicate that leveraging more than just the category winner provides an advantage in both reliability and decisiveness. A useful portfolio would thus lie somewhere between CPAchecker, CBMC, T_cat, and T_vbs, i.e. improve upon the decisiveness of its constituent tools while minimizing the number of incorrect answers.
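The following sketch shows, on placeholder data, how the DR-plot coordinates (c, i) of each tool and of the virtual best solver T_vbs can be obtained from a result matrix: the simplified T_vbs here answers a task correctly whenever some tool does, and incorrectly only when every tool is wrong.

    #include <stdio.h>

    #define NTOOLS 3
    #define NTASKS 6

    /* Result of tool t on task v: 1 = correct, 0 = unknown, -1 = incorrect.
     * Placeholder data; in the paper these come from the competition runs. */
    static const int res[NTOOLS][NTASKS] = {
        { 1, 1, 0, 0, -1, 1 },
        { 1, 0, 1, 0,  0, -1 },
        { 0, 1, 1, 0,  1, 1 },
    };

    int main(void) {
        /* DR-plot coordinates (c, i) per tool: percentage of correct and
         * of incorrect answers over all tasks.                            */
        for (int t = 0; t < NTOOLS; t++) {
            int c = 0, i = 0;
            for (int v = 0; v < NTASKS; v++) {
                if (res[t][v] == 1)  c++;
                if (res[t][v] == -1) i++;
            }
            printf("tool %d: c = %5.1f%%  i = %5.1f%%\n",
                   t, 100.0 * c / NTASKS, 100.0 * i / NTASKS);
        }

        /* Virtual best solver: correct if some tool is correct, incorrect
         * only if every tool is incorrect, unknown otherwise.             */
        int c = 0, i = 0;
        for (int v = 0; v < NTASKS; v++) {
            int any_correct = 0, all_incorrect = 1;
            for (int t = 0; t < NTOOLS; t++) {
                if (res[t][v] == 1)  any_correct = 1;
                if (res[t][v] != -1) all_incorrect = 0;
            }
            if (any_correct) c++;
            else if (all_incorrect) i++;
        }
        printf("T_vbs : c = %5.1f%%  i = %5.1f%%\n",
               100.0 * c / NTASKS, 100.0 * i / NTASKS);
        return 0;
    }

T_cat can be computed analogously by replacing the inner loop with a lookup of the winner of the task's category.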
4.3 Evaluation of Our Portfolio Solver

We implemented the ML-based portfolio TV for SV-COMP'14 in our tool Verifolio [5]. When the competition results for SV-COMP'15 became available, we successfully evaluated the existing techniques on the new data. We present these results both in terms of the traditional metrics used by the competition (SV-COMP score and medals) and by the portfolio's placement in DR-plots.

Setup. For our experiments we did not rebuild the infrastructure of SV-COMP, but use the numeric results from the competition to compare our portfolio approach against the other tools. Following standard practice in ML [11], we randomly split the verification tasks of SV-COMP'year into a training set train_year and a test set test_year with a ratio of 60:40. We train our portfolio on train_year and re-run the competition on test_year, with the portfolio participating as an additional tool. As the partitioning into training and test sets is randomized, we conduct the experiment 10 times and report the arithmetic mean of all figures. Figures 4a and b show the Overall SV-COMP scores, runtimes and medal counts. The DR-plots in Figs. 1 and 3 show the portfolio marked by a filled circle.

(a) Overall SV-COMP score, runtime and medal counts for SV-COMP'14.

Tool | Score | Runtime (min) | Medals (gold/silver/bronze)
blast | 468 | 2066 | 1/0/0
cbmc | 1292 | 4991 | 2/2/2
cpachecker | 1235 | 1865 | 2/1/1
cpalien | 266 | 776 | 0/0/0
esbmc | 695 | 4024 | 1/0/1
fbit | 666 | 898 | 0/0/2
llbmc | 853 | 978 | 1/0/1
ufo | 735 | 381 | 1/1/0
TV | 1494 | 2211 | 1/5/1
T_cat | 1732 | 1310 | -
T_vbs | 1840 | 270 | -

(b) Overall SV-COMP score, runtime and medal counts for SV-COMP'15.

Tool | Score | Runtime (min) | Medals (gold/silver/bronze)
blast | 737 | 4546 | 1/0/0
cascade | 806 | 5146 | 0/0/0
cbmc | 684 | 11936 | 1/1/1
cpachecker | 2228 | 6288 | 2/1/5
predatorhp | 389 | 96 | 1/0/1
smack | 1542 | 8727 | 2/1/1
ultimatekojak | 1215 | 7979 | 0/2/0
ulcseq | 273 | 12563 | 0/0/0
TV | 2511 | 6260 | 1/6/1
T_cat | 3231 | 4360 | -
T_vbs | 3768 | 1882 | -

Fig. 4. Experimental results for the eight best competition participants in Overall, plus our portfolio TV and the idealized strategies T_cat and T_vbs, on random subsets of SV-COMP, given as the arithmetic mean of 10 experiments on the respective test sets test_year. For each tool we give the Overall SV-COMP score, the runtime in minutes, and the number of gold/silver/bronze medals won in individual categories.

Overhead of Feature Extraction. By construction, our portfolio incurs an overhead for feature extraction and prediction before actually executing the selected tool. We find this overhead to be negligible, with a median time of 0.5 seconds for feature extraction and 0.5 seconds for prediction.

Discussion. First, we discuss our results in terms of Overall SV-COMP score and medals. The experimental results for SV-COMP'14 in Fig. 4a show that our portfolio overtakes the original Overall winner CBMC with 16% more points. It wins a total of seven medals (1/5/1 gold/silver/bronze) compared to CBMC's six medals (2/2/2). For SV-COMP'15 (Fig. 4b), our portfolio TV is again the strongest tool, collecting 13% more points than the original Overall winner CPAchecker. Both CPAchecker and TV collect 8 medals, with CPAchecker's 2/1/5 against TV's 1/6/1.

Second, we discuss the DR-plots in Figs. 1 and 3. Our portfolio TV positions itself between CPAchecker, CBMC and the theoretic strategies T_cat and T_vbs. Furthermore, TV falls about halfway between the concrete tools and the idealized strategies. We think this is a promising result, but it also leaves room for future work. Here we invite the community to contribute further feature definitions, learning techniques, portfolio setups, etc. to enhance this approach.

Constituent Verifiers Employed by Our Portfolio. Our results could suggest that TV merely implements a trade-off between CPAchecker's conservative-and-sound and CBMC's decisive-but-sometimes-unsound approach. On the contrary, our experiments show that significantly more tools are selected by our portfolio solver. Additionally, we find that our approach is able to select domain-specific solvers: for example, in the Concurrency category, TV almost exclusively selects variants of CSeq, which translates concurrent programs into equivalent sequential ones.
Wrong Predictions. Finally, we investigate cases of wrong predictions made by the portfolio solver, which are due to two reasons: First, ML operates on the assumption that the behavior of a verification tool is the same for different verification tasks with the same or very similar metrics. However, sometimes this is not the case, because tools are (1) unsound, e.g. SMACK in category Arrays, (2) buggy, e.g. BLAST in DeviceDrivers64, or (3) incomplete, e.g. CPAchecker in ECA. Second, the data imbalances lead to the following bias in ML: for a verification tool that is correct most of the time, ML will prefer the error of predicting that the tool is correct (when it is in fact incorrect) over the error of predicting that the tool is incorrect (when it is in fact correct), i.e. "good" tools are predicted to be even "better".

References

1. Collective benchmark (cBench). http://ctuning.org/wiki/index.php/CTools:CBench. Accessed 6 Feb 2015
2. Competition on Software Verification 2014. http://sv-comp.sosy-lab.org/2014/. Accessed 6 Feb 2015
3. Competition on Software Verification 2015. http://sv-comp.sosy-lab.org/2015/. Accessed 6 Feb 2015
4. SV-COMP 2014 - Minutes. http://sv-comp.sosy-lab.org/2015/Minutes-2014.txt. Accessed 6 Feb 2015
5. Verifolio. http://forsyte.at/software/verifolio/. Accessed 11 May 2015
6. Aho, A.V., Sethi, R., Ullman, J.D.: Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading (1986)
7. Baier, C., Tinelli, C. (eds.): TACAS 2015. LNCS, vol. 9035. Springer, Heidelberg (2015)
8. Beyer, D.: Status report on software verification. In: Ábrahám, E., Havelund, K. (eds.) TACAS 2014 (ETAPS). LNCS, vol. 8413, pp. 373-388. Springer, Heidelberg (2014)
9. Beyer, D.: Software verification and verifiable witnesses. In: Baier, C., Tinelli, C. (eds.) TACAS 2015. LNCS, vol. 9035, pp. 401-416. Springer, Heidelberg (2015)
10. Beyer, D., Henzinger, T.A., Théoduloz, G.: Configurable software verification: concretizing the convergence of model checking and program analysis. In: Damm, W., Hermanns, H. (eds.) CAV 2007. LNCS, vol. 4590, pp. 504-518. Springer, Heidelberg (2007)
11. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)
12. Boser, B.E., Guyon, I., Vapnik, V.: A training algorithm for optimal margin classifiers. In: Conference on Computational Learning Theory (COLT 1992), pp. 144-152 (1992)
13. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM TIST 2(3), 27 (2011)
14. Clarke, E., Kroening, D., Lerda, F.: A tool for checking ANSI-C programs. In: Jensen, K., Podelski, A. (eds.) TACAS 2004. LNCS, vol. 2988, pp. 168-176. Springer, Heidelberg (2004)
15. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273-297 (1995)
16. Demyanova, Y., Veith, H., Zuleger, F.: On the concept of variable roles and its use in software analysis. In: Formal Methods in Computer-Aided Design (FMCAD 2013), pp. 226-230 (2013)
17. Dudka, K., Peringer, P., Vojnar, T.: Byte-precise verification of low-level list manipulation. In: Logozzo, F., Fähndrich, M. (eds.) Static Analysis. LNCS, vol. 7935, pp. 215-237. Springer, Heidelberg (2013)
18. Gebser, M., Kaminski, R., Kaufmann, B., Schaub, T., Schneider, M.T., Ziller, S.: A portfolio solver for answer set programming: preliminary report. In: Delgrande, J.P., Faber, W. (eds.) LPNMR 2011. LNCS, vol. 6645, pp. 352-357. Springer, Heidelberg (2011)
19. Gomes, C.P., Selman, B.: Algorithm portfolios. Artif. Intell. 126(1-2), 43-62 (2001)
20. Gurfinkel, A., Belov, A.: FrankenBit: bit-precise verification with many bits. In: Ábrahám, E., Havelund, K. (eds.) TACAS 2014 (ETAPS). LNCS, vol. 8413, pp. 408-411. Springer, Heidelberg (2014)
21. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263-1284 (2009)
22. Hsu, C.W., Chang, C.C., Lin, C.J.: A practical guide to support vector classification (2003)
23. Huang, Y.M., Du, S.X.: Weighted support vector machine for classification with uneven training class sizes. Mach. Learn. Cybern. 7, 4365-4369 (2005)
24. Huberman, B.A., Lukose, R.M., Hogg, T.: An economics approach to hard computational problems. Science 275(5296), 51-54 (1997)
25. Kadioglu, S., Malitsky, Y., Sabharwal, A., Samulowitz, H., Sellmann, M.: Algorithm selection and scheduling. In: Lee, J. (ed.) CP 2011. LNCS, vol. 6876, pp. 454-469. Springer, Heidelberg (2011)
26. Lobjois, L., Lemaitre, M.: Branch and bound algorithm selection by performance prediction. In: Mostow, J., Rich, C. (eds.) National Conference on Artificial Intelligence and Innovative Applications of Artificial Intelligence Conference, pp. 353-358 (1998)
27. Maratea, M., Pulina, L., Ricca, F.: The multi-engine ASP solver me-asp. In: del Cerro, L.F., Herzig, A., Mengin, J. (eds.) JELIA 2012. LNCS, vol. 7519, pp. 484-487. Springer, Heidelberg (2012)
28. O'Mahony, E., Hebrard, E., Holland, A., Nugent, C., O'Sullivan, B.: Using case-based reasoning in an algorithm portfolio for constraint solving. In: Irish Conference on Artificial Intelligence and Cognitive Science (2008)
29. Pani, T.: Loop patterns in C programs. Diploma Thesis (2014). http://forsyte.at/static/people/pani/sloopy/thesis.pdf
30. Pulina, L., Tacchella, A.: A multi-engine solver for quantified boolean formulas. In: Bessiere, C. (ed.) CP 2007. LNCS, vol. 4741, pp. 574-589. Springer, Heidelberg (2007)
31. Pulina, L., Tacchella, A.: A self-adaptive multi-engine solver for quantified boolean formulas. Constraints 14(1), 80-116 (2009)
32. Rice, J.R.: The algorithm selection problem. Adv. Comput. 15, 65-118 (1976)
33. Roussel, O.: Description of ppfolio. http://www.cril.univ-artois.fr/~roussel/ppfolio/solver1.pdf
34. Samulowitz, H., Memisevic, R.: Learning to solve QBF. In: Proceedings of the Conference on Artificial Intelligence (AAAI), pp. 255-260 (2007)
35. Tulsian, V., Kanade, A., Kumar, R., Lal, A., Nori, A.V.: MUX: algorithm selection for software model checkers. In: Working Conference on Mining Software Repositories, pp. 132-141 (2014)
36. Wu, T.F., Lin, C.J., Weng, R.C.: Probability estimates for multi-class classification by pairwise coupling. J. Mach. Learn. Res. 5, 975-1005 (2004)
37. Xu, L., Hutter, F., Hoos, H.H., Leyton-Brown, K.: SATzilla: portfolio-based algorithm selection for SAT. J. Artif. Intell. Res. (JAIR) 32, 565-606 (2008)