CompSAST

SAST tools comparison

Authors:

Introduction

This project is a modest comparative analysis of seven Static Application Security Testing (SAST) tools, evaluated across three security benchmarks of varying scale. The objective is to aggregate and examine the outputs of the SAST scanners, assess their vulnerability detection capability in terms of MITRE CWE categories, and compare the tools pairwise where feasible.

The following static analysis tools were selected:

These analyzers were chosen because the study focuses on open-source tools; PVS-Studio was added experimentally for comparison, since it is available free of charge for open-source projects upon request to the vendor (source).

Scanning was performed on the following open benchmarks, matched to the programming languages fully supported by each tool:

These benchmarks were deliberately chosen for their heterogeneity, not only in target programming languages but also in corpus size (file count and lines of code):

Preliminary aggregate scanning results for each analyzer are depicted in the figure below:

Total metrics summary

The figure below shows which test suites each SAST tool was tested on:

Test suites coverage

Characteristics of each analyzer with references:

sast chars

The source spreadsheet is available at: LINK

IAMeter

by Ilya-Linh Nguen (GitHub: ilyalinhnguyen)

Scope of Scanning

Three repositories from the Positive Technologies organization were selected:

They are written in Go, Java, and PHP respectively.

IAMeter_Go
IAMeter_Java
IAMeter_PHP

Analyzer Usage

Each repository used its own analyzer scope. This is because some analyzers do not support every programming language used in the benchmark.

The table below shows which analyzer scanned each project (following the rows in total.csv / CompSAST) and notes where a tool was not included in the set or supports only one of the languages.

Analyzer IAMeter_Go IAMeter_Java IAMeter_PHP
Semgrep yes yes yes
OpenGrep yes yes yes
CodeQL yes yes no
SonarQube yes yes yes
PMD no* yes no*
PVS-Studio no yes no
Joern (joern-scan) yes yes yes

All analyzers except PVS-Studio scanned each project as a whole. Due to the specifics of its operation, and to obtain more accurate results, PVS-Studio scanned the project file by file.

Analysis Process and Helper Scripts

Semgrep

This analyzer was run with the following command:

semgrep scan --config auto --sarif --output=results.sarif
OpenGrep

This analyzer was run with the following command:

opengrep scan --sarif-output=output.sarif
CodeQL

To scan with this analyzer, a database must first be created:

codeql database create <db-name> --language=<lang> --source-root <path-to-root>

Then the analysis is run:

codeql database analyze <db-name> \
  --format=sarifv2.1.0 \
  --output=<output-name>.sarif \
  codeql/java-queries:codeql-suites/java-security-and-quality.qls # Add the rule set for the specific programming language here
SonarQube

First, SonarQube had to be deployed locally. This was done with Docker:

docker run -d --name sonarqube -p 9000:9000 sonarqube:community

Then projects for the repositories were created in SonarQube. After that, sonar-scanner had to be downloaded and the analysis was run:

cd project

sonar-scanner \
  -Dsonar.projectKey=<project-key> \
  -Dsonar.sources=. \
  -Dsonar.host.url=http://localhost:9000 \
  -Dsonar.token=<sonar-token>

After that, issues were exported from the SonarQube server to SARIF using the compsast-artifacts/sonar_issues_to_sarif.py script. The script works as follows:
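
A minimal sketch of that conversion, assuming the standard SonarQube Web API endpoint /api/issues/search and a plain requests client; the field handling below is illustrative, and the actual script may differ:

import json
import os

import requests

def sonar_issues_to_sarif(host: str, project_key: str, token: str) -> dict:
    """Fetch all issues of a project and wrap them in a minimal SARIF log."""
    results, page, page_size = [], 1, 500
    while True:
        resp = requests.get(
            f"{host}/api/issues/search",
            params={"componentKeys": project_key, "p": page, "ps": page_size},
            auth=(token, ""),  # SonarQube expects the token as the username
        )
        resp.raise_for_status()
        data = resp.json()
        for issue in data.get("issues", []):
            results.append({
                "ruleId": issue.get("rule", ""),
                "message": {"text": issue.get("message", "")},
                "locations": [{
                    "physicalLocation": {
                        # "component" looks like "<project-key>:path/to/file"
                        "artifactLocation": {"uri": issue.get("component", "").split(":", 1)[-1]},
                        "region": {"startLine": issue.get("line", 1)},
                    }
                }],
            })
        if page * page_size >= data.get("total", 0):
            break
        page += 1
    return {"version": "2.1.0",
            "runs": [{"tool": {"driver": {"name": "SonarQube"}}, "results": results}]}

if __name__ == "__main__":
    sarif = sonar_issues_to_sarif("http://localhost:9000", "<project-key>",
                                  os.environ["SONAR_TOKEN"])
    with open("sonarqube.sarif", "w", encoding="utf-8") as fh:
        json.dump(sarif, fh, indent=2)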

Example:

export SONAR_TOKEN=<sonar-token>
python3 compsast-artifacts/sonar_issues_to_sarif.py \
  --host http://localhost:9000 \
  --project-key <project-key> \
  -o sonarqube.sarif
PMD

This analyzer was run with the following command:

pmd check --dir ./src --rulesets category/java/security.xml  --format sarif --report-file pmd-report.sarif
PVS-Studio

pvs_iameter_java_per_file.sh sequentially runs PVS-Studio Java on each file in IAMeter_Java/src/main/java/iameter/*.java, writes a separate JSON report for each file to pvs-by-file/, and then calls merge_pvs_json_reports.py, which merges these reports into a single pvs_project_report_per_file.json. After that, plog-converter from the PVS distribution converts the final JSON report to SARIF:

plog-converter -t sarif -o pvs-iameter.sarif pvs_project_report_per_file.json
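
For reference, a minimal sketch of the merge step performed by merge_pvs_json_reports.py, assuming each per-file report is a PVS-Studio JSON report whose findings are stored in a top-level list (the key name "warnings" is an assumption; the real script may differ):

import json
import sys
from pathlib import Path

def merge_pvs_reports(report_dir: str, out_file: str) -> None:
    """Concatenate the finding lists of per-file PVS-Studio JSON reports."""
    merged = None
    for path in sorted(Path(report_dir).glob("*.json")):
        report = json.loads(path.read_text(encoding="utf-8"))
        if merged is None:
            merged = report  # keep the metadata of the first report
        else:
            # "warnings" is assumed to be the list of findings in each report
            merged.setdefault("warnings", []).extend(report.get("warnings", []))
    Path(out_file).write_text(json.dumps(merged, indent=2), encoding="utf-8")

if __name__ == "__main__":
    merge_pvs_reports(sys.argv[1] if len(sys.argv) > 1 else "pvs-by-file",
                      "pvs_project_report_per_file.json")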
Joern

Scanning was performed by the compsast-artifacts/joern_iameter_all.sh script. It runs joern-scan three times from the repository root: separately for IAMeter_Go, IAMeter_Java, and IAMeter_PHP; each run contains files from only one language. Before the Java run, mvn -q compile is executed. The output of each run is written next to the project as IAMeter_*/joern-scan.txt. Then joern_scan_txt_to_sarif.py is executed to build IAMeter_*/joern-scan.sarif. Language keys are set through the JOERN_LANG_* variables or default to golang, java, and php.
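
A rough sketch of what such a text-to-SARIF conversion can look like, assuming each joern-scan finding occupies one line in a "Result: <score> : <title>: <path>:<line>" style; the exact output format and the regular expression below are assumptions, not the actual script:

import json
import re
import sys
from pathlib import Path

# Assumed joern-scan line format: "Result: <score> : <title>: <path>:<line>"
LINE_RE = re.compile(
    r"^Result:\s*(?P<score>[\d.]+)\s*:\s*(?P<title>.+?):\s*(?P<path>\S+?):(?P<line>\d+)")

def txt_to_sarif(txt_path: str, sarif_path: str, tool_name: str = "joern-scan") -> None:
    """Convert a joern-scan text report into a minimal SARIF log."""
    results = []
    for raw in Path(txt_path).read_text(encoding="utf-8", errors="replace").splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue
        results.append({
            "ruleId": match.group("title"),
            "message": {"text": raw.strip()},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": match.group("path")},
                    "region": {"startLine": int(match.group("line"))},
                }
            }],
        })
    sarif = {"version": "2.1.0",
             "runs": [{"tool": {"driver": {"name": tool_name}}, "results": results}]}
    Path(sarif_path).write_text(json.dumps(sarif, indent=2), encoding="utf-8")

if __name__ == "__main__":
    txt_to_sarif(sys.argv[1], sys.argv[2])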

Scan Results

The table below summarizes which CWE categories the tools found after SARIF matching. The "-" marker means that the analyzer was not run for the language of that repository.

Analyzer IAMeter_Java IAMeter_Go IAMeter_PHP
Semgrep CWE-79 and CWE-611 CWE-79 not found
OpenGrep CWE-79 and CWE-611 CWE-79 not found
CodeQL CWE-79 not found -
SonarQube not found - not found
PMD not found - -
PVS-Studio not found - -
Joern not found not found not found

Total statistics:

IAMeter Golang (source):

iameter-go

IAMeter Java (source):

iameter-java

IAMeter PHP (source):

iameter-php

All artifacts on GitHub: LINK

OWASP Benchmarks

by Peter Zavadskii (GitHub: Abraham14711)

1. Scope of Scanning

The benchmark dataset consists of a curated set of vulnerable projects used for evaluating static analysis tools. Based on the archive structure:

2. Tools Used

Analyzers

The following static analysis tools were used:

OWASP BenchmarkJava

The OWASP BenchmarkJava dataset is a mature and widely used benchmark for evaluating static application security testing (SAST) tools. In version 1.2, it contains approximately 2,740 test cases, each implemented as an individual Java servlet. Earlier versions contained significantly more test cases (over 20,000), but the current version focuses on a curated and balanced subset.

Each test case represents a single vulnerability instance (or a false positive case) associated with a specific CWE category. The dataset is structured as a full web application, with the main source code located in the src/ directory. It also includes supporting scripts, tooling, and a ground truth file (expectedresults-1.2.csv) used for evaluation.

The project is primarily written in Java, with some HTML components used for web interaction. In practice, this results in thousands of Java classes and a codebase that reaches tens of thousands of lines of code.

The dataset covers a focused set of common web vulnerabilities. The main CWE categories include:

Overall, BenchmarkJava provides strong coverage of classical web application vulnerabilities and is considered a standard dataset for SAST benchmarking.


OWASP BenchmarkPython

The OWASP BenchmarkPython dataset is a newer and less mature benchmark compared to its Java counterpart. It contains approximately 1,230 test cases and is currently considered a preliminary version (v0.1).

Like the Java version, each test case represents a single vulnerability or a negative (non-vulnerable) example. The dataset is also structured as a web application, but it is significantly smaller in size—roughly two to three times smaller than BenchmarkJava in terms of both test cases and overall code volume.

The project is written primarily in Python and follows a similar philosophy: synthetic, well-isolated test cases designed for precise evaluation of static analyzers.

BenchmarkPython covers a broader and slightly more modern range of CWE categories compared to the Java dataset. These include:

Notably, this dataset includes vulnerability types that are less represented in the Java benchmark, such as deserialization issues and XXE.


Comparison and Key Takeaways

BenchmarkJava is larger, more mature, and more widely adopted. It provides a stable and well-understood baseline for evaluating static analysis tools, especially for traditional web vulnerabilities.

BenchmarkPython, while smaller and less mature, introduces a broader set of CWE categories and reflects more modern vulnerability patterns. However, its limited size and early-stage development make it less comprehensive for large-scale evaluation.

Both datasets share the same core design principles:

These characteristics make them particularly suitable for quantitative evaluation of static analysis tools using metrics such as precision, recall, and F1-score.

Report:

3. Scanning Methodology

All projects were scanned using the whole-project analysis method. It is the only feasible approach for the OWASP benchmarks, since these projects are too large to scan file by file manually.

Reproducibility Scripts (source on GitHub)

The analysis includes custom scripts:

These scripts perform the following steps (a condensed sketch follows the list):

  1. Parse raw tool outputs
  2. Normalize findings
  3. Map findings to CWE IDs
  4. Compare results against ground truth
  5. Compute evaluation metrics
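
A condensed sketch of steps 4-5, assuming the findings have already been reduced to a mapping from test case name to the set of reported CWE ids, and assuming the expectedresults-1.2.csv layout of "test name, category, real vulnerability, CWE" columns; the helper names and column handling below are illustrative, not the actual scripts:

import csv

def load_ground_truth(path: str) -> dict:
    """Map test case name -> (is_vulnerable, expected CWE) from expectedresults-1.2.csv."""
    truth = {}
    with open(path, newline="", encoding="utf-8") as fh:
        for row in csv.reader(fh):
            if not row or row[0].startswith("#"):
                continue  # skip the header / comment line
            name, _category, real_vuln, cwe = row[0], row[1], row[2], row[3]
            truth[name.strip()] = (real_vuln.strip().lower() == "true", int(cwe))
    return truth

def score(findings: dict, truth: dict) -> dict:
    """findings: test case name -> set of reported CWE ids (ints)."""
    tp = fp = tn = fn = 0
    for name, (is_vuln, expected_cwe) in truth.items():
        hit = expected_cwe in findings.get(name, set())
        if is_vuln:
            tp, fn = tp + hit, fn + (not hit)
        else:
            fp, tn = fp + hit, tn + (not hit)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn,
            "precision": precision, "recall": recall, "f1": f1}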

Results:

Java Benchmark (source):

alt text

Python Benchmark (source):

alt text

NIST Juliet C#

by Arsen Galiev (GitHub: projacktor)

Test suite source: https://samate.nist.gov/SARD/test-suites/110

Scope of Scanning

The NIST Juliet C# benchmark has the following characteristics:

Types of CWEs:

cwe_id cwe_name
CWE-15 External Control of System or Configuration Setting
CWE-23 Relative Path Traversal
CWE-36 Absolute Path Traversal
CWE-78 OS Command Injection
CWE-80 XSS
CWE-81 XSS Error Message
CWE-83 XSS Attribute
CWE-89 SQL Injection
CWE-90 LDAP Injection
CWE-94 Improper Control of Generation of Code
CWE-113 HTTP Response Splitting
CWE-114 Process Control
CWE-117 Improper Output Neutralization for Logs
CWE-129 Improper Validation of Array Index
CWE-134 Externally Controlled Format String
CWE-190 Integer Overflow
CWE-191 Integer Underflow
CWE-193 Off by One Error
CWE-197 Numeric Truncation Error
CWE-209 Information Leak Error
CWE-226 Sensitive Information Uncleared Before Release
CWE-248 Uncaught Exception
CWE-252 Unchecked Return Value
CWE-253 Incorrect Check of Function Return Value
CWE-256 Unprotected Storage of Credentials
CWE-259 Hard Coded Password
CWE-261 Weak Cryptography for Passwords
CWE-284 Improper Access Control
CWE-313 Cleartext Storage in a File or on Disk
CWE-314 Cleartext Storage in the Registry
CWE-315 Cleartext Storage in Cookie
CWE-319 Cleartext Tx Sensitive Info
CWE-321 Hard Coded Cryptographic Key
CWE-325 Missing Required Cryptographic Step
CWE-327 Use Broken Crypto
CWE-328 Reversible One Way Hash
CWE-329 Not Using Random IV with CBC Mode
CWE-336 Same Seed in PRNG
CWE-338 Weak PRNG
CWE-350 Reliance on Reverse DNS Resolution for Security Action
CWE-366 Race Condition within a Thread
CWE-369 Divide by Zero
CWE-378 Temporary File Creation With Insecure Perms
CWE-379 Temporary File Creation in Insecure Dir
CWE-390 Error Without Action
CWE-395 Catch NullPointerException
CWE-396 Catch Generic Exception
CWE-397 Throw Generic Exception
CWE-398 Code Quality
CWE-400 Uncontrolled Resource Consumption
CWE-404 Improper Resource Shutdown
CWE-426 Untrusted Search Path
CWE-427 Uncontrolled Search Path Element
CWE-440 Expected Behavior Violation
CWE-459 Incomplete Cleanup
CWE-470 Unsafe Reflection
CWE-476 NULL Pointer Dereference
CWE-477 Obsolete Functions
CWE-478 Missing Default Case in Switch
CWE-481 Assigning Instead of Comparing
CWE-482 Comparing Instead of Assigning
CWE-483 Incorrect Block Delimitation
CWE-486 Compare Classes by Name
CWE-506 Embedded Malicious Code
CWE-510 Trapdoor
CWE-511 Logic Time Bomb
CWE-523 Unprotected Cred Transport
CWE-526 Info Exposure Environment Variables
CWE-532 Inclusion of Sensitive Info in Log
CWE-535 Info Exposure Shell Error
CWE-539 Information Exposure Through Persistent Cookie
CWE-546 Suspicious Comment
CWE-549 Missing Password Masking
CWE-561 Dead Code
CWE-563 Assign to Variable Without Use
CWE-566 Authorization Bypass Through SQL Primary
CWE-570 Expression Always False
CWE-571 Expression Always True
CWE-582 Array Public Readonly Static
CWE-598 Information Exposure QueryString
CWE-601 Open Redirect
CWE-605 Multiple Binds Same Port
CWE-606 Unchecked Loop Condition
CWE-609 Double Checked Locking
CWE-613 Insufficient Session Expiration
CWE-614 Sensitive Cookie Without Secure
CWE-615 Info Exposure by Comment
CWE-617 Reachable Assertion
CWE-643 Xpath Injection
CWE-667 Improper Locking
CWE-674 Uncontrolled Recursion
CWE-675 Duplicate Operations on Resource
CWE-681 Incorrect Conversion Between Numeric Types
CWE-690 NULL Deref From Return
CWE-698 Execution After Redirect
CWE-759 Unsalted One Way Hash
CWE-760 Predictable Salt One Way Hash
CWE-764 Multiple Locks
CWE-765 Multiple Unlocks
CWE-772 Missing Release of Resource
CWE-775 Missing Release of File Descriptor or Handle
CWE-789 Uncontrolled Mem Alloc
CWE-832 Unlock Not Locked
CWE-833 Deadlock
CWE-835 Infinite Loop

Which tools were/not used

The benchmark was run with the following SAST tools:

For the reasons given below, it was not run with SonarQube, PMD, or Joern:

Terms:

Scan process

Benchmarking

The logic of metrics counting is the following:

1) Juliet has a set of methods in each file:

Bad()
BadSink()
GoodG2B()
GoodB2G()
GoodG2BSink()
GoodB2GSink()

Bad* and Good* methods are considered bad and good regions, respectively. The Good() dispatch method is not counted as a separate region, to avoid inflating the TN count.

2) Each method contains marker comments on the relevant code lines:

/* FLAW */
/* POTENTIAL FLAW */
/* FIX */

The benchmark script in --no-cwe-aware mode finds the marked code lines, distinguishes the marker type, and then determines how to count the finding:

Therefore, in --no-cwe-aware mode the script counts metrics only according to the tools' findings and does not check whether a tool reported the correct CWE for them.
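
A simplified sketch of that line-aware counting, assuming a finding is attributed to the nearest marker within a small window of lines in the same file; the window size and the FLAW-to-TP / FIX-to-FP mapping reflect the description above but are illustrative:

def classify_line_aware(finding_line: int, markers: list, window: int = 3) -> str:
    """markers: list of (line, kind) pairs, kind in {"FLAW", "POTENTIAL FLAW", "FIX"}."""
    for line, kind in markers:
        if abs(line - finding_line) <= window:
            # A hit near a FLAW / POTENTIAL FLAW marker counts toward TP,
            # a hit near a FIX marker toward FP; the reported CWE is not checked here.
            return "TP" if kind in ("FLAW", "POTENTIAL FLAW") else "FP"
    return "unmatched"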

The default mode of benchmark_sarif.py is --cwe-aware. In this mode, the line-aware matching described above is still used, but a finding is counted only if its CWE corresponds to the expected CWE of the Juliet testcase.

The script determines the CWE of a finding in the following order:

1) If --rule-cwe-map is provided, the script uses the explicit mapping from rule id to CWE:

rule_id,cwe
csharp.lang.security.sql-injection,CWE89

2) If the SARIF file contains rule metadata, the script reads CWE values from:

runs[].tool.driver.rules[].properties.tags

This is required for tools such as CodeQL and PVS-Studio. CodeQL may store rule CWE values as tags such as external/cwe/cwe-089; PVS-Studio may store rule CWE values as tags such as external/cwe/cwe-571, while the individual result contains only a rule id such as V3022.

3) If the CWE is present directly in the rule id or in the finding message, the script uses it as a fallback.

CWE identifiers are normalized before comparison. For example, CWE089, CWE-089 and CWE89 are treated as the same CWE.
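
A minimal sketch of this normalization and of reading rule CWE tags from SARIF metadata; the tag formats are the ones mentioned above, while the function names and other details are assumptions:

import re

CWE_RE = re.compile(r"cwe[-_]?0*(\d+)", re.IGNORECASE)

def normalize_cwe(value: str) -> str:
    """CWE089, CWE-089 and CWE89 all normalize to 'CWE89'."""
    match = CWE_RE.search(value or "")
    return f"CWE{int(match.group(1))}" if match else ""

def rule_cwes_from_sarif_run(run: dict) -> dict:
    """Map rule id -> set of CWEs from runs[].tool.driver.rules[].properties.tags."""
    mapping = {}
    for rule in run.get("tool", {}).get("driver", {}).get("rules", []):
        tags = rule.get("properties", {}).get("tags", [])
        cwes = {normalize_cwe(tag) for tag in tags if normalize_cwe(tag)}
        if cwes:
            mapping[rule.get("id", "")] = cwes  # e.g. "external/cwe/cwe-089" -> {"CWE89"}
    return mapping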

In --cwe-aware mode the classification rules are:

For example, if a CodeQL finding cs/web/cookie-secure-not-set has CWE319 and appears near a Juliet CWE113 HTTP Response Splitting marker, it is classified as out_of_scope, not as TP or FP for CWE113.

This mode is used for strict comparison of tools when reliable CWE information is available. It is the recommended mode for CodeQL and PVS-Studio. For tools without reliable CWE mapping, such as the tested Semgrep/OpenGrep configuration, --no-cwe-aware can be used to evaluate only line-aware marker hits.
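
A sketch of how a single finding could then be classified in --cwe-aware mode; the decision order is inferred from the description above:

def classify_cwe_aware(finding_cwe: str, expected_cwe: str, marker_kind: str) -> str:
    """marker_kind: "vulnerable" for FLAW / POTENTIAL FLAW markers, "safe" for FIX markers."""
    if not finding_cwe:
        return "unknown"       # no CWE could be determined for this rule
    if finding_cwe != expected_cwe:
        return "out_of_scope"  # e.g. a CWE319 finding near a Juliet CWE113 marker
    return "TP" if marker_kind == "vulnerable" else "FP"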

Semgrep

For testing Semgrep, its Docker version was used:

docker run --rm -v "$PWD/src:/src" returntocorp/semgrep semgrep scan --sarif -o /src/semgrep.sarif --config auto /src

All test cases were scanned as a single project, since Semgrep allows this and it is closer to a real CI process in production.

After obtaining the scan results in SARIF format, the benchmark evaluation script was used (source):

python3 benchmark_sarif.py \
  --src src/testcases \
  --sarif sarif/semgrep.sarif \
  --tool semgrep \
  --out-dir results/semgrep_eval \
  --no-cwe-aware

Opengrep

The test suite was scanned with native OpenGrep version 1.20.0 using the following command:

cd ./src/testcases
opengrep scan --sarif -o ./opengrep.sarif ./

The SARIF output of the scan was evaluated in the same way as Semgrep's:

python3 benchmark_sarif.py \
  --src src/testcases \
  --sarif sarif/opengrep.sarif \
  --tool opengrep \
  --out-dir results/opengrep_eval \
  --no-cwe-aware
Methodological limitation for Semgrep and OpenGrep

Semgrep and OpenGrep results were evaluated in --no-cwe-aware mode because the tested rule sets did not provide a reliable rule-to-CWE mapping.

Therefore, their metrics should be interpreted as line-aware marker matching results rather than strict CWE-aware vulnerability detection results. In this mode, a finding is counted if it appears near a Juliet FLAW, POTENTIAL FLAW, or FIX marker, but the benchmark does not verify whether the finding corresponds to the expected CWE of the testcase.

As a result, Semgrep and OpenGrep scores are not directly comparable with CodeQL and PVS-Studio scores evaluated in --cwe-aware mode. They should be treated as approximate results until a validated rule_id -> CWE mapping is provided and the benchmark is rerun in CWE-aware mode.

CodeQL

CodeQL and PVS-Studio for .NET require a project build for analysis, so both were run on Windows 11 with Visual Studio installed.

CodeQL CLI version 2.25.2 was installed manually together with the query suites from the GitHub sources. The scanning scripts for per-CWE-folder analysis are shown below:

codeql database create codeql-db-csharp `
  --language=csharp `
  --source-root . `
  --command='powershell -NoProfile -ExecutionPolicy Bypass -File .\script.ps1'

script.ps1:

msbuild .\src\testcasesupport\TestCaseSupport.sln /t:Restore,Rebuild /p:Configuration=Debug
if ($LASTEXITCODE -ne 0) { exit $LASTEXITCODE }

Get-ChildItem .\src\testcases -Recurse -Filter *.sln | ForEach-Object {
    msbuild $_.FullName /t:Restore,Rebuild /p:Configuration=Debug
    if ($LASTEXITCODE -ne 0) { exit $LASTEXITCODE }
}

The analysis was then run:

codeql database analyze .\codeql-db-csharp2 codeql/csharp-queries:codeql-suites/csharp-security-extended.qls --format=sarif-latest --output=.\sarif\codeql.sarif --threads=4 --ram=8000

With the SARIF report obtained, benchmark_sarif.py was run with the following setup for CodeQL:

python3 benchmark_sarif.py \
    --src src/testcases \
    --sarif sarif/codeql-security.sarif \
    --tool codeql \
    --out-dir results/codeql_eval

PVS-Studio

PVS-Studio for .NET v7.42.105479.2635 was installed, its license was activated, and the following scripts were used for the per-CWE-folder analysis:

# analysis
$repo = "<path>\2020-08-01-juliet-test-suite-for-csharp-v1-3"
$pvs = "${env:ProgramFiles(x86)}\PVS-Studio\PVS-Studio_Cmd.exe"
$conv = "${env:ProgramFiles(x86)}\PVS-Studio\PlogConverter.exe"
$out = "$repo\_pvs"

New-Item -ItemType Directory -Force $out | Out-Null

Get-ChildItem "$repo\src\testcases" -Filter *.sln -Recurse | ForEach-Object {
    $sln = $_.FullName
    $id = $sln.Substring($repo.Length).TrimStart('\') -replace '[\\/:*?"<>| ]', '_'
    $plog = Join-Path $out "$id.plog"

    Write-Host "Analyzing $sln"
    & $pvs -t "$sln" -o "$plog"
}

# merge all per-solution reports into one
$plogs = Get-ChildItem "$out" -Filter *.plog -Recurse |
  Where-Object { $_.Name -notmatch '^pvs-all\.' } |
  ForEach-Object { $_.FullName }

& $conv `
  -t Plog `
  -o "$out" `
  -r "$repo" `
  -m CWE,OWASP `
  -n "pvs-all" `
  @plogs

& $conv `
  -t Sarif,Csv `
  -o "$out" `
  -r "$repo" `
  -m CWE,OWASP `
  -n "pvs-all" `
  "$out\pvs-all.plog"

Results

Summary for all SAST tools:

tool TP FP TN FN precision recall specificity F1 total vulnerable points total safe points total findings unmatched FP findings unknown findings out-of-scope findings
CodeQL 1,433 299 127,778 54,958 0.827367 0.025412 0.997665 0.049309 56,391 127,817 4,217 260 3 2,482
OpenGrep 671 4,673 126,961 55,720 0.125561 0.011899 0.964500 0.021738 56,391 127,817 5,344 3,817 0 0
Semgrep 671 4,673 126,961 55,720 0.125561 0.011899 0.964500 0.021738 56,391 127,817 5,344 3,817 0 0
PVS-Studio 1,324 217 127,710 55,067 0.859182 0.023479 0.998304 0.045709 56,391 127,817 54,437 110 6 52,870

Key observations:

Juliet CWE coverage

tool CWE groups with TP evaluated CWE groups
CodeQL 10 105
PVS-Studio 10 105
Semgrep 3 105
OpenGrep 3 105

More detailed statistics for each CWE are available in this Google Sheet.

All artifacts (benchmark script .csv outputs, .sarif scan outputs, script source code) can be found here: repo

Overall, the results show that the tested tools are highly conservative on this benchmark configuration. CodeQL and PVS-Studio provide better precision, but their recall remains low because only a small subset of Juliet CWE classes is matched by the enabled rules and CWE-aware scoring. Semgrep and OpenGrep produce identical results and should be treated as line-aware marker matching baselines rather than strict CWE-aware SAST results.

NIST Juliet Java

by Kirill Nosov (GitHub: InnoNodo)

1. Scanned Dataset

Dataset: NIST Juliet Test Suite for Java (JDK 8)
Number of files: 23,721 .java files
Lines of code: ~5,100,000
Programming language: Java (JDK 8)
Number of CWE directories: 106

Included CWEs:

2. Tools Used for Scanning

Analyzers that were used:

  1. OpenGrep
    • Type: static text analyzer based on regular expressions.
    • Reason for selection: supports any Java version and is convenient for finding specific CWE patterns in source code.
  2. Semgrep
    • Type: semantic static analyzer that uses AST-level rules.
    • Reason for selection: supports Java, allows rules to be quickly customized for CWE categories, and works with JDK 8.
  3. PMD
    • Type: classic static code analyzer for Java.
    • Reason for selection: supports Java 8, can scan the whole project or individual classes, and includes many standard rules for detecting bugs and vulnerabilities.
  4. CodeQL
    • Type: powerful analyzer based on queries against a code database.
    • Reason for selection: supports Java and provides a flexible query system.

Analyzers that could not be used:

  1. PVS-Studio
    • Reason: it does not work with sources written for JDK 8, so scanning was skipped.
  2. SonarQube Scanner
    • Reason: it does not support JDK 8. During scanning attempts, the server encountered an Out of Memory error; several machines with AMD Ryzen 5500U and AMD Ryzen 5600H CPUs and 16 GiB RAM could not process the scan results and shut down during processing.
  3. Joern
    • Reason: it does not support versions below JDK 11.

Selection summary:

3. Scanning Methodology

Four analyzers were used for the Juliet Java Suite: PMD, Semgrep, OpenGrep, and CodeQL. Each analyzer scanned the project by separate CWE directories and generated SARIF reports for subsequent analysis.

Semgrep (scripts/scan_semgrep.sh), OpenGrep (scripts/scan_opengrep.sh), and CodeQL (scripts/scan_codeql.sh) work similarly:

Scanning Scripts

3.1 PMD (scripts/scan_pmd.sh)
#!/bin/bash
# Script for scanning the Juliet Java Benchmark with PMD
# Usage: ./scan_pmd.sh [CWE_NUMBER]

set -e

BENCHMARK_DIR="/mnt/c/Users/USER/Downloads/2017-10-01-juliet-test-suite-for-java-v1-3/Java"
SRC_DIR="$BENCHMARK_DIR/src/testcases"
SARIF_DIR="$BENCHMARK_DIR/sarif/pmd"
PMD_BIN="$BENCHMARK_DIR/../tools/pmd-bin-6.55.0"

mkdir -p "$SARIF_DIR"

CWE=${1:-}

if [ -n "$CWE" ]; then
    CWES=("$CWE")
else
    CWES=$(ls -d "$SRC_DIR"/CWE* 2>/dev/null | xargs -n1 basename \
        | grep -v "\.war$" | sed 's/CWE//' | sed 's/_.*//' | sort -u)
fi

echo "Scanning with PMD..."
for cwe in $CWES; do
    echo "=== Scanning CWE$cwe ==="
    dir=$(ls -d "$SRC_DIR"/CWE${cwe}_* 2>/dev/null | grep -v "\.war$" | head -1)

    if [ -z "$dir" ]; then
        echo "CWE$cwe not found"
        continue
    fi

    output="$SARIF_DIR/PMD__${cwe}__results.sarif"

    if [ -f "$output" ]; then
        echo "Already exists: $output"
        continue
    fi

    cd "$PMD_BIN"
    ./run.sh pmd -d "$dir" -R category/java/security.xml -f sarif \
        -report-file "$output" 2>/dev/null || true
    cd -

    if [ -f "$output" ]; then
        results=$(python3 -c "import json; print(len(json.load(open('$output')).get('runs',[{}])[0].get('results',[])))" \
            2>/dev/null || echo "0")
        echo "  Results: $results"
    fi
done

echo "Done. SARIF files: $SARIF_DIR"
3.2 OpenGrep (scripts/scan_opengrep.sh)
#!/bin/bash
# Script for scanning the Juliet Java Benchmark with OpenGrep
# Usage: ./scan_opengrep.sh [CWE_NUMBER]

set -e

BENCHMARK_DIR="/mnt/c/Users/USER/Downloads/2017-10-01-juliet-test-suite-for-java-v1-3/Java"
SRC_DIR="$BENCHMARK_DIR/src/testcases"
SARIF_DIR="$BENCHMARK_DIR/sarif/opengrep"
OPENGREP_BIN="$BENCHMARK_DIR/../tools/opengrep"

mkdir -p "$SARIF_DIR"

CWE=${1:-}
if [ -n "$CWE" ]; then
    CWES=("$CWE")
else
    CWES=$(ls -d "$SRC_DIR"/CWE* 2>/dev/null | xargs -n1 basename \
        | grep -v "\.war$" | sed 's/CWE//' | sed 's/_.*//' | sort -u)
fi

echo "Scanning with OpenGrep..."
for cwe in $CWES; do
    echo "=== Scanning CWE$cwe ==="
    dir=$(ls -d "$SRC_DIR"/CWE${cwe}_* 2>/dev/null | grep -v "\.war$" | head -1)

    if [ -z "$dir" ]; then
        echo "CWE$cwe not found"
        continue
    fi

    output="$SARIF_DIR/OpenGrep__${cwe}__results.sarif"
    if [ -f "$output" ]; then
        echo "Already exists: $output"
        continue
    fi

    ./opengrep scan --source "$dir" --output "$output" --format sarif 2>/dev/null || true

    if [ -f "$output" ]; then
        results=$(python3 -c "import json; print(len(json.load(open('$output')).get('runs',[{}])[0].get('results',[])))" \
            2>/dev/null || echo "0")
        echo "  Results: $results"
    fi
done

echo "Done. SARIF files: $SARIF_DIR"
3.3 CodeQL (scripts/scan_codeql.sh)
# scripts/generate_codeql_sarif.sh
#!/bin/bash
# Generate CodeQL SARIF files from Juliet test suite
# Usage: ./generate_codeql_sarif.sh [CWE_NUMBER]
set -e
CODEQL="/home/nodo/codeql/codeql/codeql"
SRC_ROOT="/mnt/c/Users/USER/Downloads/2017-10-01-juliet-test-suite-for-java-v1-3/Java"
SRC_DIR="$SRC_ROOT/src/testcases"
SARIF_DIR="$SRC_ROOT/sarif/codeql"
mkdir -p "$SARIF_DIR"
CWE=${1:-}
if [ -n "$CWE" ]; then
    CWES=("$CWE")
else
    CWES=$(ls -d "$SRC_DIR"/CWE* 2>/dev/null | grep -v ".war$" | xargs -n1 basename | sed 's/CWE//' | sed 's/_.*//' | sort -u)
fi
echo "Scanning with CodeQL (autobuild)..."
echo "CWE list: ${CWES[*]}"
for cwe in $CWES; do
    echo "=== CWE$cwe ==="

    dir=$(ls -d "$SRC_DIR"/CWE${cwe}_* 2>/dev/null | grep -v ".war$" | head -1)

    if [ -z "$dir" ]; then
        echo "CWE$cwe not found"
        continue
    fi

    output="$SARIF_DIR/CodeQL__${cwe}__autobuild.sarif"

    if [ -f "$output" ]; then
        results=$(python3 -c "import json; print(len(json.load(open('$output')).get('runs',[{}])[0].get('results',[])))" 2>/dev/null || echo "0")
        if [ "$results" != "0" ]; then
            echo "Already exists: $output ($results results)"
            continue
        fi
    fi

    db_path="/tmp/codeql_db_$cwe"
    rm -rf "$db_path"

    $CODEQL database create --language=java --source-root="$dir" --build-mode=autobuild "$db_path" 2>&1 | tail -2

    if [ -d "$db_path" ]; then
        $CODEQL database analyze "$db_path" --format=sarif-latest --output="$output" --ram=4096 2>&1 | tail -1

        if [ -f "$output" ]; then
            results=$(python3 -c "import json; print(len(json.load(open('$output')).get('runs',[{}])[0].get('results',[])))" 2>/dev/null || echo "0")
            echo "  Results: $results"
        fi
    fi

    rm -rf "$db_path"
done
echo "Done. SARIF files: $SARIF_DIR"
3.4 Semgrep
#!/bin/bash
set -e
BENCHMARK_DIR="/mnt/c/.../JulietJava"
SRC_DIR="$BENCHMARK_DIR/src/testcases"
SARIF_DIR="$BENCHMARK_DIR/sarif/semgrep"
mkdir -p "$SARIF_DIR"
CWE=${1:-}
CWES=$( [ -n "$CWE" ] && echo "$CWE" || ls -d "$SRC_DIR"/CWE* | xargs -n1 basename | sed 's/CWE//' | sed 's/_.*//' | sort -u)
for cwe in $CWES; do
    dir=$(ls -d "$SRC_DIR"/CWE${cwe}_* | head -1)
    output="$SARIF_DIR/Semgrep__${cwe}__results.sarif"
    semgrep --lang java --config=auto "$dir" --sarif --output="$output" 2>/dev/null || true
done
3.5 Result Analysis (score_juliet.py)
#!/usr/bin/env python3
"""
Score SARIF findings against the NIST Juliet Java test suite.

Uses FLAW/FIX comments in source code to determine ground truth.
"""

import argparse
import csv
import json
import os
import re
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Optional

METHOD_RE = re.compile(
    r"^\s*(?:public|private|protected)?\s+"
    r"(?:(?:static|final|abstract|synchronized)\s+)*"
    r"[\w<>\[\],\s]+\s+"
    r"(?P<name>[A-Za-z_]\w*)\s*\([^)]*\)\s*\{?\s*$"
)

CWE_RE = re.compile(r"\bCWE[-_ ]?(?P<num>\d+)\b", re.IGNORECASE)
FILENAME_CWE_RE = re.compile(r"\bCWE(?P<num>\d+)_(?P<name>[^/\\.]+)")

@dataclass
class Finding:
    file: str
    line: int
    rule_id: str
    message: str
    classification: str = "unknown"
    expected_cwe: str = ""

@dataclass
class ExpectedPoint:
    file: str
    line: int
    cwe_id: str
    cwe_name: str
    point_kind: str  # "vulnerable" or "safe"
    comment: str
    status: str = "UNSEEN"
    findings: List[Finding] = field(default_factory=list)

# --- Normalization and parsing functions ---
def normalize_path(path: str, src_root: Path) -> str:
    cleaned = path.replace("\\", "/")
    cleaned = re.sub(r"^[a-zA-Z]:", "", cleaned)
    cleaned = cleaned.lstrip("/")
    marker = "src/testcases/"
    if marker in cleaned:
        cleaned = cleaned.split(marker, 1)[1]
    return cleaned

def parse_cwe_from_file(rel_path: str) -> Optional[tuple]:
    match = FILENAME_CWE_RE.search(rel_path)
    if match:
        return f"CWE{match.group('num')}", match.group("name").replace("_", " ")
    return None

def build_expected_points_from_cwe(src_root: Path, cwe_nums: List[int]) -> List[ExpectedPoint]:
    """Build expected points only for specific CWEs."""
    points = []
    cwe_dirs = [d for d in src_root.iterdir() if d.is_dir() and d.name.startswith("CWE")]

    for cwe_dir in cwe_dirs:
        cwe_match = re.search(r'CWE(\d+)', cwe_dir.name)
        cwe_num = int(cwe_match.group(1)) if cwe_match else None
        if cwe_nums and cwe_num not in cwe_nums:
            continue

        seen = set()
        for path in cwe_dir.rglob("*.java"):
            try:
                rel_path = path.relative_to(src_root).as_posix()
            except ValueError:
                continue

            lines = path.read_text(encoding="utf-8", errors="replace").splitlines()

            parsed = parse_cwe_from_file(rel_path)
            cwe_id = parsed[0] if parsed else ""
            cwe_name = parsed[1] if parsed else ""

            for i, line in enumerate(lines, 1):
                has_flaw = "FLAW" in line and "FIX" not in line and "/*" in line
                has_fix = "FIX" in line and "/*" in line

                if has_flaw:
                    kind = "vulnerable"
                elif has_fix:
                    kind = "safe"
                else:
                    continue

                point_id = f"{rel_path}:{i}"
                if point_id in seen:
                    continue
                seen.add(point_id)

                points.append(ExpectedPoint(
                    file=rel_path,
                    line=i,
                    cwe_id=cwe_id,
                    cwe_name=cwe_name,
                    point_kind=kind,
                    comment=""
                ))

    return points

# --- SARIF loading and classification ---
def load_sarif(path: Path, src_root: Path) -> List[Finding]:
    data = json.loads(path.read_text(encoding="utf-8"))
    findings = []

    for run in data.get("runs", []):
        for result in run.get("results", []):
            locations = result.get("locations") or []
            if not locations:
                continue

            physical = locations[0].get("physicalLocation", {})
            artifact = physical.get("artifactLocation", {})
            region = physical.get("region", {})
            uri = artifact.get("uri", "")
            line = int(region.get("startLine") or 0)
            rule_id = result.get("ruleId", "")
            message = (result.get("message", {})).get("text", "")

            findings.append(Finding(
                file=normalize_path(uri, src_root),
                line=line,
                rule_id=rule_id,
                message=message[:200]
            ))

    return findings

def classify(findings: List[Finding], points: List[ExpectedPoint], window: int = 5) -> None:
    points_by_file = {}
    for point in points:
        points_by_file.setdefault(point.file, []).append(point)

    for finding in findings:
        file_points = points_by_file.get(finding.file, [])
        for point in file_points:
            if point.point_kind == "vulnerable" and abs(point.line - finding.line) <= window:
                finding.classification = "matched_tp"
                finding.expected_cwe = point.cwe_id
                point.findings.append(finding)
                point.status = "TP"
                break
        if finding.classification != "unknown":
            continue
        for point in file_points:
            if point.point_kind == "safe" and abs(point.line - finding.line) <= window:
                finding.classification = "matched_fp"
                finding.expected_cwe = point.cwe_id
                point.findings.append(finding)
                point.status = "FP"
                break
        if finding.classification == "unknown":
            if file_points:
                finding.classification = "unmatched_fp"
            else:
                finding.classification = "unknown"
    for point in points:
        if point.point_kind == "vulnerable" and point.status == "UNSEEN":
            point.status = "FN"
        elif point.point_kind == "safe" and point.status == "UNSEEN":
            point.status = "TN"

# --- Metric counting and CSV output ---
def count_statuses(points: List[ExpectedPoint], findings: List[Finding]) -> Dict[str, int]:
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for point in points:
        if point.status in counts:
            counts[point.status] += 1
    counts["FP"] += sum(1 for f in findings if f.classification == "unmatched_fp")
    return counts

def safe_div(a, b):
    return round(a / b, 6) if b else 0.0

def write_csv(path: Path, fieldnames: List[str], rows: List[Dict]) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

def write_by_cwe(out_dir: Path, tool: str, points: List[ExpectedPoint], findings: List[Finding]) -> None:
    cwes = sorted({p.cwe_id for p in points if p.cwe_id})
    rows = []

    for cwe in cwes:
        cwe_points = [p for p in points if p.cwe_id == cwe]
        cwe_findings = [f for f in findings if f.expected_cwe == cwe]

        tp = sum(1 for p in cwe_points if p.status == "TP")
        fp = sum(1 for p in cwe_points if p.status == "FP")
        tn = sum(1 for p in cwe_points if p.status == "TN")
        fn = sum(1 for p in cwe_points if p.status == "FN")
        fp += sum(1 for f in cwe_findings if f.classification == "unmatched_fp")

        precision = safe_div(tp, tp + fp) if (tp + fp) > 0 else 0
        recall = safe_div(tp, tp + fn) if (tp + fn) > 0 else 0
        specificity = safe_div(tn, tn + fp) if (tn + fp) > 0 else 0
        f1 = round(2 * precision * recall / (precision + recall), 6) if (precision + recall) > 0 else 0

        first = cwe_points[0] if cwe_points else None
        cwe_name = first.cwe_name if first else cwe

        rows.append({
            "cwe_name": cwe_name,
            "tp": tp, "fp":

Usage example:

# Analyze SARIF for a specific CWE
python3 score_juliet.py \
    --sarif sarif/semgrep/Semgrep__259__results.sarif \
    --tool semgrep

4. Analysis Results

4.1 Semgrep

4.2 OpenGrep

4.3 PMD

4.4 CodeQL

All sources and artifacts available at GitHub: LINK

NIST Juliet C/C++

by Sarmat Lutfullin (GitHub: 1sarmatt)

1. Scanned Dataset

Dataset: NIST Juliet Test Suite for C/C++ v1.3
Number of test cases: 64,099
Number of files: 106,077 .c / .cpp files
Programming languages: C, C++
Number of CWE directories: 118

Included CWEs:

2. Tools Used for Scanning

Analyzers that were used:

  1. OpenGrep
  2. Semgrep
  3. PVS-Studio
  4. Joern
  5. CodeQL

Analyzers that could not be used:

  1. PMD
    • Reason: PMD does not have a full C++ parser.
  2. SonarQube Scanner
    • Reason: SonarQube Community Edition cannot analyze C/C++.

Selection summary:

3. Scanning Methodology

Scanning Scripts

3.1 OpenGrep

Scan command:

bash run_opengrep.sh

What the Script Does

#!/usr/bin/env bash
set -euo pipefail

TESTCASES_DIR="./testcases"
REPORTS_DIR="./opengrep_reports"
RULES_DIR="./opengrep_rules/c"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")

# SARIF report
opengrep scan \
  --config "$RULES_DIR" \
  --include "*.c" --include "*.cpp" --include "*.h" \
  --sarif \
  --output "$REPORTS_DIR/report_${TIMESTAMP}.sarif" \
  "$TESTCASES_DIR" || true

# JSON report
opengrep scan \
  --config "$RULES_DIR" \
  --include "*.c" --include "*.cpp" --include "*.h" \
  --json \
  --output "$REPORTS_DIR/report_${TIMESTAMP}.json" \
  "$TESTCASES_DIR" || true

Scan Result

Rules run       : 16
Files scanned   : 58 980 (git-tracked)
Findings        : 48 774

Findings by Rule

Rule Findings
insecure-use-memset 27 520
insecure-use-string-copy-fn 8 415
incorrect-use-ato-fn 6 764
insecure-use-strcat-fn 5 763
function-use-after-free 180
insecure-use-scanf-fn 102
insecure-use-gets-fn 18
double-free 12
3.2 Semgrep

Scan command (run_semgrep.sh):

bash run_semgrep.sh

What the Script Does

Runs Semgrep twice: once for a JSON report and once for a text report.

#!/usr/bin/env bash
set -euo pipefail

TESTCASES_DIR="./testcases"
REPORTS_DIR="./semgrep_reports"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")

mkdir -p "$REPORTS_DIR"

CONFIGS=("p/c" "p/default" "p/security-audit")

CONFIG_ARGS=()
for cfg in "${CONFIGS[@]}"; do
  CONFIG_ARGS+=(--config "$cfg")
done

# JSON report
semgrep \
  "${CONFIG_ARGS[@]}" \
  --include "*.c" --include "*.cpp" --include "*.h" \
  --json \
  --output "$REPORTS_DIR/report_${TIMESTAMP}.json" \
  "$TESTCASES_DIR" || true

# Text report
semgrep \
  "${CONFIG_ARGS[@]}" \
  --include "*.c" --include "*.cpp" --include "*.h" \
  --output "$REPORTS_DIR/report_${TIMESTAMP}.txt" \
  "$TESTCASES_DIR" || true

Rule Sets

Ruleset Description
p/c Basic C rules: strcpy, strcat, gets, double-free
p/default General default ruleset
p/security-audit Extended security audit
3.3 PVS-Studio

Scan Command

python run_pvs_studio.py

What the Script Does

#!/usr/bin/env python3
import subprocess, os, glob

PVS = "pvs-studio-analyzer"
CONVERTER = "plog-converter"
TESTCASES = "C/testcases"
SUPPORT = "C/testcasesupport"
OUTPUT_DIR = "pvs_results"

# Step 1: Compilation tracing
subprocess.run([
    PVS, "trace", "--",
    "gcc", "-c", "-w",
    f"-I{TESTCASES}", f"-I{SUPPORT}",
    f"{TESTCASES}/CWE484_Omitted_Break_Statement_in_Switch/*.c",
    f"{TESTCASES}/CWE789_Uncontrolled_Mem_Alloc/s01/*.c"
])

# Step 2: Analysis
subprocess.run([
    PVS, "analyze",
    "--output-file", f"{OUTPUT_DIR}/pvs_report.log",
    "--rules-config", "pvs_rules.cfg",
    "--exclude-path", "testcasesupport"
])

# Step 3: Convert to SARIF
subprocess.run([
    CONVERTER,
    "-t", "sarif",
    "-o", f"{OUTPUT_DIR}/pvs_studio_results.sarif",
    f"{OUTPUT_DIR}/pvs_report.log"
])

Rule Configuration (pvs_rules.cfg)

[CWE-484 Omitted Break in Switch]
; V796 - A case without a break/return/goto/continue
V796=true
; V797 - The 'default' case is not the last one in the switch
V797=true

[CWE-789 Uncontrolled Memory Allocation]
; V630 - The 'malloc' function allocates memory for an object
;        whose size is specified as 0
V630=true
; V631 - The size of the allocated memory is not a multiple of
;        the element size
V631=true
; V632 - Suspicious use of 'realloc'
V632=true
; V769 - The pointer in the expression equals nullptr
V769=true

PVS-Studio enables all of its diagnostics by default: more than 700 of them. Without this configuration, it would produce thousands of warnings across the entire Juliet suite, including irrelevant V001-V799 diagnostics, which would make the results incomparable with other tools. The configuration is needed to focus only on the target CWE categories.

3.4 CodeQL

Scan Command

# Step 1: Create the database
codeql database create codeql-db \
  --language=cpp \
  --command="codeql_build.bat" \
  --source-root="C/testcases"

# Step 2: Analyze with standard queries
codeql database analyze codeql-db \
  cpp-security-and-quality.qls \
  --format=sarif-latest \
  --output=codeql_results_full.sarif

# Step 3: Analyze with custom queries
codeql database analyze codeql-db \
  custom_queries/ \
  --format=sarif-latest \
  --output=codeql_custom_results.sarif

What the Script Does

# codeql_build.bat - compile test cases
set GCC=C:\msys64\ucrt64\bin\gcc.exe
set SRC=%~dp0C\testcases
rem SUPPORT was not set in this excerpt; the path below is assumed from the PVS-Studio script
set SUPPORT=%~dp0C\testcasesupport

for /r "%SRC%\CWE484_Omitted_Break_Statement_in_Switch" %%f in (*.c) do (
    "%GCC%" -c -w -I"%SRC%" -I"%SUPPORT%" "%%f" -o "%%f.o"
)
for /r "%SRC%\CWE789_Uncontrolled_Mem_Alloc\s01" %%f in (*.c) do (
    "%GCC%" -c -w -I"%SRC%" -I"%SUPPORT%" "%%f" -o "%%f.o"
)

Findings by Rule

Rule Findings
All standard queries (182) 0
CWE484_MissingBreak.ql (custom) 0
CWE789_UncontrolledAlloc.ql (custom) 0

Custom Queries

CWE-484: detecting fall-through in switch statements:

/**
 * @name Missing break in switch case
 * @id cpp/cwe484-missing-break
 * @kind problem
 * @problem.severity warning
 * @tags security cwe-484
 */
import cpp

from SwitchCase sc
where
  not exists(BreakStmt bs | bs.getEnclosingStmt*() = sc) and
  not exists(ReturnStmt rs | rs.getEnclosingStmt*() = sc) and
  not exists(GotoStmt gs | gs.getEnclosingStmt*() = sc) and
  exists(sc.getNextSwitchCase())
select sc, "CWE-484: Missing break statement - falls through to next case"

CWE-789: taint analysis for uncontrolled malloc size:

/**
 * @name Uncontrolled memory allocation
 * @id cpp/cwe789-uncontrolled-alloc
 * @kind path-problem
 * @problem.severity error
 * @tags security cwe-789
 */
import cpp
import semmle.code.cpp.dataflow.TaintTracking

class JulietSource extends DataFlow::Node {
  JulietSource() {
    exists(FunctionCall fc |
      fc.getTarget().getName() in
        ["fgets", "fscanf", "recv", "recvfrom", "strtoul", "atoi", "rand"] and
      this.asExpr() = fc
    )
  }
}

class MallocSink extends DataFlow::Node {
  MallocSink() {
    exists(FunctionCall fc |
      fc.getTarget().getName() = "malloc" and
      this.asExpr() = fc.getArgument(0)
    )
  }
}

from JulietSource source, MallocSink sink
where TaintTracking::localTaint(source, sink)
select sink, source, sink,
  "CWE-789: Uncontrolled allocation size from $@", source, "external input"

Rationale for writing custom queries:

The standard CodeQL query suite (cpp-security-and-quality.qls) does not cover the target CWE categories:

CWE-484: the standard suite does not include a query for switch fall-through. CodeQL treats this as a code quality issue rather than a security vulnerability and does not include it in the security package.

CWE-789: the standard TaintedAllocationSize.ql query exists, but it is configured for a narrow set of sources. Juliet uses specific sources such as rand(), fgets(), and strtoul(), which are not included in the default taint source configuration.

3.5 Joern

Scan Command

python run_joern_analysis.py

What the Script Does

// Import test cases into CPG
joern.importCode("C/testcases", projectName="juliet_cpg")

// CWE-484 query: switch without break
val cwe484 = cpg.controlStructure
  .controlStructureType("SWITCH")
  .flatMap { sw =>
    val cases = sw.astChildren.isControlStructure
      .controlStructureType("CASE|DEFAULT").l
      .sortBy(_.lineNumber.getOrElse(0))
    cases.zipWithIndex.flatMap { case (c, i) =>
      val hasExit = c.ast.isControlStructure
        .controlStructureType("BREAK|RETURN|GOTO|CONTINUE").nonEmpty
      if (!hasExit && i < cases.length - 1) Some(c) else None
    }
  }.l

// CWE-789 query: malloc with an external source
val cwe789 = cpg.call.name("malloc|calloc|realloc")
  .filter { call =>
    call.method.ast.isCall
      .name("fgets|fscanf|recv|recvfrom|strtoul|atoi|scanf|read")
      .nonEmpty
  }.l

Findings by Rule

Rule Findings
CWE789-uncontrolled-memory-allocation 472
CWE484-omitted-break-in-switch 0

4. Analysis Results by Analyzer

4.1 OpenGrep

Metric Value
Total findings 48 774
TP 3 721
FP 40 957
TN 85 893
FN 23 969
CWE dirs with TP 26 / 118
Precision 0.0833
Recall 0.1344
Specificity 0.6771
F1 0.1028

4.2 Semgrep

Metric Value
Total findings 14 310
TP 503
FP 12 454
TN 112 077
FN 27 187
CWE dirs with TP 13 / 118
Precision 0.0388
Recall 0.0182
Specificity 0.9000
F1 0.0247

4.3 PVS-Studio

Metric Value
Total findings 35 525
TP 1 372
FP 12 151
TN 716 719
FN 12 151
Precision 0.101457
Recall 0.101457
Specificity 0.983329
F1 0.101457

4.4 CodeQL

Metric Value
Total findings 0
TP 0
FP 0
TN 124 531
FN 32 130
CWE dirs with TP 0 / 118
Precision 0.0000
Recall 0.0000
Specificity 1.0000
F1 0.0000
Technical Reasons for the Zero Result

Juliet conditional compilation hides the vulnerable code.

Juliet intentionally wraps all vulnerable code in preprocessor directives:

#ifndef OMITBAD
void CWE484_..._bad() {
    switch (x) {
    case 0:
        printLine("0");   // vulnerability: no break
    case 1: ...
    }
}
#endif

4.5 Joern

Metric Value
Total findings 472
TP 176
FP 294
TN 124 237
FN 31 954
CWE dirs with TP 1 / 118
Precision 0.3745
Recall 0.0055
Specificity 0.9976
F1 0.0108

Sources available at GitHub: LINK

Conclusion

Precision

prec

Precision shows how many reported findings were actually relevant vulnerabilities. In this study, high precision usually means that the analyzer was conservative: it reported fewer issues, but a larger share of them matched the expected vulnerability points.

This is visible in several NIST Juliet results. For example, Semgrep on Juliet Java reaches perfect precision in the local scoring, but this does not mean it is the best analyzer overall: it found only a very small subset of the actual vulnerable points. CodeQL and PVS-Studio also show relatively strong precision on NIST Juliet C# because their findings are more targeted and many out-of-scope findings are excluded by CWE-aware scoring. In contrast, OpenGrep and Semgrep on Juliet C/C++ have low precision because broad pattern-based rules produce many findings that do not match the exact expected CWE locations. OWASP Benchmark shows more balanced precision because its test cases are web-oriented and align better with common SAST rules.

Therefore, precision alone is not enough to judge tool quality. A tool may be precise because it reports only the easiest or most narrowly defined vulnerabilities.

Recall

rec

Recall is the main weak point observed across the study. It measures how many known vulnerable points were actually detected. On OWASP Benchmark and IAMeter, several tools reach high recall because the datasets are smaller or closer to common web vulnerability patterns. For example, OWASP Java and IAMeter Java/Go are much easier for rule-based tools to cover.

NIST Juliet changes the picture. Juliet contains many CWE categories, synthetic variants, legacy patterns, conditional compilation, and language-specific edge cases. Most analyzers detect only a small fraction of these cases. This is especially visible for NIST Juliet Java and C/C++, where recall is low even when precision or specificity is high. The reason is not only tool weakness: some tools do not support the required runtime well, some default rule packs do not cover the target CWE categories, and some vulnerabilities are encoded in ways that are difficult for default SAST configurations to recognize.

The practical implication is important: low recall means that a clean SAST report cannot be treated as evidence that the code is safe. It may only mean that the enabled rules did not cover the benchmark’s vulnerability patterns.

F1-Score

f1

F1-score balances precision and recall, so it is the most useful single metric in this study. It penalizes both noisy tools and overly conservative tools. The F1 tables show that the best results appear mostly on OWASP Benchmark and IAMeter, where some tools achieve a reasonable balance between finding vulnerabilities and avoiding false positives.

On NIST Juliet, F1 drops sharply for most analyzers. This happens for two different reasons: some tools have acceptable precision but very low recall, while others find more issues but introduce many false positives. Both cases lead to weak F1. For example, high precision with near-zero recall still results in a low F1-score, because the analyzer misses most known vulnerabilities. Conversely, broad pattern matching may improve recall slightly, but false positives reduce precision and keep F1 low.

For this reason, F1 is the clearest summary of the trade-off observed in the research: SAST tools can be useful, but their default configurations rarely provide both broad coverage and high confidence across heterogeneous benchmarks.

Specificity and Accuracy

ac

The table represents specificity: the ability to correctly ignore safe code. Specificity is often high on NIST Juliet because the number of safe or non-triggered points is large and many tools report few findings. This makes specificity useful, but potentially misleading if interpreted alone.

For example, CodeQL on Juliet C/C++ has perfect specificity because it reports no findings, but this also produces zero recall and zero F1. In that case, high specificity does not mean strong vulnerability detection; it means the tool did not raise false alarms while also missing all known vulnerabilities. Similarly, high specificity in conservative configurations should be read together with recall and F1.

Thus, specificity is most useful for estimating noise and false-positive behavior, not for measuring vulnerability coverage. A practical SAST evaluation must consider specificity together with precision, recall, and F1.

Cross-Benchmark Observations

OWASP Benchmark produced the strongest overall results. Its web vulnerability categories are closer to the default rule sets of tools such as Semgrep, OpenGrep, CodeQL, SonarQube, PVS-Studio, and Joern. As a result, recall and F1 are generally higher there than on Juliet.

IAMeter is small and easier to inspect manually. Semgrep, OpenGrep, and CodeQL performed well on IAMeter Java/Go for the expected CWE categories, while several other analyzers found little or nothing. IAMeter PHP remained difficult for most tools in this configuration, which shows how language support and rule availability directly affect the result.

NIST Juliet is the harshest benchmark in this study. It contains a much wider set of CWE categories and many synthetic variants. It is useful for stress-testing coverage, but it also exposes limitations in default rules, language support, build requirements, and matching methodology. The poor recall on Juliet does not mean SAST tools are useless; it means default SAST configurations do not provide comprehensive CWE coverage on large synthetic suites.

Overall, none of the evaluated SAST tools demonstrated consistently high performance across all benchmarks, languages, and CWE categories. The results depend strongly on the programming language, benchmark design, enabled rule set, tool support for the target runtime, and the way findings are matched against ground truth. The same analyzer can look strong on one benchmark and weak on another, so the results should be interpreted as benchmark-specific evidence rather than as a universal ranking of tools.

Limitations

Several limitations affect the interpretation of this research:

The final conclusion is that SAST tools are valuable as part of a security workflow, but they should not be treated as complete vulnerability detectors. Their output depends on rules, language support, build context, and benchmark structure. In practice, the strongest approach is to combine several analyzers, tune rules for the target language and CWE classes, and validate findings with a scoring method that separates precision, recall, specificity, and F1 rather than relying on a single metric.