Skip to main content
Kent Academic Repository

An Empirical Evaluation of Large Language Models in Static Code Analysis for PHP Vulnerability Detection

Çetin, Orçun, Ekmekcioglu, Emre, Arief, Budi, Hernandez-Castro, Julio (2024) An Empirical Evaluation of Large Language Models in Static Code Analysis for PHP Vulnerability Detection. Journal of Universal Computer Science, 30 (9). pp. 1163-1183. ISSN 0948-695X. E-ISSN 0948-6968. (doi:10.3897/jucs.134739) (KAR id:107949)

Abstract

Web services play an important role in our daily lives. They are used in a wide range of activities, from online banking and shopping to education, entertainment and social interactions. Therefore, it is essential to ensure that they are kept as secure as possible. However – as is the case with any complex software system – creating a sophisticated software free from any security vulnerabilities is a very challenging task. One method to enhance software security is by employing static code analysis. This technique can be used to identify potential vulnerabilities in the source code before they are exploited by bad actors. This approach has been instrumental in tackling many vulnerabilities, but it is not without limitations. Recent research suggests that static code analysis can benefit from the use of large language models (LLMs). This is a promising line of research, but there are still very few and quite limited studies in the literature on the effectiveness of various LLMs at detecting vulnerabilities in source code. This is the research gap that we aim to address in this work. Our study examined five notable LLM chatbot models: ChatGPT 4, ChatGPT 3.5, Claude, Bard/Gemini1, and Llama-2, assessing their abilities to identify 104 known vulnerabilities spanning the Top-10 categories defined by the Open Worldwide Application Security Project (OWASP). Moreover, we evaluated issues related to these LLMs’ false-positive rates using 97 patched code samples. We specifically focused on PHP vulnerabilities, given its prevalence in web applications. We found that ChatGPT-4 has the highest vulnerability detection rate, with over 61.5% of vulnerabilities found, followed by ChatGPT-3.5 at 50%. Bard has the highest rate of vulnerabilities missed, at 53.8%, and the lowest detection rate, at 13.4%. For all models, there is a significant percentage of vulnerabilities that were classified as partially found, indicating a level of uncertainty or incomplete detection across all tested LLMs. Moreover, we found that ChatGPT-4 and ChatGPT-3.5 are consistently more effective across most categories, compared to other models. Bard and Llama-2 display limited effectiveness in detecting vulnerabilities across the majority of categories listed. Surprisingly, our findings reveal high false positive rates across all LLMs. Even the model demonstrating the best performance (ChatGPT-4) notched a false positive rate of nearly 63%, while several models glaringly under-performed, hitting startlingly bad false positive rates of over 90%. Finally, simultaneously deploying multiple LLMs for static analysis resulted in only a marginal enhancement in the rates of vulnerability detection. We believe these results are generalizable to most other programming languages, and hence far from being limited to PHP only.

Item Type: Article
DOI/Identification number: 10.3897/jucs.134739
Uncontrolled keywords: ChatGPT, Claude, Bard, Gemini, Llama-2, Static code analysis, PHP vulnerabilities, Vulnerability detection, LLM in cybersecurity
Subjects: Q Science > QA Mathematics (inc Computing science)
Divisions: Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing
Funders: University of Kent (https://ror.org/00xkeyj56)
Depositing User: Budi Arief
Date Deposited: 27 Nov 2024 09:09 UTC
Last Modified: 05 Dec 2024 09:45 UTC
Resource URI: https://kar.kent.ac.uk/id/eprint/107949 (The current URI for this page, for reference purposes)

University of Kent Author Information

  • Depositors only (login required):

Total unique views for this document in KAR since July 2020. For more details click on the image.