Can LLMs Find and Fix Vulnerable Software?
Hello, how are you today? I hope you're doing great over there!! It's been a couple of weeks since I wrote my last post; many things happened in between... The good news is that I finally found the time to connect with one of the things that I love the most, writing.
Today, I bring you something new on the blog, which I hope will eventually become more common. We will be looking together at an academic paper about security code review and how AI, or rather Large Language Models (LLMs), performs at detecting and remediating vulnerabilities across 8 different programming languages, compared with tools like Fortify and Snyk.
It is an incredible piece of research titled "Can Large Language Models Find and Fix Vulnerable Software?", written by David Noever in 2023.
A brief description of the research, from the author:
In this study, we evaluated the capability of Large Language Models (LLMs), particularly OpenAI's GPT-4, in detecting software vulnerabilities, comparing their performance against traditional static code analyzers like Snyk and Fortify. Our analysis covered numerous repositories, including those from NASA and the Department of Defense. GPT-4 identified approximately four times as many vulnerabilities as its counterparts. Furthermore, it provided viable fixes for each vulnerability, demonstrating a low rate of false positives. Our tests encompassed 129 code samples across eight programming languages, revealing the highest vulnerabilities in PHP and JavaScript. GPT-4's code corrections led to a 90% reduction in vulnerabilities, requiring only an 11% increase in code lines. A critical insight was LLMs' ability to self-audit, suggesting fixes for their identified vulnerabilities and underscoring their precision. Future research should explore system-level vulnerabilities and integrate multiple static code analyzers for a holistic perspective on LLMs' potential.
So, in the ever-improving field of software development, security, as usual, is a critical concern. Identifying and changing weak coding practices during the development process is essential to maintaining healthy security standards. This research paper presents a comparative study of LLMs and traditional third-party software tools in the task of static code analysis.
We explore how AI, specifically LLMs, can not only match but even outperform existing vendors in the market for detecting and fixing security issues.
The findings demonstrate the efficacy of AI in enhancing code security and introduce a pioneering tool that aligns with the growing need for versatile, accessible, and robust security solutions in software engineering. This paper aims to provide a comprehensive understanding of the potential of AI in improving software security, promoting safer coding practices.
Introduction
Developing software has always been considered a complex endeavour, even when it's a personal project, like building our own tool at work to automate a task we don't want to do by hand every time.
But imagine companies whose main business is creating software, seeking every day to provide more and better services... companies that ultimately live off creating quality software, maintaining it, and improving on creative ideas.
The software development processes in such companies are even more complex. They involve many people and different teams, whether internal or external to the company, and many back-and-forths before the final release to production. As part of the Continuous Integration and Continuous Delivery (CI/CD) process, many companies add static code analyzers, also known as SAST (Static Application Security Testing) tools, to their development pipelines as if it were an enforced standard.
We have GitHub, Snyk, Fortify, Sonar, and many others that detect errors and potential vulnerabilities even before any human eye sees that code to approve or reject the Pull Request. All this is an attempt to develop quality software faster, cheaper, and more securely than ever before.
Now imagine adding LLMs to your pipeline, capable of understanding the code in a different way and performing analysis in that same pipeline where other tools are being run to evaluate code... it sounds incredible, doesn't it?
Well, this research paper is about exactly that... It's about rethinking how we can improve our secure development cycle by adding LLMs, showing how these models, still in their early stages, are already outperforming well-known and mature tools at detecting vulnerable software.
The growing complexity of software systems requires advanced methods to ensure their security. Traditional static code analyzers, like Fortify or Snyk, have been fundamental in identifying software vulnerabilities, but they can sometimes overlook very subtle issues, or analyze a Pull Request (PR) in isolation instead of considering the context of the whole application the PR belongs to. Even in their early stages of development, LLMs are outperforming current tools (as of 2023); imagine what could happen in a few years with LLMs growing at the current pace or even faster... dream about it!
I truly believe that the paradigm will change. The human eye will always be there for those "fun" manual code reviews, but I really think that AI will accelerate and improve processes very quickly, creating a more agile, secure, and higher-quality way of working than we've ever seen in the industry.
Methodology
This study evaluates the capability of several LLMs (with a special focus on OpenAI's GPT-4) in detecting software vulnerabilities, comparing their performance with traditional static code analyzers. Numerous repositories were reviewed, including those from NASA and the Department of Defense, using both LLMs and code analyzers to contrast their effectiveness.
The prompt used by the researcher was really simple, but highly effective:
Act as the world's greatest static code analyzer for all major programming languages. I will give you a code snippet, and you will analyze the code and rewrite it, removing any identified vulnerabilities. Do not explain, just return the corrected code and format alone.
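As a rough idea of how you could try that prompt yourself, here is a minimal Python sketch that sends it, together with a code snippet, to a chat model. The use of the OpenAI Python SDK, the `gpt-4` model name, and the `review_snippet` helper are my own assumptions for illustration; this is not the author's actual test harness.

```python
# Minimal sketch (assumptions: OpenAI Python SDK installed, OPENAI_API_KEY set).
from openai import OpenAI

SYSTEM_PROMPT = (
    "Act as the world's greatest static code analyzer for all major "
    "programming languages. I will give you a code snippet, and you will "
    "analyze the code and rewrite it, removing any identified vulnerabilities. "
    "Do not explain, just return the corrected code and format alone."
)

def review_snippet(snippet: str, model: str = "gpt-4") -> str:
    """Send one code snippet for analysis and return the rewritten code."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": snippet},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # A deliberately unsafe one-liner (command injection) as sample input.
    vulnerable = 'import os\nos.system("ping " + input("host: "))'
    print(review_snippet(vulnerable))
```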
The researcher tested both well-known static analysis tools and different OpenAI LLMs, such as Ada, Curie, and DaVinci (GPT-3 and GPT-3 Turbo), and finally GPT-4, on a set of 128 vulnerable code snippets spanning 33 vulnerability categories and the 8 most used programming languages (C, Ruby, PHP, Java, JavaScript, C#, Go, and Python).
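To give a feel for what such snippets look like, here is a purely illustrative Python example of one of those categories (SQL injection), together with the kind of fix a reviewer, human or LLM, would be expected to produce. It is not taken from the paper's actual test set.

```python
# Illustrative example only -- not one of the study's snippets.
import sqlite3

def find_user_vulnerable(conn: sqlite3.Connection, username: str):
    # VULNERABLE: user input is concatenated straight into the SQL string,
    # so a value like "' OR '1'='1" changes the meaning of the query.
    query = "SELECT id, email FROM users WHERE name = '" + username + "'"
    return conn.execute(query).fetchall()

def find_user_fixed(conn: sqlite3.Connection, username: str):
    # FIXED: a parameterized query keeps the input as data, not SQL.
    query = "SELECT id, email FROM users WHERE name = ?"
    return conn.execute(query, (username,)).fetchall()
```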
Results
GPT-4 identified approximately four times more vulnerabilities than its counterparts. Additionally, it provided viable corrections for each vulnerability, demonstrating a low false-positive rate. GPT-4's code fixes led to a 90% reduction in vulnerabilities, with only an 11% increase in lines of code.
The results were crazy:
| Vulnerability Category | GPT-4 Detected Vulnerabilities | Snyk Detected Vulnerabilities |
|---|---|---|
| Path Traversal | 46 | 16 |
| File Inclusion | 40 | 12 |
| Command Injection | 34 | 13 |
| SQL Injection | 30 | 6 |
| Unsafe Deserialization | 25 | 2 |
| Insecure File Uploads | 30 | 0 |
| PHP Object Injection | 18 | 0 |
| Cross-site Scripting (XSS) | 17 | 11 |
| Buffer Overflow | 16 | 0 |
| Denial Of Service | 14 | 5 |
| Server Side Template Injection | 11 | 0 |
| Connection String Injection | 11 | 0 |
| XML External Entity (XXE) Injection | 11 | 3 |
| PostMessage Security | 10 | 0 |
| Code Injection | 9 | 1 |
| LDAP Injection | 9 | 0 |
| Sensitive Data Exposure | 6 | 1 |
| Open Redirect | 6 | 1 |
| SNIP | SNIP | SNIP |
| Grand Total | 393 | 98 |
Table: Comparison by Vulnerability Category for the GPT-4 LLM vs Snyk
Clearly, the research breaks the results down by vulnerability type, showing how detection varies between the tool and the most powerful LLM in the study. I invite you to read the full results, which, within the world of academic research, are an easy read.
Figure: Comparison of Total Vulnerabilities Found by GPT-4 (green) vs Snyk (red)
For example, in this other table, we can see how many of the detected vulnerabilities were fixed. That said, there seem to be some errors in the data; for instance, File Inclusion, XSS, and Command Injection show more fixes than identified vulnerabilities, which, to be honest, doesn't make much sense.
| Vulnerability Category | GPT-4 Vulnerabilities | GPT-4 Fixes |
|---|---|---|
| Path Traversal | 46 | 46 |
| File Inclusion | 40 | 45 |
| Command Injection | 34 | 43 |
| SQL Injection | 30 | 26 |
| Unsafe Deserialization | 25 | 23 |
| Insecure File Uploads | 30 | 30 |
| PHP Object Injection | 18 | 18 |
| Cross-site Scripting (XSS) | 17 | 18 |
| Buffer Overflow | 16 | 14 |
| Denial Of Service | 14 | 14 |
| Server Side Template Injection | 11 | 11 |
| Connection String Injection | 11 | 11 |
| XML External Entity (XXE) Injection | 11 | 11 |
| PostMessage Security | 10 | 10 |
| Code Injection | 9 | 10 |
| LDAP Injection | 9 | 9 |
| Sensitive Data Exposure | 6 | 6 |
| Open Redirect | 6 | 7 |
| SNIP | SNIP | SNIP |
Table: Comparison of Vulnerabilities and Fixes by GPT-4
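To make the "fixes" column a bit more concrete, here is an illustrative before/after for Path Traversal, the top category in both tables. It is not taken from the study's dataset; it simply shows the shape of fix the corrected output is expected to contain: resolve the requested path and reject anything that escapes the allowed base directory. Note that `Path.is_relative_to` requires Python 3.9+.

```python
# Illustrative path traversal example -- not from the paper's dataset.
from pathlib import Path

BASE_DIR = Path("/var/app/uploads")

def read_file_vulnerable(filename: str) -> bytes:
    # VULNERABLE: a filename like "../../etc/passwd" walks out of BASE_DIR.
    return (BASE_DIR / filename).read_bytes()

def read_file_fixed(filename: str) -> bytes:
    # FIXED: resolve the final path and ensure it stays inside BASE_DIR.
    target = (BASE_DIR / filename).resolve()
    if not target.is_relative_to(BASE_DIR.resolve()):
        raise ValueError("path traversal attempt blocked")
    return target.read_bytes()
```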
An important finding of this research was the LLMs' ability to self-audit, suggesting corrections for their identified vulnerabilities, which underscores their accuracy. The results suggest that LLMs can effectively complement traditional methods of vulnerability detection, offering a much broader perspective on software security.
Conclusion
The research shows the great potential that LLMs have in detecting and fixing software vulnerabilities. Even in these early stages of the technology, they are already outperforming mature solutions that have been on the market for a few years! That's really good... but I truly believe the future is going to be even more crazy!!
When AI becomes a true specialist, purpose-built Retrieval-Augmented Generation (RAG) applications for code analysis will be an incredible addition to the secure development pipeline of any company, and hopefully at a lower price.
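As a thought experiment, here is a toy Python sketch of what such a RAG-style review step could look like: pull the most "relevant" repository files as extra context before building the prompt for the LLM. The word-overlap scoring and the helper names are purely hypothetical simplifications; a real system would use embeddings and a vector store, and the resulting prompt would then be sent to a model as in the earlier sketch.

```python
# Toy RAG-style context retrieval for code review (illustrative only).
from pathlib import Path

def score(query: str, text: str) -> int:
    # Naive relevance score: count shared words/identifiers.
    return len(set(query.split()) & set(text.split()))

def retrieve_context(changed_code: str, repo_root: str, top_k: int = 3) -> str:
    """Pick the top_k most 'relevant' source files as additional context."""
    candidates = []
    for path in Path(repo_root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        candidates.append((score(changed_code, text), path.name, text))
    candidates.sort(reverse=True)  # highest score first
    return "\n\n".join(f"# {name}\n{text}" for _, name, text in candidates[:top_k])

def build_prompt(changed_code: str, repo_root: str) -> str:
    # Combine retrieved context with the changed code for the LLM to analyze.
    context = retrieve_context(changed_code, repo_root)
    return (
        "Related files from this repository:\n" + context +
        "\n\nAnalyze this changed code for vulnerabilities:\n" + changed_code
    )
```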
I hope you enjoyed this reading as much as I enjoyed reading the research paper! See you in the next post!
Thanks for sharing your time with me!! All the best!
Richie
For more information about the research, please check the following resource.