PhiloCyber, by Richie Prieto

Can LLMs Find and Fix Vulnerable Software?


Hello, how are you today? I hope you're doing great over there!! It's been a couple of weeks since I wrote my last post; many things happened in between... The good news is that I finally found the time to connect with one of the things that I love the most, writing.

Today, I bring you something new on the blog, which I hope will eventually become more common. We will be looking together at an academic paper about security code review, and at how AI, or rather Large Language Models (LLMs), perform at detecting and remediating vulnerabilities across 8 different programming languages compared to tools like Fortify and Snyk.

It is an incredible piece of research titled "Can Large Language Models Find and Fix Vulnerable Software?", written by David Noever in 2023.

A brief description of the research, in the author's own words:

In this study, we evaluated the capability of Large Language Models (LLMs), particularly OpenAI's GPT-4, in detecting software vulnerabilities, comparing their performance against traditional static code analyzers like Snyk and Fortify. Our analysis covered numerous repositories, including those from NASA and the Department of Defense. GPT-4 identified approximately four times the vulnerabilities of its counterparts. Furthermore, it provided viable fixes for each vulnerability, demonstrating a low rate of false positives. Our tests encompassed 129 code samples across eight programming languages, revealing the highest number of vulnerabilities in PHP and JavaScript. GPT-4's code corrections led to a 90% reduction in vulnerabilities, requiring only an 11% increase in code lines. A critical insight was LLMs' ability to self-audit, suggesting fixes for their identified vulnerabilities and underscoring their precision. Future research should explore system-level vulnerabilities and integrate multiple static code analyzers for a holistic perspective on LLMs' potential.

So, in the ever-improving field of software development, security, as usual, is a critical concern. Identifying and changing weak code practices during the development process is essential to maintaining healthy security standards. This research paper presents a comparative study of LLMs and traditional third-party software tools in the task of static code analysis.

We explore how AI, specifically LLMs, can not only match but even outperform existing vendors in the market for detecting and fixing security issues.

The findings demonstrate the efficacy of AI in enhancing code security and introduce a pioneering tool that aligns with the growing need for versatile, accessible, and robust security solutions in software engineering. This paper aims to provide a comprehensive understanding of the potential of AI in improving software security, promoting safer coding practices.


Introduction

Developing software has always been considered a complex endeavor, even if it's a personal project, like creating our own tool at work to automate a certain task that we don't want to do routinely.

But imagine companies whose main business is creating software, seeking every day to provide more and better services... companies that ultimately live off creating quality software and maintaining and improving creative ideas.

The software development processes in such companies are even more complex. They involve many people and different teams, whether internal or external to the company, and many back-and-forths before the final release to production. As part of their Continuous Integration and Continuous Delivery (CI/CD) process, many companies add numerous static code analyzers, also known as SAST (Static Application Security Testing) tools, to their development pipelines as if it were an enforced standard.

We have GitHub, Snyk, Fortify, Sonar, and many others that detect errors and potential vulnerabilities even before any human eye sees that code to approve or reject the Pull Request. All this is an attempt to develop quality software faster, cheaper, and more securely than ever before.

Now imagine adding LLMs to your pipeline, capable of understanding the code in a different way and performing analysis in that same pipeline where other tools are being run to evaluate code... it sounds incredible, doesn't it?

Well, this research paper is about exactly that... It's about rethinking how we can improve our secure development cycle by adding LLMs, showing how these models, even in their early stages, are already outperforming well-known and mature tools at detecting vulnerable software.
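
To picture what that could look like in practice, here is a minimal sketch (my own illustration, not something from the paper) of an LLM review step that could run in a CI pipeline: it collects the files changed in a pull request and hands each one to a review helper. The `review_with_llm` function is a hypothetical placeholder here; the Methodology section below shows what the actual model call could look like.

```python
# Minimal sketch of an LLM-based review step for a CI pipeline.
# This is an illustration of the idea, not the paper's setup; the
# review_with_llm helper below is a hypothetical placeholder.
import subprocess


def changed_files(base: str = "origin/main") -> list[str]:
    """Return the paths of files changed relative to the base branch."""
    result = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    )
    return [path for path in result.stdout.splitlines() if path.strip()]


def review_with_llm(path: str, source: str) -> str:
    """Hypothetical hook: send `source` to an LLM and return its findings."""
    return f"(no findings for {path} in this stub)"


if __name__ == "__main__":
    for path in changed_files():
        with open(path, encoding="utf-8") as handle:
            source = handle.read()
        print(review_with_llm(path, source))
```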

The growing complexity of software systems requires advanced methods to ensure their security. Traditional static code analyzers, like Fortify or Snyk, have been fundamental in identifying software vulnerabilities, but they can sometimes overlook very subtle issues, or analyze a Pull Request (PR) in isolation instead of taking into account the context of the entire application it belongs to. And if LLMs are already outperforming current tools even at this early stage of their development (as of 2023), imagine what could happen in a few years with LLMs growing at the current pace or even faster... dream about it!

I truly believe that the paradigm will change. The human eye will always be there for those "fun" manual code reviews, but I really think that AI will accelerate and improve these processes very quickly, creating a more agile, secure, and higher-quality system than we've ever seen in the industry.


Methodology

This study evaluates the capability of several LLMs (with a special focus on OpenAI's GPT-4) in detecting software vulnerabilities, comparing their performance with traditional static code analyzers. Numerous repositories were reviewed, including those from NASA and the Department of Defense, using both LLMs and code analyzers to contrast their effectiveness.

The prompt used by the researcher was really simple, but highly effective:

Act as the world's greatest static code analyzer for all major programming languages. I will give you a code snippet, and you will analyze the code and rewrite it, removing any identified vulnerabilities. Do not explain, just return the corrected code and format alone.

The researcher tested both well-known static analysis tools and different OpenAI LLMs, Ada, Curie, DaVinci (GPT-3 and GPT-3 Turbo), and finally GPT-4, on a set of 128 vulnerable code snippets covering 33 categories of vulnerabilities across the 8 most-used programming languages (C, Ruby, PHP, Java, JavaScript, C#, Go, and Python).
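
To make the setup concrete, here is a rough sketch of how that prompt could be reproduced against a single snippet using the OpenAI Python SDK. The SDK call shape, the model identifier, and the vulnerable example are my own assumptions for illustration; they are not taken from the paper.

```python
# Sketch: send one deliberately vulnerable snippet to a model using the
# prompt quoted above. Requires the `openai` package (v1+) and an
# OPENAI_API_KEY in the environment; the model name is just an assumption.
from openai import OpenAI

PROMPT = (
    "Act as the world's greatest static code analyzer for all major "
    "programming languages. I will give you a code snippet, and you will "
    "analyze the code and rewrite it, removing any identified "
    "vulnerabilities. Do not explain, just return the corrected code and "
    "format alone."
)

# A small, intentionally vulnerable example: SQL built via string formatting.
SNIPPET = """
def get_user(cursor, username):
    cursor.execute("SELECT * FROM users WHERE name = '%s'" % username)
    return cursor.fetchone()
"""

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",  # the paper's strongest model; any available model works
    messages=[
        {"role": "system", "content": PROMPT},
        {"role": "user", "content": SNIPPET},
    ],
)
print(response.choices[0].message.content)  # the rewritten, hopefully safer, code
```

From there, the comparison essentially boils down to counting how many of the known vulnerabilities each tool or model flags, and whether the rewritten code still contains them.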


Results

GPT-4 identified approximately four times more vulnerabilities than its counterparts. Additionally, it provided viable corrections for each vulnerability, demonstrating a low false-positive rate. GPT-4's code fixes led to a 90% reduction in vulnerabilities, with only an 11% increase in lines of code.

The results were crazy:

| Vulnerability Category | GPT-4 Detected Vulnerabilities | Snyk Detected Vulnerabilities |
| --- | --- | --- |
| Path Traversal | 46 | 16 |
| File Inclusion | 40 | 12 |
| Command Injection | 34 | 13 |
| SQL Injection | 30 | 6 |
| Unsafe Deserialization | 25 | 2 |
| Insecure File Uploads | 30 | 0 |
| PHP Object Injection | 18 | 0 |
| Cross-site Scripting (XSS) | 17 | 11 |
| Buffer Overflow | 16 | 0 |
| Denial of Service | 14 | 5 |
| Server Side Template Injection | 11 | 0 |
| Connection String Injection | 11 | 0 |
| XML External Entity (XXE) Injection | 11 | 3 |
| PostMessage Security | 10 | 0 |
| Code Injection | 9 | 1 |
| LDAP Injection | 9 | 0 |
| Sensitive Data Exposure | 6 | 1 |
| Open Redirect | 6 | 1 |
| … (snip) | … | … |
| Grand Total | 393 | 98 |

Table: Comparison by Vulnerability Category for the GPT-4 LLM vs Snyk

The research breaks the results down by vulnerability type, showing how the output varies between the tool and the most powerful LLM in the study. I invite you to read the full results, which, within the world of academic research, are an easy read.

Figure: Comparison of total vulnerabilities found by GPT-4 (green) vs Snyk (red)

For example, in this other table we can see how many of the found vulnerabilities were actually fixed. There are some apparent errors in the data, though: for File Inclusion, XSS, and Command Injection there are more vulnerabilities fixed than identified, which doesn't make much sense, to be honest.

| Vulnerability Category | GPT-4 Vulnerabilities | GPT-4 Fixes |
| --- | --- | --- |
| Path Traversal | 46 | 46 |
| File Inclusion | 40 | 45 |
| Command Injection | 34 | 43 |
| SQL Injection | 30 | 26 |
| Unsafe Deserialization | 25 | 23 |
| Insecure File Uploads | 30 | 30 |
| PHP Object Injection | 18 | 18 |
| Cross-site Scripting (XSS) | 17 | 18 |
| Buffer Overflow | 16 | 14 |
| Denial of Service | 14 | 14 |
| Server Side Template Injection | 11 | 11 |
| Connection String Injection | 11 | 11 |
| XML External Entity (XXE) Injection | 11 | 11 |
| PostMessage Security | 10 | 10 |
| Code Injection | 9 | 10 |
| LDAP Injection | 9 | 9 |
| Sensitive Data Exposure | 6 | 6 |
| Open Redirect | 6 | 7 |
| … (snip) | … | … |

Table: Comparison of Vulnerabilities and Fixes by GPT-4
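
To make the "fixes" column more concrete, here is a hypothetical before/after of the kind of rewrite a model would be expected to produce for a SQL injection finding. It's my own illustration, not a sample from the paper's test set, but it also shows why the fixed code tends to be only slightly longer, in line with the roughly 11% increase in lines of code the paper reports.

```python
# Before: user input concatenated straight into the SQL statement (injectable).
def find_order(cursor, order_id):
    cursor.execute("SELECT * FROM orders WHERE id = " + order_id)
    return cursor.fetchone()


# After: the kind of fix a model (or a human reviewer) would be expected to
# return, using a parameterized query instead of string concatenation.
def find_order_fixed(cursor, order_id):
    query = "SELECT * FROM orders WHERE id = %s"
    cursor.execute(query, (order_id,))
    return cursor.fetchone()
```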

An important finding of this research was the LLMs' ability to self-audit, suggesting corrections for their identified vulnerabilities, which underscores their accuracy. The results suggest that LLMs can effectively complement traditional methods of vulnerability detection, offering a much broader perspective on software security.


Conclusion

The research shows the great potential that LLMs have in detecting and fixing software vulnerabilities. Even at this early stage of the technology, they are already outperforming mature solutions that have been on the market for years! That's really good... but I truly believe the future is going to be even crazier!!

As AI becomes more and more specialized, purpose-built Retrieval-Augmented Generation (RAG) applications for code analysis will be an incredible addition to the secure development pipeline of any company, and hopefully at a lower price.

I hope you enjoyed this reading as much as I enjoyed reading the research paper! See you in the next post!

Thanks for sharing your time with me!! All the best!

Richie

For more information, please check the original research paper: "Can Large Language Models Find and Fix Vulnerable Software?" by David Noever (2023).
