How Can Open-Source LLMs Be Used in CTI?
The adoption of Language Models (LLMs) has become prevalent in various applications, including Cyber Threat Intelligence (CTI). As one might anticipate, these sophisticated models are frequently employed to provide a comprehensive understanding of textual data, aiding analysts in extracting valuable insights from a vast sea of information. However, incorporating LLMs into business processes raises valid concerns, encompassing cost implications, scalability issues, and compliance considerations. To address these challenges effectively, an alternative solution comes to the forefront: Open-Source LLMs.
While the spotlight often shines on proprietary models like the GPT series by OpenAI, recent developments in open-source models have showcased remarkable performance. Embracing an open-source LLM grants the flexibility to integrate it seamlessly into existing infrastructure, allowing for a tailored and controlled deployment. The advantages also extend to mitigating cost, scalability, and the compliance issues we just mentioned – aspects we will delve into as we conclude this blog post.
Embarking on our exploration of open-source LLMs, let’s first unravel their utility in CTI practices. How do these models contribute to the field, and what specific purposes do they serve in the intricate landscape of Cyber Threat Intelligence?
How Can LLMs Facilitate CTI Practices?
LLMs extend their capabilities to the field of Cyber Threat Intelligence, offering more than routine responses. Let’s explore specific use cases where LLMs prove valuable in CTI:
- Entity Extraction: Identifying and extracting entities (e.g., IP addresses, domain names, email addresses) from unstructured text data. This helps in rapidly detecting and cataloging potential threats.
- Summarization: Generating concise and informative summaries of lengthy documents or articles related to cyber threats. This aids analysts in quickly grasping the key information and insights without having to read through extensive text.
- Report Generation: Automating the creation of threat intelligence reports by generating human-readable narratives based on the analysis of various data sources. LLMs can assist in synthesizing information and presenting it in a coherent and understandable format.
- Vulnerability Analysis: Analyzing security advisories, patch notes, and other technical documents to extract information about vulnerabilities and potential exploits.
- Creating a Knowledge Base: Enriching a cybersecurity knowledge base by automatically updating and expanding it with the latest threat intelligence information available online.
- Sentiment Analysis: Analyzing the sentiment expressed in forums, social media, or other online platforms to gauge the reactions of threat actors. This can provide insights into the urgency and severity of potential threats.
- Anomaly Detection: Employing LLMs to identify anomalies in system logs, network traffic, or other data sources. This helps in detecting unusual patterns that may indicate a security breach or potential threat.
- Pattern Recognition: Recognizing patterns in historical threat data to identify similarities and trends. This can help in predicting potential future threats based on past incidents.
- Phishing Detection: Analyzing emails, messages, or other communication channels for signs of phishing attempts. LLMs can identify suspicious language patterns and content that may indicate a phishing campaign.
- Language Translation: Translating threat intelligence information from different languages to facilitate collaboration and information sharing on a global scale.
The list encompasses a substantial array of tasks that LLMs can fulfill within the realm of Cyber Threat Intelligence. While LLMs generally possess capabilities aligned with the key purposes mentioned above, evaluating their performance in each specific task becomes a major consideration before deciding to adopt an LLM in a business context. The effectiveness of these models in addressing distinct CTI challenges can significantly influence their suitability for particular applications.
Benchmarking Open-Source LLMs for CTI Use Cases
Choosing the right LLM for your organization involves a meticulous evaluation of each model’s capabilities and constraints. Benchmarking, by allowing organizations to test LLMs in real-life scenarios, helps assess their suitability for specific use cases. The diversity of benchmark tests spans various aspects, including grammar, context comprehension, data handling, security of generated code, and more. Given the absence of a one-fits-all, universally perfect LLM, benchmarking becomes a critical step in finding an optimal fit for organizational needs.
That is why, for CTI purposes, we will give some benchmarking examples related to code generation. Code generation can be useful in CTI for purposes such as automating certain intelligence-gathering or analyzing processes, testing out an exploit for a vulnerability, as well as in recognizing security gaps in SDLC (Software Development Lifecycle) pipelines. Let’s examine how well a popular open-source model, Llama 2, performs in code generation tests.
CyberSecEval, a benchmark introduced by Meta, focuses on evaluating LLMs for critical cybersecurity concerns, like the potential for AI-generated code to deviate from established security best practices and the introduction of exploitable vulnerabilities.
The CyberSecEval benchmark evaluates seven models from the Llama2, codeLlama, and OpenAI GPT large language model families; since we took Llama 2 as an example, it is crucial we heed to the results of this benchmark. According to the research paper, which we discussed in a previous blog post, insecure coding suggestions were prevalent across all models, with LLMs generating vulnerable code 30% of the time over CyberSecEval’s test cases. This means that approximately, every 1 of 3 AI-Generated code is vulnerable.
In contrast to the general code generation results, Llama 2 models outperformed Code Llama models and GPT-3.5-turbo in resisting compliance with cyberattacks, as illustrated in the chart below:
As a benchmark to test functional correctness, HumanEval can also evaluate the multilingual ability of code generation models. HumanEval is an evaluation harness for the HumanEval problem solving dataset, which measures functional correctness for synthesizing programs from docstrings. It consists of 164 original programming problems, assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions
In a leaderboard of LLMs benchmarked for code generation practices by HumanEval, Llama 2 holds the 32nd position. Models like GPT-4, CodeLlama, and Gemini claim higher ranks, suggesting better performance in code generation than Llama 2, according to HumanEval benchmark.
Exploring Some of the Top Open-Source LLMs
As we are introducing the usefulness of open-source LLMs in the field of Cyber Threat Intelligence, understanding the landscape of available models becomes significant. Let’s spotlight notable open-source LLMs – Mixtral 8x7B, Vicuna, Falcon, and WizardLM – each contributing uniquely to the diverse applications which can be useful in CTI practices.
Mistral AI’s Mixtral 8x7B stands as a revolutionary open-weight model surpassing GPT-3.5 performance. Its Mixture of Experts (MoE) architecture ensures exceptional speed, making it suitable for chatbot use-cases when run on 2x A100s. With approximately 56 billion parameters, Mixtral adeptly handles a 32k token context, supports multiple languages, and excels in code generation. Finetuning transforms it into an instruction-following model with a notable MT-Bench score of 8.3.
Vicuna, an open-source chatbot by LMSYS (Large Model Systems Organization – an organization that develops large models and systems that are openly accessible, and scalable), fuses together the strengths of LLaMA and Alpaca. Derived from 125K user conversations, it achieves an impressive 90% ChatGPT quality, matching GPT-4. Tailored for researchers and hobbyists, Vicuna comes in two sizes—7 billion and 13 billion parameters. While akin to “GPT-3.25,” it outperforms GPT-3 in power and effectiveness, backed by the robustness of LLaMA.
Developed by the Technology Innovation Institute, the Falcon LLM redefines AI language processing. With models like Falcon-40B and Falcon-7B, it delivers innovative capabilities for diverse applications. Trained on RefinedWeb, the open-source Falcon LLM features multi-query attention, ensuring scalability. Its Apache 2.0 license encourages widespread usage, setting it apart with custom tooling and enhancements like rotary positional embeddings and multi-query attention.
WizardLM distinguishes itself in tests against GPT-4 in 24 skills, outperforming even in technical domains and skills like academic writing, chemistry, and physics. This project focuses on enhancing LLMs by utilizing AI-evolved instructions for training, providing a practical option for more challenging tasks. While not claiming absolute superiority over ChatGPT, WizardLM’s preference in certain scenarios suggests the potential of AI-evolved instructions in advancing LLMs.
And an addition to the list as a rising model, Phi-2:
Phi-2, a novel, yet compact and capable Transformer model with 2.7 billion parameters, is trained on the same data sources as Phi-1.5 and enriched with a new source. Despite the lack of fine-tuning through reinforcement learning based on human feedback, it is a capable open-source model designed to help the research community address critical safety challenges, such as reducing toxicity, identifying societal biases, improving controllability, and more. Phi-2 outperforms models with fewer than 13 billion parameters in tests of common sense, language comprehension, and logical reasoning.
Comparing the Capabilities of Open-Source LLMs
In our comparative analysis of two successful open-source LLMs, Mixtral 8x7b and WizardLM, we initiated the evaluation with two warm-up prompts, asking these LLMs to create a code for any cybersecurity task, and then to create a fictional phishing scenario.
After that, we tasked them with creating a Python script for a specific cybersecurity task – a file integrity check. WizardLM (70b) adeptly followed our instructions, guiding us through the process. In contrast, Mixtral 8x7b (instruct-v0.1) presented an alternative script, focusing on identifying malicious IP addresses in a log file rather than the requested file integrity check.
Moving forward, we directed both LLMs to generate a short phishing email targeting a ‘real’ company.
Although both complied and crafted convincing narratives, WizardLM’s response demonstrated a nuanced touch, introducing a sense of urgency by emphasizing the need for the victim to update information “within the next 24 hours” or face certain consequences.
Our intention was to explore the human-like qualities in handling malicious scenarios, revealing both LLMs’ proficiency in recognizing patterns within a phishing email.
While testing their ability to identify TTPs in a fabricated cyberattack scenario, both named MITRE ATT&CK TTPs, as expected. However, WizardLM provided more detailed information and the TTPs’ specific identifiers. It is worth noting that WizardLM, while more detailed than Mixtral 8x7b, exhibited an error, falsely associating the threat actors’ ransom demand with Privilege Escalation.
In another example, we tasked these LLMs with performing entity extraction from a hacker forum post, related to a data leak concerning Target. We provided a very recent data leak post, with specific items we sought from such posts, including the threat actor, leak type, data type, and more. The following are the results.
You can run your own small benchmark tests as we did, leveraging platforms like LMSYS, evaluating LLMs based on your specific needs. The comparison we provided in this section highlighted the strengths and nuances of Mixtral 8x7b and WizardLM, offering insights into their performance across some cybersecurity-related prompts.
Turning our attention to the broader landscape of open-source LLMs, we have introduced some of the best performers, which could also be helpful in the CTI domain, emphasizing their value in enhancing threat intelligence practices.
The advantages of opting for open-source solutions, notably cost, scalability, and compliance, play a major role in their appeal. While proprietary LLMs often come with costs per request, open-source alternatives are typically free, enabling more cost-effective implementations. Scalability considerations are also intertwined with financial implications, with open-source solutions potentially providing greater flexibility.
From a compliance standpoint, open-source LLMs offer a safer alternative, especially when dealing with sensitive data like medical records, Personally Identifiable Information (PII), or client data. The risk of proprietary LLMs potentially incorporating your data into training datasets or facing leaks raises concerns about data privacy and security. Running open-source LLMs on your own infrastructure ensures a higher level of control and compliance.
However, it is essential to acknowledge that, despite the advantages, many open-source LLM projects may not be as advanced as their proprietary counterparts. Proprietary LLMs often boast superior capabilities and resources for addressing a diverse range of prompts. It is required to weigh these considerations against the specific needs and constraints of your CTI practices.
Additionally, one notable point of discretion lies in the commercial usage of open-source LLMs. While these models are freely available for non-commercial purposes, businesses intending to utilize them for commercial applications may need to acquire a commercial license. This distinction underscores the need for careful consideration of licensing terms and adherence to legal and ethical guidelines when integrating open-source LLMs into commercial environments.