SOCRadar® Cyber Intelligence Inc. | ShadowRay Campaign Exploits Critical Ray Framework Vulnerabilities to Compromise AI Workloads Globally

Apr 26, 2024

Since September 5, 2023, a sophisticated cyber threat named the ‘ShadowRay’ campaign has targeted vulnerabilities in the Ray framework.

The campaign exposes a critical security gap in the Ray framework. Ray is developed by Anyscale and used by companies such as Amazon and OpenAI to orchestrate large-scale AI and Python applications across sectors like education, cryptocurrency, and biopharma.

Given Ray’s popularity (its GitHub repository has over 30,500 stars), the campaign is a significant security event: it has affected hundreds of servers and leaked sensitive data.

Some of the many users of Ray (Source: ray.io)

How Do Attackers Exploit Ray’s Jobs API?

CVE-2023-48022 stems from the absence of authorization in Ray’s Jobs API, a significant security risk that opens the door to abuse by unauthorized users.

The vulnerability is particularly concerning because anyone who can reach the dashboard over the network (HTTP port 8265 by default) can submit arbitrary jobs, and therefore execute arbitrary code, on the remote host without any form of authentication.
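For defenders auditing their own hosts, the reachability condition described above can be checked with a minimal sketch like the one below. It only tests whether a TCP handshake to the dashboard port succeeds from the machine running it; it does not interact with the Jobs API. The function name `is_port_reachable` and its defaults are illustrative assumptions, not part of Ray.

```python
import socket


def is_port_reachable(host: str, port: int = 8265, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, host unreachable, or timed out: not reachable from here.
        return False
```

Run it from outside the trusted network, for example `is_port_reachable("ray-head.internal")` (the hostname is hypothetical). A `True` result means the dashboard port is exposed to that vantage point and should be firewalled or placed behind an authorizing proxy.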

Ray’s official documentation emphasizes the importance of security best practices, stressing that security and isolation measures must be enforced outside of the Ray Cluster. However, it’s clear that without proper authorization mechanisms in place, the risk of unauthorized access remains unmitigated.

Anyscale, the company behind Ray, asserts that users bear the responsibility for ensuring the locality and security of Ray’s execution capabilities. They suggest that the dashboard should either be inaccessible from the internet or restricted to trusted parties only.

However, this approach leaves a critical gap in security, as it relies on the assumption of a safe environment with adequate routing logic such as network isolation, Kubernetes namespaces, firewall rules, or security groups.

Vulnerability card of CVE-2023-48022 (SOCRadar Vulnerability Intelligence)

Many in the industry were unaware of the vulnerabilities in Ray’s Dashboard, especially those associated with its Jobs API. Additionally, the disclosure of these vulnerabilities, including CVE-2023-48022, did not attract widespread attention quickly enough.

For instance, Ray’s official Kubernetes deployment guide and Kuberay’s Kubernetes operator encourage binding the dashboard to 0.0.0.0 (all network interfaces), which can worsen the risk by making it reachable from a wider range of network entities.

Kuberay’s Kubernetes operator guides
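If your clusters are managed with Kuberay, one quick way to find head nodes that bind the dashboard to all interfaces is to inspect each RayCluster manifest. The sketch below assumes the manifest has already been parsed into a Python dict (for example with PyYAML’s `yaml.safe_load`) and that the setting lives under `spec.headGroupSpec.rayStartParams["dashboard-host"]`, as in Kuberay’s sample manifests. The function name is hypothetical.

```python
from typing import Any

# Addresses meaning "listen on every interface" for IPv4 and IPv6.
ALL_INTERFACES = {"0.0.0.0", "::"}


def dashboard_host_exposed(ray_cluster: dict[str, Any]) -> bool:
    """Return True if a parsed RayCluster manifest binds the dashboard to all interfaces."""
    params = (
        ray_cluster.get("spec", {})
        .get("headGroupSpec", {})
        .get("rayStartParams", {})
    )
    host = str(params.get("dashboard-host", "")).strip()
    return host in ALL_INTERFACES
```

A match is not proof of public exposure: inside a pod, binding to all interfaces is often what lets a Kubernetes Service reach the dashboard at all. Whether the port is actually reachable depends on the Service type, ingress, and network policies, so treat a hit as a prompt to review those.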

CVE-2023-48022 carries a critical CVSS score of 9.8. However, the CVE record maintained by MITRE is currently tagged as “disputed”, because the vendor’s position is that Ray is not intended for use outside a strictly controlled network environment.

A note that the CVE is disputed

Exploited Ray Clusters in the Wild: What Sensitive Data Was Compromised?

When hackers breach a Ray production cluster, they gain access to valuable company data and can run their own code without detection, exploiting the limitations of traditional security tools. This has resulted in the leakage of a substantial amount of sensitive information from the compromised servers.

The attackers were able to manipulate AI production workloads, potentially affecting the accuracy and reliability of AI models, and impacting various industries including healthcare, video analytics, and top-tier schools.

The breach of Ray production clusters granted attackers access to production database credentials, enabling them to quietly download entire databases and tamper with or encrypt them with ransomware on certain machines.

Production database credentials

Evidence also suggests that the attackers obtained password hashes from the machines by running a simple command, “cat /etc/shadow”, which appears multiple times in the job history and indicates successful infiltration.

Ray’s job history shows that attackers stole passwords and launched a reverse shell.
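Commands like these leave traces in the job history, so a simple pattern scan over job entrypoints can surface them. In the sketch below, the entrypoint strings are assumed to come from your own cluster’s job history (for instance, the `entrypoint` field of the job records returned by Ray’s `JobSubmissionClient.list_jobs()`). The patterns and names are hypothetical examples drawn from the behavior described in this article, not a vetted detection ruleset.

```python
import re

# Hypothetical indicator patterns based on the behavior described in this campaign.
SUSPICIOUS_PATTERNS = {
    "reads password hashes": re.compile(r"/etc/shadow"),
    "bash reverse shell": re.compile(r"/dev/tcp/"),
    "decodes base64 payload": re.compile(r"base64\s+(?:-d|--decode)"),
    "crypto-miner binary": re.compile(r"xmrig|nbminer", re.IGNORECASE),
}


def flag_entrypoints(entrypoints: list[str]) -> dict[str, list[str]]:
    """Map each suspicious job entrypoint to the labels of the patterns it matches."""
    findings: dict[str, list[str]] = {}
    for cmd in entrypoints:
        labels = [label for label, pattern in SUSPICIOUS_PATTERNS.items() if pattern.search(cmd)]
        if labels:
            findings[cmd] = labels
    return findings
```

Entrypoints with no matches are omitted, so benign training jobs drop out and only the commands worth a closer look remain.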

Oligo researchers also identified numerous private SSH keys, which could be exploited to gain access to additional machines using the same VM image template to expand the computational power for crypto-mining campaigns or establish persistence.

AWS cluster machine credentials, which allow connecting to all of the cluster’s machines over SSH.

OpenAI, HuggingFace, and Stripe tokens were also identified, which could be exploited to exhaust the affected company’s credits, infiltrate accounts and override models, execute unauthorized transactions, and even lead to supply chain attacks.

AI model handling a user-submitted query in real time. The model could be abused by an attacker, who could alter customer requests or responses.
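Because much of the damage came from credentials stored on cluster machines, it is worth scanning your own images, environment files, and job logs for secrets before an attacker does. The sketch below uses approximate regular expressions for the credential types mentioned above. The prefixes (`sk-`, `hf_`, `sk_live_`, `AKIA`) reflect commonly documented key formats, but the exact lengths are assumptions, and a dedicated secret scanner will be more thorough.

```python
import re

# Approximate patterns for the credential types discussed above (assumed formats).
SECRET_PATTERNS = {
    "SSH private key": re.compile(r"-----BEGIN (?:OPENSSH|RSA|EC|DSA) PRIVATE KEY-----"),
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "OpenAI API key": re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),
    "Hugging Face token": re.compile(r"\bhf_[A-Za-z0-9]{30,}"),
    "Stripe live secret key": re.compile(r"\bsk_live_[A-Za-z0-9]{16,}"),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line number, credential type) pairs without echoing the secret values."""
    hits: list[tuple[int, str]] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits
```

Returning only line numbers and types keeps the scanner’s own output from becoming another place where secrets leak.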

Furthermore, attackers could gain access to cloud environments (AWS, GCP, Azure, Lambda Labs) from compromised Ray clusters, many of which ran with elevated privileges. These compromised clusters gave attackers access to sensitive cloud services, potentially exposing complete databases, customer data, codebases, artifacts, and secrets.

Additionally, Kubernetes API access enabled attackers to infect cloud workloads and steal Kubernetes secrets.

Kuberay Operator running with Administrator permissions on the Kubernetes API.

The investigation also revealed Slack tokens; exploitation of these tokens could grant unauthorized access to an organization’s Slack messages and enable attackers to send arbitrary messages, compromising confidentiality and integrity.

The Financial Impact of the ShadowRay Campaign

The compromised machines hold significant financial value, especially considering the scarcity of the GPU models affected. Many of these GPU models are currently unavailable and challenging to acquire through regular channels.

For instance, the A6000 GPUs observed on the compromised machines are out of stock on NVIDIA’s official website. This scarcity amplifies their market value and underscores the substantial financial loss caused by their compromise.

nvidia-smi output from a compromised machine

The Timeline and Scope of Cryptomining Activities

Researchers revealed that the first crypto-miner was initiated on February 21, 2024. However, public web intelligence tools indicated that the associated IP had been accepting connections to the target port since September 5, 2023, suggesting a potential pre-disclosure breach.

It is reported that the scale and timeline of these attacks point to the involvement of a sophisticated hacking group.

Among the crypto-mining activities discovered were instances of XMRig miners, some of which operate in-memory without the need for disk downloads, complicating detection and eradication efforts. Notably, the presence of NBMiner and Java-based Zephyr miners was also identified.
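Fileless miners still have to run as processes. On Linux, the `/proc/<pid>/exe` link of a process launched from an in-memory file descriptor, or from a binary deleted after launch, typically resolves to a `/memfd:...` path or ends in ` (deleted)`. The heuristic sketch below flags such processes. The function names are hypothetical, and legitimate processes (for example, ones still running after a package upgrade replaced their binary) can also match, so treat hits as leads rather than verdicts.

```python
import os


def exe_looks_fileless(exe_target: str) -> bool:
    """Heuristic: True if a /proc/<pid>/exe link target suggests in-memory or deleted binaries."""
    # memfd-backed executables appear as "/memfd:<name> (deleted)";
    # binaries removed from disk after launch carry a " (deleted)" suffix.
    return exe_target.startswith("/memfd:") or exe_target.endswith(" (deleted)")


def scan_processes(proc_root: str = "/proc") -> list[tuple[int, str]]:
    """List (pid, exe target) pairs whose executable looks fileless (Linux procfs only)."""
    if not os.path.isdir(proc_root):
        return []  # no Linux-style procfs on this system
    suspicious: list[tuple[int, str]] = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            target = os.readlink(os.path.join(proc_root, entry, "exe"))
        except OSError:
            continue  # process exited or permission denied
        if exe_looks_fileless(target):
            suspicious.append((int(entry), target))
    return suspicious
```

Running `scan_processes()` as root gives the most complete view, since unprivileged users usually cannot read other users’ `exe` links.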

The Identity of the Attackers

The command lines used in these attacks contain the attacker’s distinct mining-pool username and password, along with the pool server the miner communicates with. By examining the mining pool, researchers located the attacker on its leaderboard, shedding light on the identity and activities of the perpetrators behind these crypto-mining operations.
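Because the pool address and credentials sit on the miner’s command line, they can be extracted as indicators of compromise. The sketch below assumes XMRig-style options (`-o`/`--url`, `-u`/`--user`, `-p`/`--pass`); other miners such as NBMiner use different flags, and the function name is hypothetical.

```python
import shlex

# XMRig-style option names mapped to the indicator they carry (assumed; other miners differ).
FLAG_NAMES = {
    "-o": "pool", "--url": "pool",
    "-u": "user", "--user": "user",
    "-p": "password", "--pass": "password",
}


def extract_pool_iocs(cmdline: str) -> dict[str, str]:
    """Pull pool URL, user, and password from a miner command line."""
    tokens = shlex.split(cmdline)
    iocs: dict[str, str] = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        name, sep, value = tok.partition("=")
        if sep and name in FLAG_NAMES:
            # "--url=stratum+tcp://host:port" form
            iocs[FLAG_NAMES[name]] = value
        elif tok in FLAG_NAMES and i + 1 < len(tokens):
            # "-o host:port" form: the value is the next token
            iocs[FLAG_NAMES[tok]] = tokens[i + 1]
            i += 1  # skip the consumed value
        i += 1
    return iocs
```

For example, `extract_pool_iocs("./xmrig -o pool.example.net:3333 -u WALLET123 -p x")` returns `{"pool": "pool.example.net:3333", "user": "WALLET123", "password": "x"}`; the host and wallet values here are placeholders.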

The attackers have achieved a notable rank of 148th out of 3216 miners within the pool, positioning them within the top 5% of miners participating in the mining pool. This underscores the extent of their exploitation efforts and the substantial impact of their activities within the cryptocurrency mining community.

The attacker has achieved 148th place out of 3216.
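The top-5% figure follows directly from the leaderboard numbers:

```latex
\frac{148}{3216} \approx 0.046 = 4.6\%, \qquad 0.05 \times 3216 = 160.8
```

So a rank of 148 falls inside the top 160 positions of the pool.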

Establishing Persistent Access with Reverse Shells

Several instances of reverse shells were found, granting attackers the ability to execute arbitrary code within the production environment, posing a significant threat to the security and integrity of the affected infrastructure.

The oldest record of a reverse shell on a Ray cluster dates to September 5, 2023

Additionally, with further investigation into the domain oast.fun, it was discovered that the attackers are employing the open-source service Interactsh to evade detection.

Using Interactsh, the attacker receives out-of-band notifications about DNS queries from clients

The domain oast.fun is one of the public servers the project maintains.

The default page on the public server

The attackers utilized free public servers as a means to evade detection. Upon successful execution of a base64 payload via the Jobs API, a DNS query is triggered from the compromised machine to the attacker-controlled free subdomain. This allowed the attackers to promptly receive notifications containing the IP address of the compromised machine.

An attacker-controlled IP address that served the payload
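On the defensive side, DNS lookups of Interactsh public-server domains from production hosts are a useful signal. The sketch below checks queried names against a list of public Interactsh domains. Apart from oast.fun, the entries are assumptions based on the project’s publicly listed default servers and should be verified against its current documentation; the function names are hypothetical.

```python
# Public Interactsh server domains (assumed list; verify against the project's documentation).
INTERACTSH_PUBLIC_DOMAINS = ("oast.fun", "oast.pro", "oast.live", "oast.site", "oast.online", "oast.me")


def is_oast_query(qname: str) -> bool:
    """True if a DNS query name is a public Interactsh domain or a subdomain of one."""
    name = qname.rstrip(".").lower()  # normalize the trailing root dot and letter case
    return any(name == d or name.endswith("." + d) for d in INTERACTSH_PUBLIC_DOMAINS)


def flag_queries(qnames: list[str]) -> list[str]:
    """Filter a batch of DNS query names down to the Interactsh-looking ones."""
    return [q for q in qnames if is_oast_query(q)]
```

Feeding it query names from resolver or DNS-firewall logs gives a short list of hosts that may have executed an out-of-band callback payload.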

In summary, the investigation has unveiled a sophisticated and persistent crypto-mining operation, highlighting the critical importance of implementing robust security measures to counter such intrusions effectively.

Privilege Escalation Tactics and Use of Open-Source Script

The investigation uncovered that the attackers attempted to escalate privileges using sudo, which was not available on the attacked machine. Researchers highlight that the attackers exploited the www[.]akuh[.]net service using an open-source repository.

The open-source script used by the attackers.

VirusTotal showed no red flags for the payload, with a detection rate of 0/59.

VirusTotal results for the payload.

Mitigation Strategies to Improve the Safety of Ray Deployments

Below are some of the best practices to secure your Ray deployments against exploitation:

Start Deployment in Secure Environment: Launch Ray deployment in a safe and secure environment to establish a strong security foundation. Apply firewall rules or security groups to effectively prevent unauthorized access attempts.
Add Authorization to Secure Ray Dashboard Port (8265): Focus on authorization mechanisms for the Ray Dashboard port. Deploy a proxy with an authorization layer to control access to the Ray API and allow only authorized personnel.
Implement a Continuous Monitoring Strategy: Stay vigilant on production environments and AI clusters with continuous monitoring. Traditional code scanning and misconfiguration tools may not suffice, necessitating special monitoring and protection mechanisms to effectively detect and prevent potential attacks.
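The authorization layer recommended above can be as simple as a check that runs before any request is forwarded to the dashboard. The sketch below is a WSGI middleware using only the standard library. The bearer-token scheme, the function name `require_bearer_token`, and the wrapped `app` (which stands in for whatever actually forwards requests to port 8265, such as a reverse proxy) are assumptions; request forwarding itself is not implemented here.

```python
import hmac


def require_bearer_token(app, expected_token: str):
    """Wrap a WSGI app so that only requests carrying the expected bearer token reach it."""
    expected = f"Bearer {expected_token}".encode()

    def middleware(environ, start_response):
        supplied = environ.get("HTTP_AUTHORIZATION", "").encode()
        # compare_digest avoids leaking token contents through timing differences.
        if not hmac.compare_digest(supplied, expected):
            start_response("401 Unauthorized", [
                ("Content-Type", "text/plain"),
                ("WWW-Authenticate", "Bearer"),
            ])
            return [b"unauthorized\n"]
        return app(environ, start_response)  # authorized: hand off to the forwarding app

    return middleware
```

For a local test you could serve the wrapped app with `wsgiref.simple_server.make_server("127.0.0.1", 8080, guarded_app)`. In production, terminate TLS in front of it so the token is never sent in cleartext, and prefer an established proxy or identity-aware gateway over hand-written code.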

To further enhance the security of Ray deployments, avoid binding services to 0.0.0.0 (all network interfaces) to minimize the attack surface. Instead, bind to an IP address associated with a specific network interface or a trusted private VPC/VPN.
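A minimal sketch for sanity-checking a bind address before deployment (for example, the value you pass to `ray start --dashboard-host`) can use Python’s standard `ipaddress` module. The category labels and the function name are illustrative assumptions.

```python
import ipaddress


def classify_bind_address(addr: str) -> str:
    """Roughly classify how widely a listener bound to `addr` is exposed."""
    ip = ipaddress.ip_address(addr)
    # Check "unspecified" first: the stdlib also counts 0.0.0.0 and :: as private.
    if ip.is_unspecified:
        return "all interfaces (avoid)"
    if ip.is_loopback:
        return "loopback only"
    if ip.is_private:
        return "private network"
    return "public address (avoid)"
```

The ordering of the checks matters because `ipaddress` treats 0.0.0.0, ::, and 127.0.0.1 as private, so testing `is_private` first would hide the riskier cases.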

Remediations Against ShadowRay

To secure Ray deployments effectively, prioritize operating within a secure environment by implementing firewall rules and adding authorization to the Ray Dashboard port. Continuously monitor for anomalies, avoid default settings like binding to 0.0.0.0, and leverage tools that improve the security posture of clusters.

You can leverage security tools like SOCRadar’s Vulnerability Intelligence and Attack Surface Management modules to mitigate risks, such as those presented by threats like the ShadowRay campaign. These modules provide continuous monitoring for new threats, vulnerabilities like CVE-2023-48022, and updates related to Ray deployments, facilitating proactive threat response.

Furthermore, SOCRadar’s Digital Risk Protection (DRP) module can help detect misconfigurations, compromised services, and unauthorized access attempts, thereby enhancing the security posture of Ray instances.

To review the details and similar campaigns, you can visit the Campaigns page on the SOCRadar platform.

SOCRadar Campaigns – ShadowRay

You can also read about campaigns we have previously covered on the Campaigns page of SOCRadar Labs, which provides free access to some of SOCRadar’s greatest features.