In 2019, attackers phoned the CEO of a UK-based energy firm, a subsidiary of a German parent company. The voice on the line was indistinguishable from that of the parent company’s chief executive: same accent, same cadence, same conversational rhythm. The caller instructed the CEO to urgently wire €220,000 (approximately $243,000 USD) to a Hungarian supplier. The wire went through. The voice was AI-generated.

That incident was the opening shot of a category of fraud that has since grown into a multi-billion-dollar threat. In 2024, a Hong Kong finance employee watched a video call in which his CFO and several colleagues authorized a $25.6 million transfer. Every person on that call was a deepfake. The money was gone within hours.

This post breaks down how deepfake CEO fraud works at a technical level — the tooling, the attack chain, how to detect synthesized media, and the procedural and technical controls that can stop a wire transfer before it leaves your bank.


The Threat Landscape: BEC Meets Generative AI

Business Email Compromise (BEC) has ranked among the FBI’s highest-loss cybercrime categories for years. The FBI IC3 2023 Internet Crime Report recorded adjusted losses of $2.94 billion from BEC complaints, and that figure reflects only reported cases, so the real number is higher.

Traditional BEC relies on impersonating an executive via email: spoofed domains, lookalike addresses, or compromised inboxes. Defenders adapted by training employees to call and verify. Deepfake fraud attacks that verification step directly.

The technology enabling this escalation is no longer experimental:

  • Voice cloning: Services like ElevenLabs, Play.ht, and open-source frameworks like Coqui TTS can produce highly realistic voice synthesis from a short audio sample.
  • Real-time voice conversion: Tools like RVC (Retrieval-based Voice Conversion) can transform a live voice in near-real-time, allowing attackers to speak naturally while the output sounds like the target.
  • Deepfake video: Face-swap and full-head synthesis tools can animate a static image or swap a face into a live video stream, as demonstrated in the Hong Kong incident.

The asymmetry is stark: an attacker needs 30–60 seconds of clean audio (available from earnings calls, YouTube interviews, conference talks, or LinkedIn videos) to build a working voice clone. The target employee needs to make the right call in seconds under social pressure from what sounds like their boss.


Real Incidents

2019 — UK Energy Firm, $243,000

In March 2019, the CEO of a UK energy firm received what he believed was a call from the chief executive of the firm’s German parent company. The caller instructed him to wire €220,000 to a Hungarian supplier “within the hour” for a confidential business reason. The money was transferred. The voice on the call was AI-synthesized, according to the firm’s insurer, Euler Hermes, which covered the loss. The attackers called back twice: once to claim that reimbursement was on its way (to prevent the wire from being recalled) and a second time from an Austrian phone number, at which point the CEO grew suspicious. By then, the funds had been forwarded to Mexico.

2024 — Hong Kong Multinational, $25.6 Million

In January 2024, a finance employee at the Hong Kong office of an unnamed multinational received what appeared to be a phishing email requesting a confidential transaction. Skeptical, the employee attended a video conference call where he saw and heard colleagues — including the company’s CFO — authorize the transfers. All participants except the employee were deepfakes generated from publicly available video footage. The employee processed 15 transactions totaling HK$200 million (~$25.6 million USD). Hong Kong police made arrests in the case and disclosed it publicly in February 2024.


The Attack Chain

Step 1: Target Selection and OSINT

Attackers identify the target organization and map the approval chain for wire transfers: who authorizes, who executes, and what the typical escalation path looks like. This is frequently available through LinkedIn (job titles, reporting structures), company websites (executive team pages), and press releases.

The voice target — typically a CEO, CFO, or senior executive — is identified. Source audio is gathered from:

  • Earnings call recordings (company investor-relations pages, Seeking Alpha)
  • YouTube interviews or conference keynotes
  • Podcast appearances
  • LinkedIn or Twitter/X video posts

A 60-second clip of clean speech is sufficient for a basic clone. Longer samples improve naturalness.

Step 2: Voice Sample Processing and Clone Building

The attacker prepares the audio sample:

# Extract mono 22.05 kHz 16-bit PCM audio from the source video using ffmpeg
ffmpeg -i raw_ceo_interview.mp4 -vn -acodec pcm_s16le -ar 22050 -ac 1 raw_audio.wav
# Trim leading silence
ffmpeg -i raw_audio.wav -af "silenceremove=start_periods=1:start_silence=0.5:start_threshold=-50dB" clean_audio.wav

# Optional: denoise with SoX (noise.prof is a noise profile created beforehand with SoX's noiseprof effect)
sox clean_audio.wav denoised_audio.wav noisered noise.prof 0.21

The processed audio is used to build a speaker embedding in a voice cloning framework. The following is a conceptual demonstration only — showing that the technology exists and is accessible, not a how-to:

# Conceptual example: how open-source TTS voice cloning works
# This is representative of documented open-source frameworks (e.g., Coqui TTS)
# NOT a functional exploit; shown to illustrate technology accessibility

# A typical voice cloning API call structure:
# 1. Load a pre-trained multi-speaker TTS model
# 2. Encode the reference speaker audio to a voice embedding
# 3. Synthesize new speech in the reference speaker's voice

# from TTS.api import TTS
# model = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")
# model.tts_to_file(
#     text="Please wire $200,000 to the account I'm sending you now.",
#     speaker_wav="clean_audio.wav",  # 30-60 seconds of target voice
#     language="en",
#     file_path="synthesized_ceo.wav"
# )

# Result: a .wav file in the target executive's voice
# The full pipeline from audio sample to synthesized output takes minutes
print("Voice cloning is accessible with minimal skill and near-zero cost.")
print("The technology gap between attacker and defender is narrow.")

Step 3: Delivery — The Urgent Phone Call

The attack is typically executed as a spoofed phone call (caller ID spoofing is trivial with VoIP services) or a fabricated conference call link. The social engineering script follows a consistent pattern:

  • Urgency: “I need this done in the next 30 minutes before the deal closes.”
  • Secrecy: “This is confidential — don’t discuss it with anyone else yet.”
  • Authority pressure: The caller is the employee’s direct superior or a senior executive.
  • Isolation: Instructions to avoid normal approval channels.

The target employee authorizes a wire transfer to an attacker-controlled account. Requests are often split into smaller amounts to stay under internal approval limits or bank transaction-monitoring thresholds.
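
The pressure pattern above can be turned into a simple triage aid for finance teams. The sketch below is a minimal illustration, assuming the request is available as text (an email body or a call note). The phrase lists and the `red_flags` function are hypothetical, and keyword matching will miss paraphrases, so it supplements the procedural controls rather than replacing them.

```python
import re

# Hypothetical phrase lists for illustration; tune them to your own environment.
RED_FLAGS = {
    "urgency": [r"\burgent(ly)?\b", r"\bimmediately\b", r"\bwithin the hour\b",
                r"\bnext \d+ minutes\b", r"\bright now\b"],
    "secrecy": [r"\bconfidential\b", r"\bdon'?t (discuss|tell|mention)\b",
                r"\bkeep this (quiet|between us)\b"],
    "isolation": [r"\bskip (the )?(usual|normal) (process|approval)\b",
                  r"\bbypass\b", r"\bno need to (loop in|involve)\b"],
}

def red_flags(request_text: str) -> set[str]:
    """Return the pressure-tactic categories found in a payment request."""
    text = request_text.lower()
    return {
        category
        for category, patterns in RED_FLAGS.items()
        if any(re.search(pattern, text) for pattern in patterns)
    }

note = ("I need this done in the next 30 minutes before the deal closes. "
        "This is confidential, don't discuss it with anyone else yet.")
print(sorted(red_flags(note)))
```

A request that trips two or more categories is a reasonable candidate for mandatory escalation, whatever the caller sounds like.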

Step 4: Money Movement

Funds are typically moved through layered accounts — from the initial transfer to accounts in jurisdictions with limited recovery cooperation, then quickly converted or forwarded. Recall windows are narrow (often 24–72 hours on international wires before the funds are moved beyond reach).


Detection: Identifying Synthesized Audio and Video

Spectrogram Analysis

Synthesized speech exhibits characteristic artifacts that differ from natural human speech:

# Generate a spectrogram from an audio file using SoX
sox suspicious_voicemail.wav -n spectrogram -o spectrogram.png

# Install and use spek (GUI spectrogram analyzer) for visual inspection
# Key artifacts to look for:
# - Abrupt frequency cutoffs (unnatural for human voice)
# - Absence of breath sounds between sentences
# - Unnatural formant transitions at word boundaries
# - Uniform energy distribution lacking natural variation

In a genuine voice recording, you will see irregular energy patterns, sibilance variation, and breath noise. TTS-generated audio often shows unnaturally clean spectrograms with mechanical regularity.
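
One of the artifacts listed above, uniform energy lacking natural variation, can be roughly quantified without a spectrogram. The sketch below is a crude heuristic rather than a detector: it assumes mono 16-bit PCM input and reports the coefficient of variation of per-frame RMS energy. The function names and the 25 ms frame size are illustrative choices, and modern synthesis can reproduce pauses and breaths, so a "natural-looking" score proves nothing on its own.

```python
import math
import struct
import wave

def read_mono_pcm16(path: str) -> tuple[list[int], int]:
    """Read a mono 16-bit PCM WAV file into a list of samples plus its rate."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM")
        raw = wav.readframes(wav.getnframes())
        samples = list(struct.unpack(f"<{len(raw) // 2}h", raw))
        return samples, wav.getframerate()

def energy_variation(samples: list[int], rate: int, frame_ms: int = 25) -> float:
    """Coefficient of variation (std / mean) of per-frame RMS energy."""
    size = max(1, rate * frame_ms // 1000)
    rms = [
        math.sqrt(sum(s * s for s in samples[i:i + size]) / size)
        for i in range(0, len(samples) - size + 1, size)
    ]
    if not rms:
        raise ValueError("audio shorter than one frame")
    mean = sum(rms) / len(rms)
    if mean == 0:
        return 0.0
    variance = sum((r - mean) ** 2 for r in rms) / len(rms)
    return math.sqrt(variance) / mean

# Synthetic demo signals: a steady tone versus the same tone with quiet gaps.
rate = 8000
steady = [int(10000 * math.sin(2 * math.pi * 200 * n / rate)) for n in range(rate)]
bursty = [s if (n // 2000) % 2 == 0 else s // 10 for n, s in enumerate(steady)]
print(energy_variation(steady, rate) < energy_variation(bursty, rate))
```

For a real file, `read_mono_pcm16("suspicious_voicemail.wav")` supplies the samples. Compare the result against known-genuine recordings of the same speaker on the same channel, since an absolute cutoff would be an assumption.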

Metadata Analysis of Suspicious Audio/Video Files

# Inspect metadata of received video/audio files
ffprobe -v quiet -print_format json -show_format -show_streams suspicious_video.mp4

# Fields to examine:
# - encoder: unusual encoding software (e.g., deepfake rendering tools)
# - creation_time: mismatched with claimed recording time
# - duration: compare against the stated meeting length
# - Audio sample rate: many TTS systems output 22050 or 24000 Hz, while recordings are usually 44100 or 48000 Hz (resampling hides this)

# Compare the nominal and average video frame rates
ffprobe -v error -select_streams v:0 -show_entries stream=r_frame_rate,avg_frame_rate -of default=noprint_wrappers=1 suspicious_video.mp4
# A mismatch indicates a variable frame rate or dropped frames; treat it as a prompt for closer review, since screen recordings and conferencing software also produce it
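
The JSON that ffprobe emits can be screened automatically before a human looks at the file. The sketch below assumes the report has already been captured (for example, from the first command above) and parsed into a dictionary. The `review_ffprobe` function and the `TTS_LIKE_RATES` set are hypothetical, and every note it produces is a prompt for review, not evidence of forgery: any file that has passed through ffmpeg carries a `Lavf` encoder tag, for instance.

```python
import json
from fractions import Fraction
from typing import Optional

# Assumed for illustration: output rates commonly seen from TTS systems.
TTS_LIKE_RATES = {16000, 22050, 24000}

def parse_rate(value: str) -> Optional[Fraction]:
    """Parse an ffprobe rate string such as '30000/1001'; None if undefined."""
    num, _, den = value.partition("/")
    if not num or not den or int(den) == 0:
        return None
    return Fraction(int(num), int(den))

def review_ffprobe(report: dict) -> list[str]:
    """Return notes on fields that merit a closer look."""
    notes = []
    tags = report.get("format", {}).get("tags", {})
    if "creation_time" not in tags:
        notes.append("no creation_time tag")
    if "encoder" in tags:
        notes.append(f"encoder: {tags['encoder']}")
    for stream in report.get("streams", []):
        kind = stream.get("codec_type")
        if kind == "audio" and int(stream.get("sample_rate", 0)) in TTS_LIKE_RATES:
            notes.append(f"audio sample rate {stream['sample_rate']} Hz")
        elif kind == "video":
            nominal = parse_rate(stream.get("r_frame_rate", ""))
            average = parse_rate(stream.get("avg_frame_rate", ""))
            if nominal and average and nominal != average:
                notes.append(
                    f"nominal {float(nominal):.2f} fps vs average {float(average):.2f} fps"
                )
    return notes

# Made-up report in the shape ffprobe produces with -print_format json.
sample = json.loads("""
{"format": {"tags": {"encoder": "Lavf60.3.100"}},
 "streams": [
   {"codec_type": "video", "r_frame_rate": "30/1", "avg_frame_rate": "2997/125"},
   {"codec_type": "audio", "sample_rate": "22050"}
 ]}
""")
for note in review_ffprobe(sample):
    print(note)
```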

Deepfake Detection Tools

Several dedicated tools exist for media authentication:

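  • Microsoft Video Authenticator: Analyzes images and video for blending artifacts at facial boundaries and produces a confidence score. It was announced in 2020 and made available to selected organizations rather than as a public download.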
  • Sensity AI (formerly Deeptrace): Commercial platform for deepfake video and audio detection, used by media organizations and financial institutions.
  • Resemble Detect: Audio-specific deepfake detection API, can classify whether audio was generated by known TTS systems.
  • FakeCatcher (Intel): Real-time deepfake video detection based on photoplethysmography (subtle blood flow signals in facial video that synthesized faces tend not to reproduce).

Voice Biometric Verification Systems

Financial institutions and contact centers increasingly deploy voice biometrics (e.g., Nuance Gatekeeper, Pindrop) that create a voiceprint baseline for enrolled users and flag deviations consistent with synthesis:

  • Liveness detection (challenges that require spontaneous responses)
  • Acoustic environment analysis (inconsistent background noise patterns)
  • Voiceprint matching against enrolled baseline
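
Voiceprint matching is commonly described as comparing a fixed-length embedding of the caller's speech against the enrolled one. The commercial products named above are proprietary, so the sketch below only illustrates the general idea with cosine similarity. The embeddings, the `matches_voiceprint` function, and the 0.75 threshold are all hypothetical.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero-length embedding")
    return dot / norm

def matches_voiceprint(enrolled: list[float], candidate: list[float],
                       threshold: float = 0.75) -> bool:
    """True if the candidate embedding is close enough to the enrolled one."""
    return cosine_similarity(enrolled, candidate) >= threshold

# Toy 3-dimensional embeddings; real speaker embeddings have hundreds of dimensions.
enrolled = [0.9, 0.1, 0.4]
same_speaker = [0.8, 0.2, 0.5]
other_speaker = [-0.2, 0.9, 0.1]
print(matches_voiceprint(enrolled, same_speaker))
print(matches_voiceprint(enrolled, other_speaker))
```

A good clone is built to land close to the enrolled voiceprint, which is why similarity alone is not sufficient and the liveness and acoustic-environment checks listed above matter.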

Defense and Mitigation Controls

Procedural Controls (Highest Impact)

1. Mandatory Out-of-Band Callback for Wire Transfers

Any wire transfer request received by phone, email, or any other channel must be verified by calling the requester back on their known, company-directory phone number, never on a number provided in the request. This single control would likely have stopped the 2019 UK energy firm fraud, along with many conventional BEC incidents.
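
If payment requests pass through a ticketing or workflow tool, the callback rule can be enforced in code rather than left to memory. The sketch below is a minimal illustration: the `DIRECTORY` mapping, the `WireRequest` type, and the phone numbers are hypothetical placeholders for whatever HR or identity system holds the authoritative contact details.

```python
from dataclasses import dataclass

# Hypothetical directory; in practice this comes from HR or identity systems.
DIRECTORY = {"ceo@example.com": "+44 20 7946 0000"}

@dataclass(frozen=True)
class WireRequest:
    requester: str        # identity the caller or sender claims
    amount: float
    callback_number: str  # number supplied in the request; never trusted

def callback_number_for(request: WireRequest) -> str:
    """Return the directory number to call back, ignoring the supplied one."""
    try:
        return DIRECTORY[request.requester]
    except KeyError:
        raise LookupError(
            f"{request.requester} is not in the directory; escalate"
        ) from None

req = WireRequest("ceo@example.com", 220_000.0, "+43 1 000 0000")
print(callback_number_for(req))
```

The key design choice is that `callback_number` from the request is stored for the audit trail but never returned.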

2. Dual Authorization Thresholds

Implement tiered authorization requiring two independent approvals for transfers above defined thresholds (e.g., $10,000, $50,000, $100,000). No single employee should be able to authorize a large transfer based solely on a phone request.
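
A tiered rule like this is straightforward to express as a lookup. In the sketch below the tier boundaries and approval counts are assumptions chosen for illustration, and `is_authorized` is a hypothetical helper. The point it demonstrates is that the requester never counts as one of their own approvers.

```python
# Hypothetical tiers: (minimum amount in USD, approvals required), highest first.
TIERS = [(100_000, 3), (10_000, 2), (0, 1)]

def approvals_required(amount: float) -> int:
    """Number of independent approvals needed for a transfer of this size."""
    for minimum, required in TIERS:
        if amount >= minimum:
            return required
    raise ValueError("amount must be non-negative")

def is_authorized(amount: float, requester: str, approvers: set[str]) -> bool:
    """True if enough people other than the requester have approved."""
    independent = approvers - {requester}
    return len(independent) >= approvals_required(amount)

print(is_authorized(25_000, "alice", {"alice", "bob"}))
print(is_authorized(25_000, "alice", {"bob", "carol"}))
```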

3. Shared Company Code Words

Establish a per-employee or per-team verbal codeword known only to internal staff, rotated quarterly. Any executive calling to authorize an urgent transaction should be asked to provide the current codeword. This is simple, zero-cost, and immediately effective.

4. Time Delay and Cooling Period

Implement a mandatory 24-hour delay on first-time payees or any transfer to a new account number. Attackers rely on urgency — a delay breaks the social engineering pressure.
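
The cooling period reduces to a single date calculation. The sketch below assumes the payment system can supply the set of previously paid account identifiers. The `earliest_release` function and the account labels are hypothetical.

```python
from datetime import datetime, timedelta, timezone

COOLING_PERIOD = timedelta(hours=24)

def earliest_release(requested_at: datetime, payee_account: str,
                     known_accounts: set[str]) -> datetime:
    """Earliest time a transfer may leave: delayed for first-time payees."""
    if payee_account in known_accounts:
        return requested_at
    return requested_at + COOLING_PERIOD

requested = datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc)
known = {"ACC-KNOWN-0001"}  # hypothetical account identifiers
print(earliest_release(requested, "ACC-NEW-0002", known).isoformat())
```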

Technical Controls

5. Caller ID Verification

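Check STIR/SHAKEN attestation on inbound calls where your carrier or SIP trunk exposes it (the FCC requires US voice providers to implement the framework on the IP portions of their networks). Calls with spoofed numbers may fail verification, carry a lower attestation level, or be shown with a “Likely Spam” label. Coverage is incomplete, particularly for international and non-IP call paths, so this raises friction rather than stopping spoofed calls outright.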
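
Where a SIP trunk passes the SIP `Identity` header through to your PBX, the SHAKEN attestation level can be read from the PASSporT token it carries. The sketch below assumes that header is available, which depends on the carrier. It decodes the `attest` claim (`A`, `B`, or `C`) but deliberately does not verify the signature, so treat it as a logging aid only and leave verification to the carrier or a dedicated verification service. The demo token is fabricated for illustration.

```python
import base64
import json

def passport_attestation(identity_header: str) -> str:
    """Extract the SHAKEN 'attest' claim WITHOUT verifying the signature."""
    token = identity_header.split(";", 1)[0].strip()
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a compact JWS")
    payload_b64 = parts[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore base64 padding
    claims = json.loads(base64.urlsafe_b64decode(padded))
    return claims["attest"]

def _b64(obj: dict) -> str:
    """Helper for the demo only: base64url-encode a JSON object without padding."""
    raw = json.dumps(obj, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

# Fabricated header for illustration; the signature part is a placeholder.
demo_header = ".".join([
    _b64({"alg": "ES256", "ppt": "shaken", "typ": "passport",
          "x5u": "https://cert.example.org/sp.pem"}),
    _b64({"attest": "C", "orig": {"tn": "12025550100"}}),
    "placeholder-signature",
]) + ";info=<https://cert.example.org/sp.pem>;alg=ES256;ppt=shaken"

print(passport_attestation(demo_header))
```

An inbound call that claims an executive's number but arrives with gateway-level (`C`) attestation, or with no `Identity` header at all, is a reasonable trigger for the out-of-band callback.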

6. Voice Biometrics for Internal Escalation Lines

For high-privilege actions like wire transfers, require callers to authenticate via an enrolled voice biometric before proceeding. Synthetic voices that deviate from the enrolled voiceprint trigger a hold.

7. Email Authentication (SPF, DKIM, DMARC)

Ensure your domain has strict DMARC enforcement (p=reject) to prevent spoofing. This does not stop deepfake calls but closes the email vector of the same attack pattern.
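
A published DMARC policy is a TXT record at `_dmarc.<your-domain>` made of semicolon-separated `tag=value` pairs. The sketch below assumes the record text has already been retrieved (for example, with a DNS lookup tool) and checks whether it enforces rejection for all mail. The function names are hypothetical, and the example records are made up.

```python
def parse_dmarc(record: str) -> dict[str, str]:
    """Split a DMARC TXT record into a tag -> value mapping."""
    tags = {}
    for part in record.split(";"):
        if "=" in part:
            key, _, value = part.partition("=")
            tags[key.strip().lower()] = value.strip()
    return tags

def enforces_reject(record: str) -> bool:
    """True if the policy is p=reject and applies to 100% of messages."""
    tags = parse_dmarc(record)
    return (
        tags.get("v") == "DMARC1"
        and tags.get("p", "").lower() == "reject"
        and int(tags.get("pct", "100")) == 100  # pct defaults to 100 when absent
    )

print(enforces_reject("v=DMARC1; p=reject; rua=mailto:dmarc@example.com"))
print(enforces_reject("v=DMARC1; p=none"))
```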

8. Security Awareness Training with Live Deepfake Demos

Generic phishing training is insufficient. Employees must hear and see a deepfake of a voice they recognize — ideally a simulated version of their own manager — to understand the quality of current synthesis. Vendors like KnowBe4 and Proofpoint offer voice deepfake simulation exercises.

9. Incident Response Runbook for Suspected Fraud

Define a clear playbook. If a wire transfer has been executed based on a suspicious request, the immediate steps are to (1) contact the sending bank’s wire recall team, (2) file a complaint with the FBI’s IC3, and (3) contact the recipient bank’s fraud department. Recovery is sometimes possible within the first 24–72 hours if the incident is escalated quickly.


MITRE ATT&CK and MITRE ATLAS Mapping

| Technique | ID | Description |
|---|---|---|
| Phishing for Information | T1598 | OSINT gathering for voice samples |
| Impersonation | T1656 | Executive voice/face impersonation |
| Obtain Capabilities: Artificial Intelligence | T1588.007 | Generative AI tooling for synthetic media used in social engineering |
| Financial Theft | T1657 | Wire transfer fraud via impersonation (BEC) |


References