What is FACS and how does EchoDepth use it?

FACS (Facial Action Coding System) is the scientific gold standard for facial expression analysis, developed by Paul Ekman and Wallace Friesen. It describes every visible facial movement through 44 numbered Action Units (AUs). EchoDepth tracks all 44 AUs per frame in real time. Because many AU activations are involuntary and hard to suppress, they provide a reliable emotional signal that does not depend on self-report.

What is the VAD model?

VAD stands for Valence (negative to positive emotional state), Arousal (calm to high activation) and Dominance (submissive to in control). EchoDepth outputs a continuous score from 0.0 to 1.0 on each of the three dimensions for every frame analysed, giving a multidimensional picture of emotional state.

Does EchoDepth work across different ethnicities and cultures?

EchoDepth was trained on data collected across 6 countries and representing 14 cultural cohorts, specifically to reduce demographic bias. Cultural expression variance is modelled rather than averaged. Cavefish will not deploy EchoDepth in contexts where bias could produce discriminatory outcomes.

The Technology

Making the invisible
measurable.

EchoDepth analyses text, image, voice and video — adapting to your interaction format, whether phone calls, video interviews, written correspondence or document review. It translates involuntary emotional signals into structured, quantified data in real time, at scale, with no specialist hardware required.

44 Action Units | VAD Model | GDPR Compliant | Text · Image · Voice · Video

Facial Action Coding System

44 Action Units.
One universal language.

The Facial Action Coding System (FACS) is the scientific gold standard for facial expression analysis — a taxonomy developed by Paul Ekman and Wallace Friesen that describes every visible movement of the human face through discrete, numbered Action Units (AUs).

EchoDepth tracks all 44 observable Action Units per frame, per person. Because many AU activations are involuntary and cannot easily be consciously controlled, they provide a reliable signal that is independent of self-report.

AU1 & AU4 (inner brow raiser / brow lowerer): core stress markers
AU6 & AU12 (cheek raiser / lip corner puller): genuine vs masked emotion
AU17 & AU24 (chin raiser / lip pressor): suppression and withholding signals
Temporal coherence: scored across the full session, not single frames

Example AU activation pattern

AU1 (Inner brow raiser): 0.72
AU4 (Brow lowerer): 0.64
AU6 (Cheek raiser): 0.18
AU17 (Chin raiser): 0.55
AU24 (Lip pressor): 0.61

AU pattern consistent with suppressed stress and cognitive load. Session flag: elevated.
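
As a rough illustration of how an activation pattern like this could be represented and flagged, the Python sketch below stores the example values in a dictionary and applies a simple rule. The `flag_session` function, the AU groupings and the 0.5 threshold are assumptions made for this example only; they are not EchoDepth's scoring logic.

```python
# Hypothetical sketch: AU activations as intensities in the range 0.0-1.0.
# The rule and threshold are illustrative assumptions, not EchoDepth's method.

example_activations = {
    "AU1": 0.72,   # inner brow raiser
    "AU4": 0.64,   # brow lowerer
    "AU6": 0.18,   # cheek raiser
    "AU17": 0.55,  # chin raiser
    "AU24": 0.61,  # lip pressor
}

STRESS_AUS = ("AU1", "AU4")
SUPPRESSION_AUS = ("AU17", "AU24")


def mean_activation(activations, units):
    """Average intensity over the given AUs, treating missing AUs as 0.0."""
    return sum(activations.get(u, 0.0) for u in units) / len(units)


def flag_session(activations, threshold=0.5):
    """Return 'elevated' when both stress and suppression markers are high."""
    stress = mean_activation(activations, STRESS_AUS)
    suppression = mean_activation(activations, SUPPRESSION_AUS)
    if stress >= threshold and suppression >= threshold:
        return "elevated"
    return "normal"


print(flag_session(example_activations))
```

A production system would score the pattern over the whole session rather than a single snapshot; the single dictionary here only keeps the example short.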

Output Model

Three dimensions.
One complete picture.

EchoDepth outputs a continuous VAD score — Valence, Arousal, Dominance — the three-dimensional model of emotional state that underpins modern affective computing.

Valence

The positive-to-negative dimension. High valence indicates a positive, comfortable emotional state. Low valence indicates distress, displeasure or anxiety.

In fintech: A valence drop during a specific question in a claims interview may indicate discomfort with that topic.

Arousal

The calm-to-excited dimension. Elevated arousal indicates heightened physiological activation — which may reflect stress, urgency, fear or excitement.

In fintech: Sustained high arousal in a mortgage interview correlates with elevated cognitive load — a potential indicator of rehearsed or constructed responses.

Dominance

The submissive-to-in-control dimension. High dominance indicates confidence and control. Low dominance indicates a sense of vulnerability or powerlessness.

In fintech: A sudden dominance drop mid-session can indicate the subject has encountered a question they were not prepared for.
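
To make the idea of a mid-session "drop" concrete, the sketch below models per-frame VAD scores and looks for frames where one dimension falls well below its recent average. The `VADScore` class, the `find_drops` helper, the window size and the drop threshold are all hypothetical; the source does not describe how EchoDepth detects such events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VADScore:
    """One frame's scores; each dimension is continuous in 0.0-1.0."""
    valence: float
    arousal: float
    dominance: float


def find_drops(scores, dimension, window=3, min_drop=0.2):
    """Return frame indices where `dimension` falls by at least `min_drop`
    relative to the mean of the preceding `window` frames."""
    values = [getattr(s, dimension) for s in scores]
    drops = []
    for i in range(window, len(values)):
        baseline = sum(values[i - window:i]) / window
        if baseline - values[i] >= min_drop:
            drops.append(i)
    return drops


# Made-up session data for illustration only.
session = [
    VADScore(0.60, 0.40, 0.70),
    VADScore(0.62, 0.41, 0.71),
    VADScore(0.61, 0.43, 0.69),
    VADScore(0.59, 0.55, 0.40),  # dominance falls sharply here
    VADScore(0.58, 0.57, 0.42),
]

print(find_drops(session, "dominance"))
print(find_drops(session, "valence"))
```

Comparing against a short rolling baseline, rather than the previous frame alone, keeps a single noisy frame from being reported as a drop.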

Multimodal Analysis

Four signal streams.
One emotional picture.

EchoDepth combines four input modalities to build the most complete picture of emotional state — adapting to the interaction format, whether video call, phone call, or recorded session:

📹

Video / Facial

44-AU FACS analysis at up to 30 fps. Temporal coherence scoring across the full session window. Primary modality for video calls and recorded interviews.

🎙️

Voice / Prosody

Pitch, rate, energy and micro-pause analysis. Detects vocal stress markers independent of content. Primary modality for phone-based vulnerability detection — where most collections and complaint interactions occur.

🖼️

Image / Still Frame

Single-frame AU analysis for non-video contexts — document verification, ID check photos, or post-session review of captured stills.

💬

Text / Linguistic

Sentiment, hedging, temporal inconsistency and confidence markers in transcribed speech or written correspondence.
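
One simple way to combine several signal streams when not all of them are present is a weighted average over whichever modalities are available. The sketch below shows that approach; the `fuse_scores` function and the weights are assumptions for illustration, and the source does not state how EchoDepth actually fuses its modalities.

```python
def fuse_scores(per_modality, weights):
    """Weighted average of per-modality scores, using only modalities present.

    per_modality: dict of modality name -> score in 0.0-1.0
    weights: dict of modality name -> non-negative weight
    """
    available = {m: s for m, s in per_modality.items() if weights.get(m, 0.0) > 0}
    total = sum(weights[m] for m in available)
    if total == 0:
        raise ValueError("no weighted modality available")
    return sum(weights[m] * s for m, s in available.items()) / total


# Assumed weights, chosen only to make the example concrete.
WEIGHTS = {"video": 0.4, "voice": 0.3, "image": 0.1, "text": 0.2}

# Phone call: no video or image stream, so only voice and text contribute.
phone_call_arousal = {"voice": 0.8, "text": 0.5}
print(round(fuse_scores(phone_call_arousal, WEIGHTS), 2))
```

Renormalising by the total weight of the available modalities means a phone call and a video call both produce a score on the same 0.0 to 1.0 scale.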

Bias Reduction

Trained across
cultures, not just data.

Many emotion AI systems fail outside the demographic of their training data. EchoDepth was deliberately built to avoid this.

Training data collected across 6 countries
14 cultural cohorts represented in the model
Active bias auditing — cultural expression variance is modelled, not averaged
No reliance on posed expression datasets
Validated on spontaneous, naturalistic video — not lab conditions
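
A common starting point for a bias audit is to compare model error cohort by cohort instead of reporting a single pooled figure. The sketch below computes mean absolute error per cohort and the largest gap between cohorts. The cohort names, the data and both helper functions are invented for illustration and do not describe EchoDepth's audit procedure.

```python
from collections import defaultdict


def per_cohort_mae(records):
    """Mean absolute error per cohort.

    records: iterable of (cohort, predicted, reference) tuples.
    """
    errors = defaultdict(list)
    for cohort, predicted, reference in records:
        errors[cohort].append(abs(predicted - reference))
    return {c: sum(e) / len(e) for c, e in errors.items()}


def max_gap(mae_by_cohort):
    """Largest difference in error between any two cohorts."""
    values = list(mae_by_cohort.values())
    return max(values) - min(values)


# Made-up records: (cohort, predicted valence, reference valence).
records = [
    ("cohort_a", 0.70, 0.60),
    ("cohort_a", 0.40, 0.50),
    ("cohort_b", 0.80, 0.50),
    ("cohort_b", 0.30, 0.60),
]

mae = per_cohort_mae(records)
print(mae)
print(round(max_gap(mae), 2))
```

A pooled average would hide the fact that one cohort in this toy data is scored three times less accurately than the other, which is the kind of gap a per-cohort breakdown is meant to surface.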

EchoDepth will not be deployed in a context where cultural expression bias would produce discriminatory outcomes. Read our full methodology →

Privacy by Design

No biometric data stored.
No exceptions.

🔒

No raw video retained

Video is processed in memory. Frames are never stored. Only VAD scores and AU activations are output.

🇬🇧

GDPR compliant

Designed for UK and EU regulatory environments. FCA Regulatory Sandbox participant. Data residency options available.

🏭

On-device processing

Edge deployment option for organisations where video data cannot leave the premises. Full API feature parity.

✅

ISO 9001 infrastructure

Built on ISO 9001 and Cyber Essentials certified infrastructure designed for regulated environments.
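
The "no raw video retained" principle described above can be pictured as a streaming pipeline in which each frame is analysed and then discarded, so that only derived scores leave the loop. The following is a minimal sketch of that pattern; `analyse_frame` is a placeholder that returns a dummy value, and nothing here reflects EchoDepth's real pipeline or API.

```python
def analyse_frame(frame):
    """Placeholder analyser (hypothetical): maps raw bytes to a dummy score."""
    return {"valence": (sum(frame) % 101) / 100}


def process_stream(frames):
    """Yield one score record per frame; the frame itself is never kept.

    Each frame is referenced only inside the loop body, so it can be
    garbage-collected once its scores have been produced.
    """
    for index, frame in enumerate(frames):
        scores = analyse_frame(frame)
        yield {"frame": index, **scores}


# A generator of fake 3-byte "frames" stands in for a video source.
fake_frames = (bytes([i, i + 1, i + 2]) for i in range(3))
results = list(process_stream(fake_frames))
print(results)
```

Because both the source and the pipeline are generators, frames are pulled one at a time and the output records contain only a frame index and scores.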

Ready to go deeper?

Explore our methodology, the API documentation, or talk to the team about a proof of concept.

Methodology | Developer Docs
