ORION / PERCEPTION_LAYER

BROTEUS

Biometric Recognition & Object-Tracking Engagement with Universal Sensing

BROTEUS is a real-time vision system that detects objects, tracks hands, recognizes static gestures, and matches temporal hand animations, all running simultaneously in a single pipeline.

The entire system runs on CPU at ~21 FPS. No GPU required.

21 FPS · CPU Throughput
87% · Confidence
42 · Hand Keypoints
35-d · Feature Vector
Project Classification
Perception Pipeline
IN_DEVELOPMENT
Role Perception Layer
Port localhost:8100
Ecosystem ORION
Detector YOLO-World
Depth MiDaS (mono)
Hands MediaPipe
Language Python 3.11
Timeline Jan 2026 - Present
Built By
David Young
+ Swan Yi Htet
01 / Pipeline Overview

What It Does

A camera feed is processed through four parallel subsystems: YOLO-World finds objects the operator has specified, MediaPipe tracks both hands with full 3D skeleton data, a learning-first classifier identifies static hand gestures, and a DTW-based recognizer detects temporal hand animations.

Clicking a detected object triggers a grasp affordance heatmap showing optimal contact surfaces. Every subsystem runs simultaneously in real-time on CPU.

Parallel Subsystems
YOLO-World open-vocab detection
MediaPipe dual-hand 3D
Gesture (35-d) learning-first
Animation (DTW) temporal match
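
In code, the per-frame loop has roughly this shape. A minimal sketch with stub subsystems standing in for the real modules; the function names are illustrative, not BROTEUS's actual API:

```python
import cv2

# Stub subsystems standing in for the real modules, to show the loop shape.
def detect_objects(frame):        # YOLO-World, operator-defined classes
    return []

def track_hands(frame):           # MediaPipe, up to two hands
    return []

def classify_gesture(hand):       # 35-d static pose classifier
    return None

def push_animation_frame(hand):   # append 12-d features to the sliding window
    pass

def match_animation():            # DTW against stored recordings
    return None

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    objects = detect_objects(frame)
    for hand in track_hands(frame):
        gesture = classify_gesture(hand)
        push_animation_frame(hand)
    motion = match_animation()
cap.release()
```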
02 / Open-Vocabulary Detection
Demo video: obj_detect.mp4

Object Detection

Real-time object detection with a user-driven search list. Classes are added and removed on the fly.

BROTEUS uses YOLO-World, an open-vocabulary detection model. Traditional YOLO is locked to 80 COCO classes. YOLO-World accepts arbitrary text queries at runtime.

The system starts with zero classes. The operator adds object names through the UI, and BROTEUS begins searching for them. Removing a class is a single click. The search list persists to disk across restarts.

This matters because most detection systems have a hardcoded vocabulary baked into training. BROTEUS has none. The operator decides what exists in the scene.
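
With the Ultralytics YOLO-World API, the runtime vocabulary swap looks roughly like this. A sketch: the class names and confidence threshold are illustrative, and the UI wiring and disk persistence are not shown:

```python
from ultralytics import YOLOWorld

# Open-vocabulary checkpoint named in the tech stack.
model = YOLOWorld("yolov8s-worldv2.pt")

# The operator's search list: arbitrary text queries, editable at runtime.
search_list = ["coffee mug", "screwdriver", "keyboard"]
model.set_classes(search_list)

results = model.predict("frame.jpg", conf=0.25)

# Removing a class is just another set_classes call with the updated list.
search_list.remove("keyboard")
model.set_classes(search_list)
```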

Confidence 79-87%
Throughput 21 FPS (CPU)
Tracking IoU · Persistent IDs
Future Hardware

The detection backbone is a swappable module, structured so NVIDIA Isaac ROS components (RT-DETR, FoundationPose) can drop in when GPU hardware becomes available.

03 / Static Gesture Recognition
Demo video: gesture.mp4

Gesture Recognition

Dual-hand gesture recognition with independent left/right tracking. Each hand displays its gesture, action, confidence, and finger states in real-time.

BROTEUS tracks up to two hands simultaneously using MediaPipe's HandLandmarker, which provides 21 three-dimensional keypoints per hand, 42 in total.

Left and right hands are identified and tracked independently with separate classifiers, separate memory, and separate state. Each hand pose is compressed into a 35-dimensional feature vector.

Rotation-invariant encoding
Feature Vector Breakdown

The 35-Dimensional Encoding

35-d per hand pose:
5   finger curl angles
5   tip-to-palm distances
10  inter-tip distances
5   z-depth ratios
4   thumb proximity
3   palm normal
3   palm direction

The palm orientation features are what separate this from typical "is the finger up or down" approaches. When a hand rotates, the curl angles barely change, but the palm normal flips completely. That signal is encoded.
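
One plausible derivation of the palm normal from MediaPipe's fixed landmark indices (a sketch, not BROTEUS's exact code): take the cross product of two vectors spanning the palm.

```python
import numpy as np

# MediaPipe hand landmark indices (fixed by the model).
WRIST, INDEX_MCP, PINKY_MCP = 0, 5, 17

def palm_normal(landmarks: np.ndarray) -> np.ndarray:
    """Unit normal of the palm plane from a (21, 3) landmark array.

    The normal flips sign as the hand rotates over, which is the
    orientation signal that curl angles alone cannot provide.
    """
    v1 = landmarks[INDEX_MCP] - landmarks[WRIST]
    v2 = landmarks[PINKY_MCP] - landmarks[WRIST]
    n = np.cross(v1, v2)
    return n / (np.linalg.norm(n) + 1e-9)
```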

Teaching a Gesture
  1. Hold a hand pose in front of the camera
  2. Press Record to start capturing feature samples
  3. Slowly rotate the hand while holding the pose, capturing across multiple orientations
  4. The UI displays a live rotation coverage percentage (one possible metric is sketched below)
  5. Press Stop, name the gesture, assign an ORION action
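
The coverage percentage could be computed along these lines. This is an illustrative guess at the metric (the bin count and azimuth binning are assumptions), not the actual implementation:

```python
import numpy as np

def rotation_coverage(palm_normals: list[np.ndarray], bins: int = 12) -> float:
    """Fraction of orientation bins visited during recording (hypothetical metric).

    Bins the palm normal's azimuth angle; 1.0 means the pose was shown
    from all the way around.
    """
    hit = set()
    for n in palm_normals:
        azimuth = np.arctan2(n[1], n[0])  # angle in the x-y plane
        hit.add(int((azimuth + np.pi) / (2 * np.pi) * bins) % bins)
    return len(hit) / bins
```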
Classification Strategy

Classification checks learned gestures first (cosine similarity against multi-sample clusters), then falls back to geometric rules for common poses like open palm, fist, or point.

Left and right hands maintain completely separate memory files. No cross-contamination.
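
In sketch form, the learned-first strategy looks like this. The similarity threshold and the `geometric_fallback` stub are assumptions, not BROTEUS's actual values:

```python
import numpy as np

def geometric_fallback(feat: np.ndarray) -> str:
    # Stand-in for the rule-based layer (open palm, fist, point, ...).
    return "unknown"

def classify(feat: np.ndarray, learned: dict[str, np.ndarray],
             threshold: float = 0.92) -> str:
    """Learned gestures first, geometric rules second.

    `learned` maps gesture name -> (k, 35) array of recorded samples.
    """
    f = feat / (np.linalg.norm(feat) + 1e-9)
    best_name, best_sim = None, threshold
    for name, samples in learned.items():
        s = samples / (np.linalg.norm(samples, axis=1, keepdims=True) + 1e-9)
        sim = float(np.max(s @ f))  # best cosine match within the cluster
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_name is not None else geometric_fallback(feat)
```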

04 / Temporal Motion Recognition
Demo video: animation.mp4

Animation Recognition

Recognizing a learned hand animation in real-time. The purple banner indicates a temporal gesture match.

Static gestures capture frozen poses. Animations capture movements: a beckoning curl, a wave, a circular spin motion. This is a fundamentally different recognition problem.

BROTEUS solves it with Dynamic Time Warping (DTW). Each frame, a 12-dimensional temporal feature vector is extracted (finger curls + palm normal + position + velocity) and pushed into a sliding window covering the last ~3 seconds.

Every few frames, DTW compares that window against all stored animation recordings.

Why DTW

DTW handles speed variation naturally. A fast wave and a slow wave produce different frame counts but share the same underlying motion shape.

DTW warps one sequence onto the other and measures alignment cost. A Sakoe-Chiba band constraint keeps the warping physically reasonable.
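
A compact sketch of the banded comparison; the band radius and length normalization are illustrative choices, not BROTEUS's exact parameters:

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray, band: int = 15) -> float:
    """DTW alignment cost between two (T, 12) feature sequences.

    The Sakoe-Chiba band only allows cells with |i - j| <= band, keeping
    the warp path near the diagonal (length-difference scaling omitted).
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m]) / (n + m)  # length-normalized alignment cost
```

A stored animation would count as a match when this normalized cost falls below a threshold.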

Teaching an Animation
  1. Press Record
  2. Perform the motion for 2-3 seconds
  3. Press Stop, name it

Recording the same motion 2-3 times at different speeds improves matching robustness.

05 / Grasp Affordance

Grasp Intelligence

Clicking a detected object activates focus mode, which computes a grasp affordance heatmap. Every surface point is scored on four criteria.

Scores render as a continuous green-to-red heatmap overlay. Green indicates an ideal contact surface. Red indicates a poor one.

Depth data comes from MiDaS monocular estimation. One RGB camera, no depth sensor needed. Sobel-based surface normals are computed from the depth map for the normal alignment criterion.

01 Normal Alignment · Can a gripper approach from this angle?
02 Depth Consistency · Is this a flat, stable region?
03 Edge Proximity · Distance from the object boundary
04 Centroid Balance · Is the grasp centered for stability?
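
The normal-alignment input can be sketched directly from the depth map via Sobel gradients; scaling against real camera intrinsics is omitted here:

```python
import cv2
import numpy as np

def surface_normals(depth: np.ndarray) -> np.ndarray:
    """Per-pixel unit normals from an (H, W) depth map via Sobel gradients."""
    depth = depth.astype(np.float32)
    dzdx = cv2.Sobel(depth, cv2.CV_32F, 1, 0, ksize=3)
    dzdy = cv2.Sobel(depth, cv2.CV_32F, 0, 1, ksize=3)
    normals = np.dstack((-dzdx, -dzdy, np.ones_like(depth)))
    return normals / (np.linalg.norm(normals, axis=2, keepdims=True) + 1e-9)
```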
06 / System Architecture

Architecture

BROTEUS Server (FastAPI · Port 8100)
  YOLO-World · Detection
  MediaPipe · Dual Hands · Gesture 35-d · Anim 12-d + DTW
  MiDaS · Depth
  IoU Object Tracker · Persistent IDs · Class-vote stability
  Grasp Affordance Scorer
  WebSocket Frame Streaming
        ↓ WebSocket / JSON
Live Dashboard (Browser · localhost:8100)
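
The dashboard link reduces to a single WebSocket endpoint. A minimal FastAPI sketch; the route and JSON field names are illustrative, not BROTEUS's actual schema:

```python
import asyncio

from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws/frames")
async def stream(ws: WebSocket):
    await ws.accept()
    while True:
        payload = {
            "objects": [],      # YOLO-World detections with persistent IDs
            "hands": [],        # per-hand gesture, action, confidence
            "animation": None,  # current DTW match, if any
        }
        await ws.send_json(payload)
        await asyncio.sleep(1 / 21)  # pace at the pipeline's ~21 FPS
```

Served with uvicorn on port 8100, the browser dashboard connects and renders each JSON frame.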
07 / Ecosystem Context

Part of ORION

BROTEUS is the perception layer of a modular robotics ecosystem. The target hardware is an SO-ARM 101, a 6-DOF arm where BROTEUS provides perception, CHIRON drives the joints, and DAEDALUS closes the sim-to-real gap.

ATHENA · Navigation · Pathfinding · Terrain
BROTEUS · Perception · Grasp intel · Gestures
CHIRON · Motor cortex · IK solver · Sequencer
DAEDALUS · Physics discovery · SINDy
RL PIPELINE · PPO / SAC · ONNX · 50-200 Hz

BROTEUS sees. ORION decides. CHIRON moves. DAEDALUS calibrates.

08 / Technical Stack

Tech Stack

Detection YOLO-World (yolov8s-worldv2)
Hands MediaPipe HandLandmarker
Depth MiDaS (MiDaS_small)
Server FastAPI + WebSocket
Tracking Custom IoU tracker
Gestures 35-d, cosine similarity
Animations 12-d, DTW + Sakoe-Chiba
Frontend Vanilla JS/CSS