ORION / PERCEPTION_LAYER

BROTEUS

Biometric Recognition & Object-Tracking Engagement with Universal Sensing

BROTEUS is a real-time vision system that detects objects, tracks hands, recognizes static gestures, and matches temporal hand animations, all running simultaneously in a single pipeline.

The entire system runs on CPU at ~21 FPS. No GPU required.

21 FPS · CPU Throughput
87% · Confidence
42 · Hand Keypoints
35-d · Feature Vector
Project Classification
Perception Pipeline
IN_DEVELOPMENT
Role Perception Layer
Port localhost:8100
Ecosystem ORION
Detector YOLO-World
Depth MiDaS (mono)
Hands MediaPipe
Language Python 3.11
Timeline Jan 2026 - Present
Built By
David Young
+ Swan Yi Htet
01 / Pipeline Overview

What It Does

A camera feed is processed through four parallel subsystems: YOLO-World finds objects the operator has specified, MediaPipe tracks both hands with full 3D skeleton data, a learning-first classifier identifies static hand gestures, and a DTW-based recognizer detects temporal hand animations.

Clicking a detected object triggers a grasp affordance heatmap showing optimal contact surfaces. Every subsystem runs simultaneously in real-time on CPU.

Parallel Subsystems
YOLO-World open-vocab detection
MediaPipe dual-hand 3D
Gesture (35-d) learning-first
Animation (DTW) temporal match
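
In code, the per-frame loop has roughly this shape. A minimal sketch with stub subsystems standing in for the real modules; the function names are illustrative, not BROTEUS's actual API:

```python
import cv2

# Stub subsystems standing in for the real modules, to show the loop shape.
def detect_objects(frame):        # YOLO-World, operator-defined classes
    return []

def track_hands(frame):           # MediaPipe, up to two hands
    return []

def classify_gesture(hand):       # 35-d static pose classifier
    return None

def push_animation_frame(hand):   # append 12-d features to the sliding window
    pass

def match_animation():            # DTW against stored recordings
    return None

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    objects = detect_objects(frame)
    for hand in track_hands(frame):
        gesture = classify_gesture(hand)
        push_animation_frame(hand)
    motion = match_animation()
cap.release()
```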
02 / Open-Vocabulary Detection
Demo video: obj_detect.mp4

Object Detection

Real-time object detection with a user-driven search list. Classes are added and removed on the fly.

BROTEUS uses YOLO-World, an open-vocabulary detection model. Traditional YOLO is locked to 80 COCO classes. YOLO-World accepts arbitrary text queries at runtime.

The system starts with zero classes. The operator adds object names through the UI, and BROTEUS begins searching for them. Removing a class is a single click. The search list persists to disk across restarts.

This matters because most detection systems have a hardcoded vocabulary baked into training. BROTEUS has none. The operator decides what exists in the scene.
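
With the Ultralytics YOLO-World API, the runtime vocabulary swap looks roughly like this. A sketch: the class names and confidence threshold are illustrative, and the UI wiring and disk persistence are not shown:

```python
from ultralytics import YOLOWorld

# Open-vocabulary checkpoint named in the tech stack.
model = YOLOWorld("yolov8s-worldv2.pt")

# The operator's search list: arbitrary text queries, editable at runtime.
search_list = ["coffee mug", "screwdriver", "keyboard"]
model.set_classes(search_list)

results = model.predict("frame.jpg", conf=0.25)

# Removing a class is just another set_classes call with the updated list.
search_list.remove("keyboard")
model.set_classes(search_list)
```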

Confidence 79-87%
Throughput 21 FPS (CPU)
Tracking IoU · Persistent IDs
Future Hardware

The detection backbone is a swappable module, structured so NVIDIA Isaac ROS components (RT-DETR, FoundationPose) can drop in when GPU hardware becomes available.

03 / Static Gesture Recognition
Demo video: gesture.mp4

Gesture Recognition

Dual-hand gesture recognition with independent left/right tracking. Each hand displays its gesture, action, confidence, and finger states in real-time.

BROTEUS tracks up to two hands simultaneously using MediaPipe's HandLandmarker, which provides 21 three-dimensional keypoints per hand, 42 in total.

Left and right hands are identified and tracked independently with separate classifiers, separate memory, and separate state. Each hand pose is compressed into a 35-dimensional feature vector.

Rotation-invariant encoding
Feature Vector Breakdown

The 35-Dimensional Encoding

35-d per hand pose:
5   finger curl angles
5   tip-to-palm distances
10  inter-tip distances
5   z-depth ratios
4   thumb proximity
3   palm normal
3   palm direction

The palm orientation features are what separate this from typical "is the finger up or down" approaches. When a hand rotates, the curl angles barely change, but the palm normal flips completely. That signal is encoded.
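
One plausible derivation of the palm normal from MediaPipe's fixed landmark indices (a sketch, not BROTEUS's exact code): take the cross product of two vectors spanning the palm.

```python
import numpy as np

# MediaPipe hand landmark indices (fixed by the model).
WRIST, INDEX_MCP, PINKY_MCP = 0, 5, 17

def palm_normal(landmarks: np.ndarray) -> np.ndarray:
    """Unit normal of the palm plane from a (21, 3) landmark array.

    The normal flips sign as the hand rotates over, which is the
    orientation signal that curl angles alone cannot provide.
    """
    v1 = landmarks[INDEX_MCP] - landmarks[WRIST]
    v2 = landmarks[PINKY_MCP] - landmarks[WRIST]
    n = np.cross(v1, v2)
    return n / (np.linalg.norm(n) + 1e-9)
```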

Teaching a Gesture
  1. Hold a hand pose in front of the camera
  2. Press Record to start capturing feature samples
  3. Slowly rotate the hand while holding the pose, capturing across multiple orientations
  4. The UI displays a live rotation coverage percentage (one possible metric is sketched below)
  5. Press Stop, name the gesture, assign an ORION action
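
The coverage percentage could be computed along these lines. This is an illustrative guess at the metric (the bin count and azimuth binning are assumptions), not the actual implementation:

```python
import numpy as np

def rotation_coverage(palm_normals: list[np.ndarray], bins: int = 12) -> float:
    """Fraction of orientation bins visited during recording (hypothetical metric).

    Bins the palm normal's azimuth angle; 1.0 means the pose was shown
    from all the way around.
    """
    hit = set()
    for n in palm_normals:
        azimuth = np.arctan2(n[1], n[0])  # angle in the x-y plane
        hit.add(int((azimuth + np.pi) / (2 * np.pi) * bins) % bins)
    return len(hit) / bins
```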
Classification Strategy

Classification checks learned gestures first (cosine similarity against multi-sample clusters), then falls back to geometric rules for common poses like open palm, fist, or point.

Left and right hands maintain completely separate memory files. No cross-contamination.
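
In sketch form, the learned-first strategy looks like this. The similarity threshold and the `geometric_fallback` stub are assumptions, not BROTEUS's actual values:

```python
import numpy as np

def geometric_fallback(feat: np.ndarray) -> str:
    # Stand-in for the rule-based layer (open palm, fist, point, ...).
    return "unknown"

def classify(feat: np.ndarray, learned: dict[str, np.ndarray],
             threshold: float = 0.92) -> str:
    """Learned gestures first, geometric rules second.

    `learned` maps gesture name -> (k, 35) array of recorded samples.
    """
    f = feat / (np.linalg.norm(feat) + 1e-9)
    best_name, best_sim = None, threshold
    for name, samples in learned.items():
        s = samples / (np.linalg.norm(samples, axis=1, keepdims=True) + 1e-9)
        sim = float(np.max(s @ f))  # best cosine match within the cluster
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_name is not None else geometric_fallback(feat)
```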

04 / Temporal Motion Recognition
Demo video: animation.mp4

Animation Recognition

Recognizing a learned hand animation in real-time. The purple banner indicates a temporal gesture match.

Static gestures capture frozen poses. Animations capture movements: a beckoning curl, a wave, a circular spin motion. This is a fundamentally different recognition problem.

BROTEUS solves it with Dynamic Time Warping (DTW). Each frame, a 12-dimensional temporal feature vector is extracted (finger curls + palm normal + position + velocity) and pushed into a sliding window covering the last ~3 seconds.

Every few frames, DTW compares that window against all stored animation recordings.

Why DTW

DTW handles speed variation naturally. A fast wave and a slow wave produce different frame counts but share the same underlying motion shape.

DTW warps one sequence onto the other and measures alignment cost. A Sakoe-Chiba band constraint keeps the warping physically reasonable.
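
A compact sketch of the banded comparison; the band radius and length normalization are illustrative choices, not BROTEUS's exact parameters:

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray, band: int = 15) -> float:
    """DTW alignment cost between two (T, 12) feature sequences.

    The Sakoe-Chiba band only allows cells with |i - j| <= band, keeping
    the warp path near the diagonal (length-difference scaling omitted).
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m]) / (n + m)  # length-normalized alignment cost
```

A stored animation would count as a match when this normalized cost falls below a threshold.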

Teaching an Animation
  1. Press Record
  2. Perform the motion for 2-3 seconds
  3. Press Stop, name it

Recording the same motion 2-3 times at different speeds improves matching robustness.

05 / Grasp Affordance

Grasp Intelligence

Clicking a detected object activates focus mode, which computes a grasp affordance heatmap. Every surface point is scored on four criteria.

Scores render as a continuous green-to-red heatmap overlay. Green indicates an ideal contact surface. Red indicates a poor one.

Depth data comes from MiDaS monocular estimation. One RGB camera, no depth sensor needed. Sobel-based surface normals are computed from the depth map for the normal alignment criterion.

01 Normal Alignment · Can a gripper approach from this angle?
02 Depth Consistency · Is this a flat, stable region?
03 Edge Proximity · Distance from the object boundary
04 Centroid Balance · Is the grasp centered for stability?
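
The normal-alignment input can be sketched directly from the depth map via Sobel gradients; scaling against real camera intrinsics is omitted here:

```python
import cv2
import numpy as np

def surface_normals(depth: np.ndarray) -> np.ndarray:
    """Per-pixel unit normals from an (H, W) depth map via Sobel gradients."""
    depth = depth.astype(np.float32)
    dzdx = cv2.Sobel(depth, cv2.CV_32F, 1, 0, ksize=3)
    dzdy = cv2.Sobel(depth, cv2.CV_32F, 0, 1, ksize=3)
    normals = np.dstack((-dzdx, -dzdy, np.ones_like(depth)))
    return normals / (np.linalg.norm(normals, axis=2, keepdims=True) + 1e-9)
```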
06 / System Architecture

Architecture

BROTEUS Server (FastAPI · Port 8100)
  YOLO-World · Detection
  MediaPipe · Dual Hands · Gesture 35-d · Anim 12-d + DTW
  MiDaS · Depth
  IoU Object Tracker · Persistent IDs · Class-vote stability
  Grasp Affordance Scorer
  WebSocket Frame Streaming
        ↓ WebSocket / JSON
Live Dashboard (Browser · localhost:8100)
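
The dashboard link reduces to a single WebSocket endpoint. A minimal FastAPI sketch; the route and JSON field names are illustrative, not BROTEUS's actual schema:

```python
import asyncio

from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws/frames")
async def stream(ws: WebSocket):
    await ws.accept()
    while True:
        payload = {
            "objects": [],      # YOLO-World detections with persistent IDs
            "hands": [],        # per-hand gesture, action, confidence
            "animation": None,  # current DTW match, if any
        }
        await ws.send_json(payload)
        await asyncio.sleep(1 / 21)  # pace at the pipeline's ~21 FPS
```

Served with uvicorn on port 8100, the browser dashboard connects and renders each JSON frame.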
07 / Ecosystem Context

Part of ORION

BROTEUS is the perception layer of a modular robotics ecosystem. The target hardware is an SO-ARM 101, a 6-DOF arm where BROTEUS provides perception, CHIRON drives the joints, and DAEDALUS closes the sim-to-real gap.

ATHENA · Navigation · Pathfinding · Terrain
BROTEUS · Perception · Grasp intel · Gestures
CHIRON · Motor cortex · IK solver · Sequencer
DAEDALUS · Physics discovery · SINDy
RL PIPELINE · PPO / SAC · ONNX · 50-200 Hz

BROTEUS sees. ORION decides. CHIRON moves. DAEDALUS calibrates.

08 / Technical Stack

Tech Stack

Detection YOLO-World (yolov8s-worldv2)
Hands MediaPipe HandLandmarker
Depth MiDaS (MiDaS_small)
Server FastAPI + WebSocket
Tracking Custom IoU tracker
Gestures 35-d, cosine similarity
Animations 12-d, DTW + Sakoe-Chiba
Frontend Vanilla JS/CSS