Biometric Recognition & Object-Tracking Engagement with Universal Sensing
BROTEUS is a real-time vision system that detects objects, tracks hands, recognizes gestures, and understands hand animations, all running simultaneously in a single pipeline.
The entire system runs on CPU at ~21 FPS. No GPU required.
A camera feed is processed through four parallel subsystems: YOLO-World finds objects the operator has specified, MediaPipe tracks both hands with full 3D skeleton data, a learning-first classifier identifies static hand gestures, and a DTW-based recognizer detects temporal hand animations.
Clicking a detected object triggers a grasp affordance heatmap showing optimal contact surfaces. Every subsystem runs simultaneously in real time on CPU.
Real-time object detection with a user-driven search list. Classes are added and removed on the fly.
BROTEUS uses YOLO-World, an open-vocabulary detection model. Traditional YOLO is locked to 80 COCO classes. YOLO-World accepts arbitrary text queries at runtime.
The system starts with zero classes. The operator adds object names through the UI, and BROTEUS begins searching for them. Removing a class is a single click. The search list persists to disk across restarts.
This matters because most detection systems have a hardcoded vocabulary baked into training. BROTEUS has none. The operator decides what exists in the scene.
The detection backbone is designed as a swappable module, structured for future drop-in of NVIDIA Isaac ROS (RT-DETR, FoundationPose) when GPU hardware becomes available.
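The runtime search list described above can be sketched as a small manager that persists to disk and notifies the detector whenever the vocabulary changes. This is an illustrative assumption, not BROTEUS's actual code: the class name, file format, and the commented `set_classes` hook (the runtime-vocabulary call exposed by the Ultralytics YOLO-World wrapper) are stand-ins.

```python
import json
from pathlib import Path


class SearchList:
    """Hypothetical sketch: an operator-driven class list that is
    edited at runtime and persisted across restarts."""

    def __init__(self, path="search_list.json"):
        self.path = Path(path)
        self.classes = []
        if self.path.exists():
            self.classes = json.loads(self.path.read_text())

    def add(self, name):
        if name not in self.classes:
            self.classes.append(name)
            self._save()

    def remove(self, name):
        if name in self.classes:
            self.classes.remove(name)
            self._save()

    def _save(self):
        self.path.write_text(json.dumps(self.classes))
        # With an open-vocabulary detector such as YOLO-World
        # (e.g. via Ultralytics), the updated vocabulary would be
        # pushed to the model here:
        # model.set_classes(self.classes)
```

Because the list is rewritten on every change, the on-disk state always matches what the detector is searching for, which is what makes single-click removal and restart persistence cheap.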
Dual-hand gesture recognition with independent left/right tracking. Each hand displays its gesture, action, confidence, and finger states in real time.
BROTEUS tracks up to two hands simultaneously using MediaPipe's HandLandmarker, which provides 21 3D keypoints per hand.
Left and right hands are identified and tracked independently with separate classifiers, separate memory, and separate state. Each hand pose is compressed into a 35-dimensional feature vector.
The palm orientation features are what separate this from typical "is the finger up or down" approaches. When a hand rotates, the curl angles barely change, but the palm normal flips completely. That signal is encoded.
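The palm normal can be recovered from three comparatively rigid landmarks. A minimal sketch, assuming MediaPipe's standard hand indexing (0 = wrist, 5 = index MCP, 17 = pinky MCP); the function name is illustrative:

```python
import numpy as np


def palm_normal(landmarks):
    """Unit normal of the palm plane from three stable landmarks.

    landmarks: (21, 3) array of 3D hand keypoints
    (MediaPipe indexing: 0 = wrist, 5 = index MCP, 17 = pinky MCP).
    """
    wrist, index_mcp, pinky_mcp = landmarks[0], landmarks[5], landmarks[17]
    n = np.cross(index_mcp - wrist, pinky_mcp - wrist)
    return n / np.linalg.norm(n)
```

A flat hand facing the camera and the same hand flipped over have nearly identical finger-curl angles, but the sign of the normal's z component inverts, which is exactly the rotation signal the curl features miss.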
Classification checks learned gestures first (cosine similarity against multi-sample clusters), then falls back to geometric rules for common poses like open palm, fist, or point.
Left and right hands maintain completely separate memory files. No cross-contamination.
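The learned-first, rules-second decision can be sketched as follows. The 0.92 similarity threshold, the function names, and the placeholder fallback body are assumptions; only the shape of the logic (cosine similarity against stored sample clusters, then geometric rules) comes from the text above.

```python
import numpy as np


def classify(feat, learned, threshold=0.92):
    """Two-tier gesture classification for one hand.

    feat:    35-dim feature vector (dimension from the text; contents elided)
    learned: dict mapping gesture name -> (k, 35) array of stored samples
    """
    feat = feat / np.linalg.norm(feat)
    best_name, best_sim = None, threshold
    for name, samples in learned.items():
        # Cosine similarity against every sample in the cluster;
        # keep the best match within that gesture's cluster.
        sims = samples @ feat / np.linalg.norm(samples, axis=1)
        sim = sims.max()
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_name is not None:
        return best_name
    return geometric_fallback(feat)


def geometric_fallback(feat):
    # Placeholder for the rule-based tier; the real rules would read
    # finger curls and palm orientation out of the feature vector to
    # detect open palm, fist, point, etc.
    return "unknown"
```

Keeping one classifier instance (and one `learned` dict) per hand is what gives the separate left/right memory the text describes.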
Recognizing a learned hand animation in real time. The purple banner indicates a temporal gesture match.
Static gestures capture frozen poses. Animations capture movements: a beckoning curl, a wave, a circular spin motion. This is a fundamentally different recognition problem.
BROTEUS solves it with Dynamic Time Warping (DTW). Each frame, a 12-dimensional temporal feature vector is extracted (finger curls + palm normal + position + velocity) and pushed into a sliding window covering the last ~3 seconds.
Every few frames, DTW compares that window against all stored animation recordings.
DTW handles speed variation naturally. A fast wave and a slow wave produce different frame counts but share the same underlying motion shape.
DTW warps one sequence onto the other and measures alignment cost. A Sakoe-Chiba band constraint keeps the warping physically reasonable.
Recording the same motion 2-3 times at different speeds improves matching robustness.
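The matcher can be sketched as standard band-constrained DTW over feature sequences. The band width, the auto-widening for unequal lengths, and the length normalization are assumptions of this sketch, not confirmed details of BROTEUS:

```python
import numpy as np


def dtw_distance(a, b, band=10):
    """DTW alignment cost between two sequences of feature vectors,
    constrained to a Sakoe-Chiba band of +/- `band` frames around
    the diagonal."""
    n, m = len(a), len(b)
    if band < abs(n - m):
        band = abs(n - m)        # band must at least span the length gap
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Normalize by combined length so one threshold works across
    # recordings of different durations.
    return D[n, m] / (n + m)
```

The speed invariance the text describes falls out directly: a motion sampled at 20 frames and the same motion sampled at 40 frames align cheaply, while an unrelated sequence of the same length does not.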
Clicking a detected object activates focus mode, which computes a grasp affordance heatmap. Every surface point is scored on four criteria.
Scores render as a continuous green-to-red heatmap overlay. Green indicates an ideal contact surface. Red indicates a poor one.
Depth data comes from MiDaS monocular estimation. One RGB camera, no depth sensor needed. Sobel-based surface normals are computed from the depth map for the normal alignment criterion.
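Sobel-based normals from a depth map can be sketched as below. This is a minimal version that ignores camera intrinsics and treats depth as a height field, which is an assumption of the sketch; the function name is illustrative:

```python
import numpy as np


def depth_to_normals(depth):
    """Per-pixel surface normals from a depth map (e.g. MiDaS output).

    Sobel kernels estimate dz/dx and dz/dy; the normal of the local
    surface z = f(x, y) is then proportional to (-dz/dx, -dz/dy, 1).
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    h, w = depth.shape
    pad = np.pad(depth, 1, mode="edge")
    # Cross-correlation with the 3x3 kernels via shifted slices.
    dzdx = sum(kx[i, j] * pad[i:i + h, j:j + w]
               for i in range(3) for j in range(3))
    dzdy = sum(ky[i, j] * pad[i:i + h, j:j + w]
               for i in range(3) for j in range(3))
    n = np.dstack([-dzdx, -dzdy, np.ones_like(depth)])
    return n / np.linalg.norm(n, axis=2, keepdims=True)
```

On a planar ramp the normals tilt uniformly away from the slope, which is the behavior the normal-alignment criterion needs: flat, camera-facing surfaces score well, steeply raked ones do not.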
BROTEUS is the perception layer of a modular robotics ecosystem. The target hardware is an SO-ARM 101, a 6-DOF arm where BROTEUS provides perception, CHIRON drives the joints, and DAEDALUS closes the sim-to-real gap.
BROTEUS sees. ORION decides. CHIRON moves. DAEDALUS calibrates.