Exploring Object Status Recognition for Recipe Progress Tracking in Non-Visual Cooking

Franklin Mingzhe Li, Kaitlyn Ng, Bin Zhu, Patrick Carrington · 2025 · ASSETS 2025: 27th International ACM SIGACCESS Conference on Computers and Accessibility · doi:10.1145/3663547.3746318

Summary

This paper presents OSCAR (Object Status Context Awareness for Recipes), a technical pipeline that uses object status recognition—tracking the condition and transformation of ingredients and tools—to support recipe progress tracking for blind and low vision (BLV) cooks. Unlike existing cooking assistive tools that simply read recipe steps linearly through screen readers or voice assistants, OSCAR reasons about the physical state of the cooking environment by analyzing what has changed visually. The pipeline integrates four components: recipe parsing and object status extraction using GPT-4o, visual alignment between cooking frames and recipe steps using vision-language models (CLIP and SigLIP), similarity metric logging, and a time-causal model that enforces temporal coherence to prevent out-of-order predictions. OSCAR was evaluated on two datasets: 173 instructional cooking videos from the YouCook2 dataset and a novel real-world dataset of 12 non-visual cooking sessions recorded by BLV individuals (6 male, 6 female, average age 43; 4 legally blind, 8 totally blind) cooking in their own kitchens using chest-mounted GoPro cameras. The real-world dataset captures authentic non-visual cooking practices including tactile exploration, tool substitution, spatial memorization, and non-linear workflows that differ substantially from sighted cooking shown in instructional videos.
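The role of the temporal-coherence stage can be illustrated with a minimal sketch. It is not the authors' time-causal model: it assumes a precomputed frame-by-step similarity matrix (for example, cosine similarities between frame embeddings and object-status text embeddings from CLIP or SigLIP) and swaps in a simple dynamic program that forbids the predicted step index from moving backwards. The function names and the toy scores are hypothetical.

```python
from typing import List


def independent_steps(sim: List[List[float]]) -> List[int]:
    """Baseline: pick the best-matching step for each frame on its own."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


def monotonic_steps(sim: List[List[float]]) -> List[int]:
    """Pick one step per frame, maximizing total similarity, with the
    constraint that step indices never decrease over time.

    Illustrative stand-in for a temporal-coherence constraint; the paper's
    actual time-causal model may work differently.
    """
    n_frames, n_steps = len(sim), len(sim[0])
    # best[t][s]: best total score for frames 0..t when frame t is in step s
    best = [[0.0] * n_steps for _ in range(n_frames)]
    # back[t][s]: step chosen for frame t-1 on that best path
    back = [[0] * n_steps for _ in range(n_frames)]
    best[0] = list(sim[0])

    for t in range(1, n_frames):
        run_best, run_arg = best[t - 1][0], 0
        for s in range(n_steps):
            # Running maximum over all previous steps <= s
            if best[t - 1][s] > run_best:
                run_best, run_arg = best[t - 1][s], s
            best[t][s] = run_best + sim[t][s]
            back[t][s] = run_arg

    # Trace back from the best final step
    s = max(range(n_steps), key=best[-1].__getitem__)
    path = [s]
    for t in range(n_frames - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    path.reverse()
    return path


# Toy similarity scores (made up): 4 frames x 3 recipe steps.
# Frame 2 is "noisy": an earlier ingredient is still in view, so step 0 scores highest.
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.6, 0.5, 0.1],
    [0.1, 0.2, 0.8],
]

print(independent_steps(sim))  # [0, 1, 0, 2]  (jumps back to step 0)
print(monotonic_steps(sim))    # [0, 1, 1, 2]  (out-of-order jump removed)
```

A strictly non-decreasing constraint is a deliberate simplification here; the non-linear workflows observed in the real-world dataset suggest a deployed system would need a softer penalty rather than a hard rule.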

Key findings

Object status recognition consistently improved step prediction accuracy by more than 20 percentage points across both datasets and both vision-language models. On instructional videos, OSCAR improved CLIP accuracy from 41.7% to 68.0% and SigLIP from 62.2% to 82.8%. On the real-world non-visual cooking dataset, the gains were of similar size: CLIP rose from 33.7% to 58.4% and SigLIP from 41.9% to 66.7%. The substantial performance drop between instructional and real-world datasets reveals critical challenges for vision-based assistive systems: BLV cooks engage in prolonged tactile exploration and object verification that models misinterpret as task completion; they substitute tools (butter knives for spatulas, fingers for spreading); they pre-prepare ingredients and keep previously processed items in view, confusing frame-by-frame models; and chest-mounted cameras frequently capture suboptimal angles with poor lighting and occlusion. The paper identifies five design considerations: systems must accommodate implicit preparatory tasks not in recipes, distinguish exploratory touch from actual cooking progress, handle variable lighting conditions, support hands-free robust camera framing, and reason about temporal context when pre-prepared ingredients remain in the visual field.
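The reported accuracies are easiest to compare as absolute differences in percentage points. The short calculation below uses only the figures quoted above; the variable names and dataset labels are illustrative.

```python
# Accuracy (%) without and with object status recognition, as reported above.
results = {
    ("instructional", "CLIP"): (41.7, 68.0),
    ("instructional", "SigLIP"): (62.2, 82.8),
    ("real-world", "CLIP"): (33.7, 58.4),
    ("real-world", "SigLIP"): (41.9, 66.7),
}

# Absolute gain in percentage points for each dataset/model pair.
gains = {
    key: round(with_status - baseline, 1)
    for key, (baseline, with_status) in results.items()
}

for (dataset, model), gain in gains.items():
    print(f"{dataset:13s} {model:6s} +{gain} points")

# Drop from instructional to real-world data, with object status enabled.
drops = {
    model: round(
        results[("instructional", model)][1] - results[("real-world", model)][1], 1
    )
    for model in ("CLIP", "SigLIP")
}
print(drops)
```

The gains range from 20.6 to 26.3 points, while the gap between instructional and real-world accuracy with object status enabled is 9.6 points for CLIP and 16.1 points for SigLIP.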

Relevance

This work makes a compelling case that tracking what happens to ingredients (their status changes) is more robust than tracking what tools are used or what actions are performed, a principle with broad implications for assistive technology beyond cooking. The concept of object status as a "universal design primitive" is particularly powerful, applicable to domains like crafting, cleaning, and home repair where progress is defined by material transformation. For practitioners developing AI-powered assistive tools, the performance gap between curated instructional data and real-world BLV user data (SigLIP baseline accuracy dropping from 62.2% to 41.9%, and CLIP from 41.7% to 33.7%) serves as a stark warning about the limitations of training and evaluating on idealized datasets. The publicly released non-visual cooking dataset provides a valuable resource for the accessibility research community. The design considerations around camera placement, exploratory behaviors, and temporal reasoning offer concrete guidance for building more resilient context-aware assistive systems.

Tags: blind and low vision · cooking accessibility · context awareness · object recognition · computer vision · vision-language models · assistive technology · recipe tracking · non-visual interaction