Sensors (Basel, Switzerland) 2026; 26(7): 2202; doi: 10.3390/s26072202

Multi-Modal Feature Fusion and Hierarchical Classification for Automated Equine-Human Interaction Behavior Recognition.

Abstract: Automated recognition of equine-human interaction behaviors from video represents a significant challenge in computational ethology, with critical applications spanning animal welfare assessment, equine-assisted services evaluation, and safety monitoring in equestrian environments. Existing approaches to animal behavior recognition typically focus on single species in isolation, rely solely on facial expression analysis while ignoring full-body posture, or employ flat classification architectures that fail under the severe class imbalances characteristic of naturalistic behavioral datasets. Furthermore, no prior framework integrates simultaneous analysis of both human and equine body language for cross-species interaction classification. This paper presents a novel hierarchical classification framework integrating multi-modal computer vision features to distinguish behavioral states during horse-human encounters. Our methodology employs three complementary feature extraction pipelines: YOLOv8 for spatial relationship modeling, MediaPipe for human postural analysis, and AP-10K for equine body language interpretation. From 28 annotated interaction videos comprising 50,270 temporal samples across five horse breeds, we extract 35 discriminative features capturing proximity dynamics, body orientation, and species-specific behavioral indicators. To address severe class imbalance (18.3:1 ratio between affiliative and avoidant categories), we implement cost-sensitive gradient boosting with automatic class weight optimization within a two-stage hierarchical architecture. 
The first stage classifies interactions into three parent categories (affiliative, neutral, avoidant) achieving 73.2% balanced accuracy, while stage two discriminates six fine-grained sub-behaviors achieving 88.5% balanced accuracy (under oracle parent-category routing; cascaded end-to-end performance is 62.9% balanced accuracy due to Stage 1 error propagation, identifying parent classification as the primary bottleneck). Notably, our system achieves 85.0% recall on safety-critical avoidant behaviors despite their representation of only 3.8% of the dataset. Extensive ablation studies demonstrate that equine pose features contribute most critically to classification performance, while comprehensive cross-validation analysis confirms model robustness across diverse interaction contexts. The proposed framework establishes the first systematic multimodal cross-species behavioral assessment pipeline in human-animal interaction research, with direct implications for improving equine welfare monitoring and rider safety protocols.
Publication Date: 2026-04-02
PubMed ID: 41977987
DOI: 10.3390/s26072202
The Equine Research Bank provides access to a large database of publicly available scientific literature. Inclusion in the Research Bank does not imply endorsement of study methods or findings by Mad Barn.
  • Journal Article

Summary

This research summary has been generated with artificial intelligence and may contain errors and omissions. Refer to the original study to confirm details provided.

Overview

  • This research focuses on developing an automated system that recognizes interaction behaviors between horses and humans using video data.
  • The system integrates multiple types of computer vision features within a hierarchical classification framework to identify complex behaviors accurately, despite challenges such as severe class imbalance and the complexity of cross-species interaction.

Research Background and Motivation

  • Automated recognition of animal behavior from videos is important for applications such as animal welfare, service animal evaluation, and safety monitoring in equestrian activities.
  • Previous works mainly emphasize:
    • Recognition of single-species behaviors (either human or animal alone).
    • Analysis based primarily on facial expressions, neglecting full body posture.
    • Flat classification models that struggle with naturalistic datasets containing very uneven distributions of behavior classes.
  • No prior framework focuses on simultaneous, cross-species analysis of human and horse body language during direct interactions.

Methodology

  • The paper proposes a novel hierarchical classification framework leveraging multimodal computer vision features to recognize horse-human interaction behaviors.
  • Three distinct feature extraction pipelines are combined:
    • YOLOv8: To model spatial relationships between human and horse, capturing proximity and positioning in the scene.
    • MediaPipe: To analyze human body posture, capturing movements and gestures.
    • AP-10K: To interpret equine body language by estimating horse poses and movements.
  • Data specifics:
    • 28 annotated videos of horse-human interactions were used.
    • The dataset includes 50,270 temporal samples from five different horse breeds.
    • From these samples, 35 discriminative features were extracted, describing proximity dynamics, body orientations, and species-specific behavioral signs.
  • Challenges addressed:
    • The dataset has severe class imbalance: affiliative behaviors are 18.3 times more frequent than avoidant behaviors.
    • A cost-sensitive gradient boosting algorithm with automatic class weight adjustments was implemented to mitigate this imbalance.
  • Classification Framework:
    • It is a two-stage hierarchical model:
      • Stage 1: Classifies behaviors into three broad parent categories—affiliative, neutral, and avoidant.
      • Stage 2: Further distinguishes six specific sub-behaviors within the parent categories.

Results and Performance

  • Stage 1 (parent category classification) achieved 73.2% balanced accuracy.
  • Stage 2 (fine-grained sub-behavior classification) achieved 88.5% balanced accuracy under oracle routing, i.e., when given ground-truth parent-category labels.
  • Cascading both stages end-to-end yielded 62.9% balanced accuracy, identifying Stage 1 parent classification as the primary bottleneck due to error propagation.
  • The system achieved an 85.0% recall rate for safety-critical avoidant behaviors, despite these being only 3.8% of the dataset—highlighting successful recognition of rare but crucial events.
  • Ablation studies revealed that equine pose features from AP-10K contributed most significantly to the classification accuracy.
  • Cross-validation tests confirmed robust performance across diverse contexts and horse breeds, indicating good generalizability.
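The balanced-accuracy metric quoted throughout these results is the unweighted mean of per-class recalls, which is why it is the appropriate choice for a dataset where avoidant behaviors make up only 3.8% of samples. A toy illustration (the labels here are made up, not the paper's data):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score

# Toy ground truth / predictions with a rare class 2 ("avoidant"):
# plain accuracy would be dominated by the majority class.
y_true = np.array([0]*8 + [1]*4 + [2]*2)
y_pred = np.array([0]*8 + [1]*3 + [0] + [2, 0])

per_class_recall = recall_score(y_true, y_pred, average=None)
bal_acc = balanced_accuracy_score(y_true, y_pred)

# Balanced accuracy equals the unweighted mean of per-class recalls,
# so the rare class counts as much as the majority class.
print(per_class_recall)  # [1.   0.75 0.5 ]
print(bal_acc)           # 0.75
```

Here plain accuracy would be 12/14 ≈ 0.857, but the rare class's 50% recall drags balanced accuracy down to 0.75, mirroring why the paper reports balanced accuracy and per-class recall rather than overall accuracy.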

Contributions and Significance

  • This study presents the first comprehensive multimodal and cross-species framework for automatic behavioral analysis of horse-human interactions.
  • The hierarchical two-stage classification effectively handles severe class imbalance and complex behavior taxonomies.
  • Integrating human and equine body language cues provides a richer, more nuanced understanding of the interaction dynamics than single-species or single-modality approaches.
  • Practical implications include:
    • Enhanced animal welfare monitoring by automatically detecting subtle affiliative or avoidant signals.
    • Improved safety protocols for riders and handlers through early identification of potentially unsafe behaviors.
    • Potential to aid in equine-assisted therapy and service animal program evaluations by providing objective interaction assessments.

Summary

  • The paper introduces a novel, multimodal, hierarchical classification system that effectively recognizes horse-human interaction behaviors from video data.
  • It overcomes previous limitations by combining spatial, human posture, and equine pose features and mitigating class imbalance through cost-sensitive learning.
  • Results demonstrate the method’s robustness and applicability for real-world ethology and equestrian safety use cases.

Cite This Article

APA
Arora, S., Kieson, E., Rudd, C., & Gloor, P. A. (2026). Multi-modal feature fusion and hierarchical classification for automated equine-human interaction behavior recognition. Sensors (Basel), 26(7), 2202. https://doi.org/10.3390/s26072202

Publication

ISSN: 1424-8220
NlmUniqueID: 101204366
Country: Switzerland
Language: English
Volume: 26
Issue: 7
PII: 2202

Researcher Affiliations

Arora, Samierra
  • System Design & Management, Massachusetts Institute of Technology, Cambridge, MA 02142, USA.
Kieson, Emily
  • Equine International, Boston, MA 02115, USA.
Rudd, Christine
  • Equine International, Boston, MA 02115, USA.
Gloor, Peter A
  • System Design & Management, Massachusetts Institute of Technology, Cambridge, MA 02142, USA.
  • Cologne Institute for Information Sciences, University of Cologne, 50923 Cologne, Germany.

MeSH Terms

  • Horses / physiology
  • Animals
  • Humans
  • Behavior, Animal / physiology
  • Human-Animal Interaction
  • Algorithms
  • Video Recording
