Generating Descriptions for Videos based on the Interaction between a Human and Objects considering their Positional Relationship

Rino Urushihara

1:40 PM - 2:00 PM

[1J1-02] Generating Descriptions for Videos based on the Interaction between a Human and Objects considering their Positional Relationship

〇Rino Urushihara¹, Ichiro Kobayashi¹ (1. Ochanomizu University)

Keywords: natural language generation, object detection, human pose estimation

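In recent years, there has been a growing need for technology that reports human actions in words, for applications such as identifying the behavior of suspicious persons through surveillance cameras, watching over elderly people, and live sports commentary. Research on verbalizing images and videos with deep learning has accordingly been very active. However, many methods generate text directly from image or video features, and very few correctly capture and verbalize the events that a human would actually recognize when viewing the image or video, in particular human actions. In this study, we therefore use deep learning to generate descriptions that correctly capture both the human pose and the objects in a video. Specifically, we combine two processes: one extracts human pose information from each frame of the video, treats it as time-series data, and selects a word that expresses the action; the other detects objects in each frame. The video description is then generated from the results of both processes.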
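The two-branch pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the models, so the action-word selector below is a hypothetical threshold rule standing in for a learned model, the pose input is reduced to a single wrist keypoint per frame, and the positional relationship is approximated by picking the detected object whose box center is closest to the wrist. All names (`Detection`, `select_action_word`, `nearest_object`, `describe`) and the 50-pixel threshold are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from math import hypot
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Detection:
    """One detected object in a frame (hypothetical container)."""
    label: str                               # object class name
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)


def box_center(box: Tuple[float, float, float, float]) -> Point:
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def nearest_object(wrist: Point, detections: List[Detection]) -> Optional[str]:
    """Return the label of the detection whose box center is closest to the wrist."""
    if not detections:
        return None

    def distance(det: Detection) -> float:
        cx, cy = box_center(det.box)
        return hypot(cx - wrist[0], cy - wrist[1])

    return min(detections, key=distance).label


def select_action_word(wrist_track: List[Point]) -> str:
    """Stand-in for a learned action-word selector over the pose time series.

    Assumption: a vertical wrist displacement above 50 pixels means "lifts",
    anything else means "holds".
    """
    ys = [y for _, y in wrist_track]
    return "lifts" if max(ys) - min(ys) > 50 else "holds"


def describe(wrist_track: List[Point],
             detections_per_frame: List[List[Detection]]) -> str:
    """Combine the action word and the most frequently nearest object."""
    action = select_action_word(wrist_track)
    votes: Counter = Counter()
    for wrist, detections in zip(wrist_track, detections_per_frame):
        label = nearest_object(wrist, detections)
        if label is not None:
            votes[label] += 1
    if not votes:
        return f"A person {action} something."
    obj = votes.most_common(1)[0][0]
    return f"A person {action} a {obj}."
```

In a real system the wrist track would come from a pose estimator and the detections from an object detector run on every frame; the template sentence here only shows where the two results meet.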

Presentation information

[1J1] [General Session] 9. NLP / IR
