Article; National Production; Peer-reviewed

Harnessing high-level concepts, visual, and auditory features for violence detection in videos

2021; Elsevier BV; Volume: 78; Language: English

10.1016/j.jvcir.2021.103174

ISSN

1095-9076

Authors

Bruno Malveira Peixoto, Bahram Lavi, Zanoni Dias, Anderson Rocha

Topic(s)

Video Surveillance and Tracking Methods

Abstract

Among the categories of sensitive media, violence is one of the hardest to define objectively and is therefore a significant challenge to detect automatically. While many studies have addressed individual aspects of violence, very few attempt to approach the general concept. We propose a method that aims to enable machines to understand the high-level concept of violence by first breaking it down into smaller, more objective sub-concepts, such as fights, explosions, blood, and gunshots, and later combining them for a better understanding of the scene. To this end, we leverage the characteristics of each sub-concept (relying on custom-tailored convolutional neural networks) to guide how it should be described. For example, a fight scene should incorporate temporal features that a scene with blood does not need, whereas a scene with explosions or gunshots should weigh its audio features more heavily. Following this multimodal approach, we trained visual and auditory feature detectors and later combined them in a decision neural network, yielding a violence detector that considers several different aspects of the problem. This robust and modular approach allows different cultures and users to adapt the detector to their specific needs.
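
As a rough illustration of the late-fusion idea in the abstract, the sketch below combines per-sub-concept detector scores into a single violence probability. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not describe the decision network's architecture, so a single logistic unit stands in for it, and the function name `fuse_scores`, the weights, the bias, and the example scores are all hypothetical values chosen for illustration.

```python
import math

# Sub-concept names follow the examples listed in the abstract.
SUB_CONCEPTS = ("fights", "explosions", "blood", "gunshots")


def sigmoid(x: float) -> float:
    """Logistic function mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_scores(scores: dict, weights: dict, bias: float) -> float:
    """Combine per-concept detector scores into one violence probability.

    Assumption: a single logistic unit stands in for the decision
    neural network, whose real architecture the abstract does not give.
    """
    z = bias + sum(weights[c] * scores[c] for c in SUB_CONCEPTS)
    return sigmoid(z)


# Hypothetical weights and bias, made up for illustration (not learned).
example_weights = {"fights": 2.0, "explosions": 1.5, "blood": 1.0, "gunshots": 2.5}
example_bias = -2.0

# Hypothetical detector outputs in [0, 1] for two scenes.
calm = {c: 0.0 for c in SUB_CONCEPTS}
shootout = {"fights": 0.2, "explosions": 0.1, "blood": 0.6, "gunshots": 0.9}

p_calm = fuse_scores(calm, example_weights, example_bias)
p_shootout = fuse_scores(shootout, example_weights, example_bias)
print(f"calm: {p_calm:.3f}, shootout: {p_shootout:.3f}")
```

In this toy setup, setting one sub-concept's weight to zero removes its influence on the final score, which loosely mirrors the modularity the abstract describes for adapting the detector to different users.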
