The goal of human-object interaction (HOI) detection is to localize the human and the object in an image and to recognize the interaction between them. HOIs are often scattered across the image, and traditional CNN-based methods cannot aggregate this scattered information. Many newer methods rely on contextual features cropped from CNN outputs, which are not always effective. To address this challenge, we use a deformable transformer to aggregate the full feature maps output by the CNN. The attention mechanism and query-based prediction are the keys: given the success of methods based on graph neural networks, attention has proved effective for aggregating contextual information image-wide, and the queries extract the features of each human-object pair without mixing in the features of other instances. Because the deformable transformer extracts effective embeddings, the prediction heads can remain fairly simple. Experimental results show that the proposed method is effective for HOI detection.
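As a concrete illustration of the query-based design described above, the following is a minimal PyTorch sketch of prediction heads operating on learnable HOI queries. Standard multi-head attention stands in for deformable attention, and all module names, dimensions, and class counts are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of query-based HOI prediction heads (assumes PyTorch).
# Standard attention substitutes for deformable attention; names and sizes
# are illustrative assumptions only.
import torch
import torch.nn as nn


class QueryBasedHOIHead(nn.Module):
    def __init__(self, d_model=256, num_queries=100,
                 num_obj_classes=80, num_verb_classes=117):
        super().__init__()
        # Learnable HOI queries: each query is decoded into one human-object pair.
        self.query_embed = nn.Embedding(num_queries, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8,
                                                   batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
        # Simple prediction heads on top of the decoded query embeddings.
        self.human_box = nn.Linear(d_model, 4)       # (cx, cy, w, h)
        self.object_box = nn.Linear(d_model, 4)
        self.object_cls = nn.Linear(d_model, num_obj_classes + 1)  # +1 no-object
        self.verb_cls = nn.Linear(d_model, num_verb_classes)

    def forward(self, memory):
        # memory: flattened CNN feature map, shape (batch, H*W, d_model)
        queries = self.query_embed.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        hs = self.decoder(queries, memory)           # (batch, num_queries, d_model)
        return {
            "human_boxes": self.human_box(hs).sigmoid(),
            "object_boxes": self.object_box(hs).sigmoid(),
            "object_logits": self.object_cls(hs),
            "verb_logits": self.verb_cls(hs),
        }


# Usage with random features standing in for a CNN backbone output.
feat = torch.randn(2, 49, 256)                       # e.g. a 7x7 feature map
out = QueryBasedHOIHead()(feat)
print(out["human_boxes"].shape)                      # torch.Size([2, 100, 4])
```

Because each query decodes into one complete human-object pair, the heads themselves can stay as plain linear layers, which is the simplicity the abstract refers to.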
In recent years, deep neural networks have made impressive progress in object detection, but detecting the interactions between objects remains challenging. Many researchers therefore focus on human-object interaction (HOI) detection as a basic task in detailed scene understanding. Most conventional HOI detectors follow a two-stage pipeline and are usually slow at inference. One-stage methods that detect HOI triplets directly and in parallel break through the limitations of object detection, but the features they extract are still insufficient. To overcome these drawbacks, we propose an improved one-stage HOI detection approach in which an attention aggregation module and a dynamic point-matching strategy play key roles. The attention aggregation module explicitly enhances the semantic expressiveness of interaction points by aggregating contextually important information, while the matching strategy effectively filters negative HOI pairs at inference time. Extensive experiments on two challenging HOI detection benchmarks, VCOCO and HICO-DET, show that our method achieves performance comparable to the state of the art without any additional human pose or language features.
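The dynamic point-matching idea can be illustrated with a short sketch: an HOI pair is kept only if its predicted interaction point lies close to the midpoint of the human and object centers, with a tolerance that scales with box size. The function name, the radius formula, and the default threshold are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of a dynamic point-matching filter (assumes PyTorch).
# The radius formula and alpha value are illustrative assumptions.
import torch


def match_pairs(inter_points, human_boxes, object_boxes, alpha=0.5):
    """Keep HOI pairs whose interaction point lies near the midpoint of the
    human and object centers, with a tolerance that grows with box size.

    inter_points: (N, 2) predicted interaction points (x, y)
    human_boxes, object_boxes: (N, 4) boxes as (cx, cy, w, h)
    Returns a boolean mask of pairs to keep.
    """
    h_center = human_boxes[:, :2]
    o_center = object_boxes[:, :2]
    midpoint = (h_center + o_center) / 2
    # Dynamic radius: a fraction of the summed half-diagonals of the pair.
    half_diag = (torch.sqrt((human_boxes[:, 2:] ** 2).sum(-1)) / 2
                 + torch.sqrt((object_boxes[:, 2:] ** 2).sum(-1)) / 2)
    radius = alpha * half_diag / 2
    dist = torch.linalg.norm(inter_points - midpoint, dim=-1)
    return dist <= radius


# Toy example: only the first interaction point lies near its pair's midpoint.
pts = torch.tensor([[5.0, 5.0], [40.0, 40.0]])
hb = torch.tensor([[2.0, 2.0, 4.0, 4.0], [2.0, 2.0, 4.0, 4.0]])
ob = torch.tensor([[8.0, 8.0, 4.0, 4.0], [8.0, 8.0, 4.0, 4.0]])
print(match_pairs(pts, hb, ob))                      # tensor([ True, False])
```

Scaling the matching radius with the pair's box sizes is what makes the rule dynamic: large human-object pairs tolerate larger offsets, while small pairs are filtered more strictly.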