
ORIGINAL RESEARCH article

Front. Robot. AI, 30 September 2020
Sec. Smart Sensor Networks and Autonomy
Volume 7 - 2020 | https://doi.org/10.3389/frobt.2020.475767

A Semantics-Assisted Video Captioning Model Trained With Scheduled Sampling

  • 1The State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Beijing National Research Center for Information Science and Technology, Institute for Artificial Intelligence, Tsinghua University, Beijing, China
  • 2Samsung Research China, Beijing, China
  • 3Department of Neurophysiology and Pathophysiology, University Medical Center, Hamburg, Germany

Given the features of a video, recurrent neural networks can be used to automatically generate a caption for the video. Existing methods for video captioning have at least three limitations. First, semantic information has been widely applied to boost the performance of video captioning models, but existing networks often fail to provide meaningful semantic features. Second, the Teacher Forcing algorithm is often utilized to optimize video captioning models, but during training and inference, different strategies are applied to guide word generation, leading to poor performance. Third, current video captioning models are prone to generate relatively short captions that express video contents inadequately. Toward resolving these three problems, we suggest three corresponding improvements. First of all, we propose a metric to compare the quality of semantic features, and utilize appropriate features as input for a semantic detection network (SDN) with adequate complexity in order to generate meaningful semantic features for videos. Then, we apply a scheduled sampling strategy that gradually transfers the training phase from a teacher-guided manner toward a more self-teaching manner. Finally, the ordinary logarithm probability loss function is leveraged by sentence length so that the tendency to generate short sentences is alleviated. Our model achieves better results than previous models on the YouTube2Text dataset and is competitive with the previous best model on the MSR-VTT dataset.

1. Introduction

Video captioning aims to automatically generate a concise and accurate description for a video. It requires techniques from both computer vision (CV) and natural language processing (NLP). Deep learning (DL) methods for sequence-to-sequence learning are able to learn the mapping from discrete color arrays to dense vectors, which is utilized to generate natural language sequences without human intervention. These methods have produced impressive results on this task compared with results based on manually crafted features.

It has gained increasing attention in video captioning that the semantic meaning of a video is critical and beneficial for an RNN to generate annotations (Pan et al., 2016; Gan et al., 2017). Keeping semantic consistency between video content and video description helps to refine a generated sentence in semantic richness (Gao et al., 2017). But few studies have explored methods to obtain video semantic features, metrics to measure their quality, or the relation between video captioning performance and the meaningfulness of semantic features.

Several training strategies have been used to optimize video captioning models, such as the Teacher Forcing algorithm and CIDEnt-RL (Pasunuru and Bansal, 2017b). The Teacher Forcing algorithm is a simple and intuitive way to train RNNs, but it suffers from the discrepancy between training, which utilizes ground truth to guide word generation at each step, and inference, which samples from the model itself at each step. Reinforcement learning (RL) techniques have also been adopted to improve the training process of video captioning. CIDEnt-RL is one of the best RL algorithms, but it is extremely time-consuming to calculate metrics for every batch. In addition, the improvement on different metrics is unbalanced: the improvements on other metrics are not as large as that on the specific metric optimized directly.

The commonly used loss function for video captioning is comprised of the logarithm of probabilities of target correct words (Donahue et al., 2015; Venugopalan et al., 2015). A long sentence tends to bring high loss to the model, as each additional word reduces the joint probability by roughly at least one order of magnitude. In contrast, a short sentence with few words has a relatively low loss. Thus, a video captioning model is prone to generate short sentences after being optimized by a log likelihood loss function. Excessively short annotations may neither be able to describe a video accurately nor express the content of a video in a rich language.

We propose to improve solutions to the video captioning task in three aspects. First, we use mean average precision (mAP) as the metric to evaluate the quality of semantic information. By virtue of this metric, we build our semantic detection network (SDN) with a proper scale and the input features that bring the best performance; consequently, the SDN is able to produce meaningful and accurate semantic features for a video. Second, we take advantage of a scheduled sampling method to train our video captioning model, which searches extreme points in the RNN state space more extensively and bridges the gap between the training process and inference (Bengio et al., 2015). Third, we optimize our model with a sentence-length-modulated loss function, which encourages the model to generate longer captions with more detail.

Our implementation, available on GitHub1, is based on the TensorFlow deep learning framework.

2. Related Works

2.1. Image Captioning

The encoder-decoder paradigm has been widely applied by researchers in image captioning since it was introduced to machine translation (Cho et al., 2014). It has become a mainstream method in both image captioning and machine translation (Mao et al., 2014; Vinyals et al., 2015). Inspired by successful attempts to employ attention in machine translation (Bahdanau et al., 2015) and object detection (Ba et al., 2015), models that are able to attend to key elements in an image are investigated for the purpose of generating high-quality image annotations. Semantic features (You et al., 2016) and object features (Anderson et al., 2018) are incorporated into attention mechanisms as heuristic information to guide selective and dynamic attendance of salient segments in images. RL techniques, which optimize specific metrics of a model directly, are also adopted to enhance the performance of image captioning models (Rennie et al., 2017). Graph Convolutional Networks (GCNs) have been introduced to cooperate with RNN to integrate both semantic and spatial information into image encoders in order to generate efficient representations of an image (Yao et al., 2018). Stimulated by the success of the Transformer model in machine translation, researchers extend it to a multimodal model for image captioning (Yu et al., 2019), which utilizes multi-view visual features to further improve the performance. Multi-level relationships between image regions are learnt and both low- and high-level features are exploited at the decoding stage in the Meshed Transformer with memory for image captioning (Cornia et al., 2019).

2.2. Video Captioning

Though both image captioning and video captioning are multi-modal tasks, video captioning is arguably the harder of the two, as videos exhibit not only spatial features but also temporal correlations.

Following the successful adoption of the encoder-decoder paradigm in image captioning, multimodal features of videos are fed into a sequence-to-sequence model to generate video descriptions with the assistance of models pretrained on image classification (Donahue et al., 2015; Venugopalan et al., 2015). In order to alleviate the semantic inconsistency between the video content and the generated caption, visual features and semantic features of a video are mapped to a common embedding space so that semantic consistency may be achieved by minimizing the Euclidean distance between these two embedded features (Pan et al., 2016). A model named POS generates video captions with Part-of-Speech (POS) information and multiple representations of video clips (Wang et al., 2019a). MARN exploits a memory structure to explore the relation between a word and its various visual contexts across the training data (Pei et al., 2019). JSRL-VCT manages to generate video descriptions by incorporating visual representations and syntax representations (Hou et al., 2019). GRU-EVE captures rich temporal dynamics in video features with the Short Fourier Transform and extracts semantic information from an object detector (Aafaq et al., 2019). Zheng et al. (2020) propose a Syntax-Aware Action Targeting (SAAT) component to learn an action and its subjects that exist in a video for better semantic consistency in captioning.

RNN, especially LSTM, can be extended by integrating high-level tags or attributes of video with visual features of the video through embedding and element-wise addition/multiplication operations (Gan et al., 2017). Yu et al. (2016) exploit a sentence generator that is built upon an RNN module to model language, a multimodal layer to integrate different modal information, and an attention module to dynamically select salient features from the input. The output of a sentence generator is fed into a paragraph generator for describing a relatively long video with several sentences.

Following the attention mechanism introduced by Xu et al. (2015), Gao et al. (2017) capture the salient structure of video with the help of visual features of the video and context information provided by LSTM. Although bottom-up (Anderson et al., 2018) and top-down attention (Ramanishka et al., 2017) have been proposed for image captioning, selectively focusing on salient regions in an image is, to some extent, similar to picking key frames in a video (Chen et al., 2018). Wang et al. (2018) explore crossmodal attention at different granularity levels and capture global temporal structures as well as local temporal structures implied in multimodal features to assist the generation of video captions.

Due to the lack of labeled video data and the abundance of unlabeled video data, Pasunuru and Bansal (2017a) and Sun et al. (2019) propose to improve video captioning with self-supervised or unsupervised learning tasks, such as unsupervised video prediction, entailment generation, and text-to-video generation. Pasunuru and Bansal (2017a) demonstrate that multi-task training contributes to sharing knowledge across different domains, and each task, including video captioning, benefits from the training of the other, seemingly unrelated tasks. Sun et al. (2019) take advantage of the abundance of unlabeled videos on YouTube and train the BERT model introduced in Devlin et al. (2018) on comparably large-scale videos, which is then used as a feature extractor for video captioning. A large amount of pre-training data is critical to BERT models both in video captioning and machine translation (Devlin et al., 2018; Sun et al., 2019). By aggregating different experts on different known activities, Wang et al. (2019b) take advantage of external textual corpora and transfer knowledge to unseen data for zero-shot video captioning. A spatio-temporal graph model is built to find object interactions, and a knowledge distillation mechanism is proposed to increase the stability of performance (Pan et al., 2020).

2.3. RNN Training Strategy

The traditional method to train an RNN is the Teacher Forcing algorithm (Williams and Zipser, 1989), which feeds human annotations to the RNN as input at each step to guide the token generation during training and samples a token from the model itself as input during inference. The different sources of input tokens during training and inference lead to the inability of the model to generate high-quality tokens in inference, as errors may accumulate along the sequence generation.

Bengio et al. (2015) propose to switch gradually from guiding generation by true tokens to feeding sampled tokens during training, which helps RNN models adapt to the inference scheme in advance. It has been applied to image captioning and speech recognition. Inspired by Huszar (2015), who mathematically proves that both the Teacher Forcing algorithm and Curriculum Learning have a tendency to learn a biased model, Goyal et al. (2016) solve the problem by adopting an adversarial domain method to align the dynamics of the RNN during training and inference. Zhang et al. (2020) propose an object relational graph (ORG) to encode interaction features and design a teacher-recommended learning (TRL) method to utilize linguistic knowledge.

Inspired by the successful application of RL methods in image captioning (Rennie et al., 2017), Pasunuru and Bansal (2017b) propose a modified reward that compensates for the logical contradiction in phrase-matching metrics as the direct optimization target in video captioning. The gradient of the non-differentiable RL loss function is computed and back-propagated with the REINFORCE algorithm (Williams, 1992). However, calculating the reward for each training batch adds a non-negligible computation cost to the training process and slows down optimization. In addition, the improvements of RL methods on various metrics are not comparable with the improvement on the specific metric used as the RL reward.

3. The Proposed Approaches

We consider the video captioning task as a supervised task. The training set is annotated as N pairs $\{X_i, \hat{Y}_i\}$, where $X_i$ denotes a video and $\hat{Y}_i$ represents the corresponding target caption. Suppose a video has M frames and its caption consists of $L_i$ words; then we have:

$X_i = \{x_{i,0}, x_{i,1}, \ldots, x_{i,M-1}\}, \qquad \hat{Y}_i = \{\hat{y}_{i,0}, \hat{y}_{i,1}, \ldots, \hat{y}_{i,L_i-1}\}, \qquad (1)$

where each x denotes a single frame and each y denotes a word belonging to a fixed known dictionary.

A pretrained model is used to produce word embeddings, and we obtain a low-dimensional embedding of the caption $\hat{Y}_i \in \mathbb{R}^{L_i \times D_w}$:

$\hat{Y}_i = (w_{i,0}, w_{i,1}, \ldots, w_{i,L_i-1})^{T}, \qquad w_{i,j} \in \mathbb{R}^{D_w}, \qquad (2)$

where Dw is the dimension of the word embedding space.
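As an illustration of this embedding step, the sketch below builds a fixed embedding matrix from pretrained GloVe vectors, the pretrained model named later in section 4.3; the file format, helper names, and default dimension are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def load_glove_embeddings(glove_path, vocab, dim=300):
    """Build a fixed word-embedding matrix from a standard GloVe text file.

    vocab maps words to integer ids; rows for words missing from GloVe stay zero.
    The path, vocabulary object, and dimension are illustrative assumptions.
    """
    emb = np.zeros((len(vocab), dim), dtype=np.float32)
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], parts[1:]
            if word in vocab and len(vec) == dim:
                emb[vocab[word]] = np.asarray(vec, dtype=np.float32)
    return emb  # w_{i,j} is then emb[id of the j-th word of caption i]
```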

3.1. Encoder-Decoder Paradigm

3.1.1. Encoder

Our encoder is composed of a 3D ConvNet, a 2D ConvNet, and a semantic detection network (SDN). The 3D ConvNet produces the spatio-temporal feature $e_i \in \mathbb{R}^{D_e}$ for the ith video, and the 2D ConvNet extracts the static visual feature $r_i \in \mathbb{R}^{D_r}$ for the ith video. The visual spatio-temporal representation of the ith video is then obtained by concatenating both features together as follows:

$v_i = (r_i, e_i) \in \mathbb{R}^{D_v}, \qquad (3)$

where Dv = De + Dr.

For semantic detection, we manually select the K most common and meaningful words, which consist of the most frequent nouns, verbs, and adjectives, from both the training set and the validation set as candidate tags for all videos (Gan et al., 2017). The semantic detection task is treated as a multi-label classification task with $v_i$ as the representation of the ith video and $\hat{s}_i = \{\hat{s}_{i,0}, \hat{s}_{i,1}, \ldots, \hat{s}_{i,K-1}\} \in \{0,1\}^K$ as the ground truth. If the jth tag exists in the annotations of the ith video, then $\hat{s}_{i,j} = 1$; otherwise, $\hat{s}_{i,j} = 0$. Suppose $s_i$ is the semantic feature of the ith video. Then we have $s_i = \sigma(f(v_i)) \in (0,1)^K$, where f(·) is a non-linear mapping and σ(·) is the sigmoid activation function. Mean average precision is applied to evaluate the quality of semantic features. A multi-layer perceptron (MLP) of adequate scale is exploited to learn semantic representations from the samples. The set of input features is determined by the experimental results for each dataset. The SDN is trained by minimizing the loss function:

$L(s_i, \hat{s}_i) = -\frac{1}{N} \sum_{i=0}^{N-1} \sum_{j=0}^{K-1} \left[ \hat{s}_{i,j} \log s_{i,j} + (1 - \hat{s}_{i,j}) \log(1 - s_{i,j}) \right]. \qquad (4)$

A probability distribution of tags si is produced by the SDN to represent the semantic content of the ith video in the training set, the validation set or the test set.
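To make the SDN concrete, the following is a minimal TensorFlow/Keras sketch of such a multi-label tag classifier trained with the binary cross-entropy of Equation (4). The hidden-layer sizes, the number of tags, and all names are illustrative assumptions, not the exact architecture used in the paper.

```python
import tensorflow as tf

def build_sdn(feature_dim, num_tags, hidden_units=(512, 512)):
    """Multi-label tag classifier: concatenated video feature v_i -> tag probabilities s_i."""
    inputs = tf.keras.Input(shape=(feature_dim,), name="video_feature")
    x = inputs
    for units in hidden_units:                      # MLP of "adequate scale" (sizes assumed)
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    # Independent sigmoid outputs give s_i in (0, 1)^K, one probability per candidate tag.
    outputs = tf.keras.layers.Dense(num_tags, activation="sigmoid", name="tags")(x)
    model = tf.keras.Model(inputs, outputs)
    # Binary cross-entropy summed over tags corresponds to the loss in Equation (4).
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# Example usage (feature dimensions follow section 4.3; num_tags is an assumption):
# sdn = build_sdn(feature_dim=2048 + 1536, num_tags=300)
# sdn.fit(video_features, tag_labels, batch_size=64)
```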

3.1.2. Decoder

Standard RNNs (Elman, 1990) are capable of learning temporal patterns from input sequences, but they suffer from the vanishing/exploding gradient problem, which limits their ability to generalize to long sequences. LSTM (Hochreiter and Schmidhuber, 1997) is a prevailing variant of RNN that alleviates the long-term dependency problem by using gates to update the cell state, but it ignores the semantic information of the input sequence. We use the Semantic Compositional Network (SCN) (Gan et al., 2017), a variant of LSTM, as our decoder, because it not only avoids the long-term dependency problem but also takes advantage of the semantic information of the input video. Suppose we have a video feature v, a semantic feature s, an input vector $x_t$ at time step t, and a hidden state $h_{t-1}$ at time step t − 1. The SCN integrates the semantic information s into v, $x_t$, and $h_{t-1}$, respectively, and obtains the semantics-related video feature $\hat{v}$, the semantics-related input $\hat{x}_t$, and the semantics-related hidden state $\hat{h}_{t-1}$ as follows:

$\hat{x}_{z,t} = W_{z,c} \cdot \big( (W_{z,a} \cdot x_t) \odot (W_{z,b} \cdot s) \big), \quad z \in \{c, i, f, o\},$
$\hat{v}_{z} = C_{z,c} \cdot \big( (C_{z,a} \cdot v) \odot (C_{z,b} \cdot s) \big), \quad z \in \{c, i, f, o\},$
$\hat{h}_{z,t-1} = U_{z,c} \cdot \big( (U_{z,a} \cdot h_{t-1}) \odot (U_{z,b} \cdot s) \big), \quad z \in \{c, i, f, o\}, \qquad (5)$

where c, i, f and o denote the cell state, the input gate, the forget gate and the output gate, respectively.

Then the input gate $i_t$, the forget gate $f_t$, and the output gate $o_t$ at time step t are calculated in a way similar to those of the standard LSTM:

$i_t = \sigma(\hat{x}_{i,t} + \hat{h}_{i,t-1} + \hat{v}_i + b_i),$
$f_t = \sigma(\hat{x}_{f,t} + \hat{h}_{f,t-1} + \hat{v}_f + b_f),$
$o_t = \sigma(\hat{x}_{o,t} + \hat{h}_{o,t-1} + \hat{v}_o + b_o), \qquad (6)$

where σ denotes the logistic sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}} \in (0, 1)$ and b is a bias term for each gate.

The raw cell state at the current step t can be computed as follows:

$\hat{c}_t = \tanh(\hat{x}_{c,t} + \hat{h}_{c,t-1} + \hat{v}_c + b_c), \qquad (7)$

where tanh denotes the hyperbolic tangent $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \in (-1, 1)$ and $b_c$ is the bias term for the cell state. The input gate $i_t$ controls the throughput of the semantics-related input $\hat{x}_t$, and the forget gate $f_t$ determines how much of the previous cell state $c_{t-1}$ is preserved. Thus, we have the final cell state $c_t$ at time step t:

$c_t = f_t \odot c_{t-1} + i_t \odot \hat{c}_t. \qquad (8)$

The output gate controls the throughput ratio of the cell state ct so that the cell output ht can be determined by:

$h_t = o_t \odot \tanh(c_t). \qquad (9)$

The semantics-related variables $\hat{x}_t$, $\hat{v}$, $\hat{h}_{t-1}$, and $\hat{c}_t$ depend on the semantic feature s, so the SCN implicitly takes the semantic information of the video into account. The forget gate $f_t$ is a key component in updating $c_{t-1}$ to $c_t$, which, to some degree, avoids the long-term dependency problem. An overview of the SCN unit is shown in Figure 1.

Figure 1. Overview of the SCN unit. σ, ϕ, and semantic fusion (f) denote a sigmoid function, a tanh function, and Equation (5), respectively. ⊙ and ⊕ in a circle denote element-wise product and element-wise addition, respectively.
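For readers who prefer code to equations, the following NumPy sketch implements one SCN step following Equations (5)-(9). The factorization rank, the initialization, and the class interface are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SCNCell:
    """Illustrative NumPy sketch of one semantics-fused LSTM (SCN) step, Equations (5)-(9)."""

    def __init__(self, dim_w, dim_v, dim_s, dim_h, rank=256, seed=0):
        rng = np.random.default_rng(seed)
        self.params = {}
        for z in "cifo":  # cell candidate, input gate, forget gate, output gate
            self.params[z] = {
                "Wa": rng.normal(0, 0.01, (rank, dim_w)),  # acts on the input x_t
                "Wb": rng.normal(0, 0.01, (rank, dim_s)),  # acts on the semantic feature s
                "Wc": rng.normal(0, 0.01, (dim_h, rank)),
                "Ca": rng.normal(0, 0.01, (rank, dim_v)),  # acts on the video feature v
                "Cb": rng.normal(0, 0.01, (rank, dim_s)),
                "Cc": rng.normal(0, 0.01, (dim_h, rank)),
                "Ua": rng.normal(0, 0.01, (rank, dim_h)),  # acts on the hidden state h_{t-1}
                "Ub": rng.normal(0, 0.01, (rank, dim_s)),
                "Uc": rng.normal(0, 0.01, (dim_h, rank)),
                "b": np.zeros(dim_h),
            }

    def _fuse(self, z, x, v, h_prev, s):
        """Semantic fusion of Equation (5) for branch z, then the sum used in Eqs. (6)-(7)."""
        p = self.params[z]
        x_hat = p["Wc"] @ ((p["Wa"] @ x) * (p["Wb"] @ s))
        v_hat = p["Cc"] @ ((p["Ca"] @ v) * (p["Cb"] @ s))
        h_hat = p["Uc"] @ ((p["Ua"] @ h_prev) * (p["Ub"] @ s))
        return x_hat + v_hat + h_hat + p["b"]

    def step(self, x, v, s, h_prev, c_prev):
        i = sigmoid(self._fuse("i", x, v, h_prev, s))      # input gate,  Eq. (6)
        f = sigmoid(self._fuse("f", x, v, h_prev, s))      # forget gate, Eq. (6)
        o = sigmoid(self._fuse("o", x, v, h_prev, s))      # output gate, Eq. (6)
        c_raw = np.tanh(self._fuse("c", x, v, h_prev, s))  # raw cell state, Eq. (7)
        c = f * c_prev + i * c_raw                         # Eq. (8)
        h = o * np.tanh(c)                                 # Eq. (9)
        return h, c
```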

3.2. Training Method

In the context of an RNN trained with the Teacher Forcing algorithm, the log-probability $\log P(Y_i \mid X_i; \Theta)$ of a given input/output/label triplet $(X_i, Y_i, \hat{Y}_i)$ and given model parameters Θ can be calculated as:

$\log P(Y_i \mid X_i; \Theta) = \sum_{t=0}^{L_i-1} \log P(y_{i,t} \mid \hat{y}_{i,0}, \ldots, \hat{y}_{i,t-1}, X_i; \Theta), \qquad (10)$

where Li is the length of output.

In the case of SCN, the joint logarithmic probability can be computed as:

$\log P(Y_i \mid X_i; \Theta) = \sum_{t=0}^{L_i-1} \log P(y_{i,t} \mid \hat{y}_{i,0}, \ldots, \hat{y}_{i,t-1}, s_i, X_i; \Theta)$
$\qquad\qquad\qquad\quad\; = \sum_{t=0}^{L_i-1} \log P(y_{i,t} \mid h_{i,t-1}, c_{i,t-1}, \hat{y}_{i,t-1}, s_i, X_i; \Theta), \qquad (11)$

where hi,t, ci,t, and si are the output state, the cell state and the semantic feature of the ith video, respectively.

To some extent, hi,t and ci,t can be viewed as the aggregation of all the previous information. We can compute them using the recurrence relation:

$h_{i,t} = \begin{cases} f(X_i, h_{i,t-1}, c_{i,t-1}, s_i, X_i; \Theta) & \text{if } t = 0, \\ f(\hat{y}_{i,t-1}, h_{i,t-1}, c_{i,t-1}, s_i, X_i; \Theta) & \text{if } t > 0, \end{cases}$
$c_{i,t} = \begin{cases} g(X_i, h_{i,t-1}, c_{i,t-1}, s_i, X_i; \Theta) & \text{if } t = 0, \\ g(\hat{y}_{i,t-1}, h_{i,t-1}, c_{i,t-1}, s_i, X_i; \Theta) & \text{if } t > 0, \end{cases} \qquad (12)$

where $h_{i,-1} = 0$ and $c_{i,-1} = 0$. In inference, we need to replace $\hat{y}_{i,t}$ with $y_{i,t}$, which may lead to the accumulation of prediction errors.

In order to bridge the gap between training and testing in the Teacher Forcing algorithm, we train our video captioning model with scheduled sampling. Scheduled sampling gradually shifts the training process from using ground-truth words Ŷi for guidance to using sampled words Yi for guidance at each recurrent step. The commonly used strategy to sample a word from the output distribution is argmax, but its search scope is limited to a relatively small part of the search space, since it always selects the word with the largest probability. To enlarge the search scope, we draw a word randomly from the output distribution as a part of the input for the next recurrent step. In this way, words with higher probabilities are more likely to be chosen, and the randomness of the sampling procedure enables the recurrent network to explore a relatively large range of its state space. In addition, the network is less likely to get stuck in a local minimum. From the perspective of training machine learning models, the multinomial sampling strategy reduces overfitting of the network; in other words, it acts like a regularizer.

Our method to optimize the language model consists of two parts: the outer loop (Algorithm 1) schedules the sampling probability ϵ across epochs, while the algorithm inside the RNN (Algorithm 2) specifies the procedure for sampling from the output of the model with a given probability as a part of the input for the next step of the RNN.

Algorithm 1. Scheduling Algorithm: schedule the ϵ across epochs.

Algorithm 2. Random Sampling Algorithm: specific procedures in RNN.
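Since the algorithm figures are not reproduced here, the following Python sketch outlines the idea: ϵ is scheduled per epoch outside the RNN, and inside the RNN the next input token is drawn from the model's own output distribution with probability ϵ, otherwise taken from the ground truth. The function names and the decoder interface are illustrative assumptions.

```python
import numpy as np

def scheduled_sampling_step(decoder_step, word_embed, gt_words, v, s, h0, c0,
                            epsilon, rng=None):
    """Decode one training caption with scheduled (multinomial) sampling.

    decoder_step(x, v, s, h, c) -> (vocab_probs, h, c) wraps one SCN step plus a softmax;
    word_embed(token_id) returns the pretrained embedding of a token; gt_words is the
    ground-truth token sequence; h0/c0 are zero initial states. All names are assumptions.
    """
    rng = rng or np.random.default_rng()
    h, c = h0, c0
    x = v                              # first input follows the t = 0 case of Equation (12)
    neg_log_lik = 0.0
    for gt in gt_words:
        probs, h, c = decoder_step(x, v, s, h, c)
        neg_log_lik -= np.log(probs[gt] + 1e-12)    # -log p(ŷ_t | ...)
        if rng.random() < epsilon:
            nxt = rng.choice(len(probs), p=probs)   # multinomial sampling from the model
        else:
            nxt = gt                                # teacher forcing: ground-truth token
        x = word_embed(nxt)
    return neg_log_lik
```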

3.3. Sentence-Length-Related Loss Function

What is a good description for a video? A good description should be both accurate and concise. In order to achieve this goal, we design a sentence-length-modulated loss function for our model as follows:

$\mathrm{Loss}(\hat{y}_i, s_i, X_i; \Theta) = -\sum_{i=0}^{bs-1} \frac{1}{L_i^{\beta}} \sum_{t=0}^{L_i-1} \log p(\hat{y}_{i,t} \mid h_{i,t-1}, c_{i,t-1}, s_i, X_i; \Theta), \qquad (13)$

where bs is the batch size and β ≥ 0 is a hyper-parameter that balances the conciseness and the accuracy of the generated captions. If β = 0, Equation (13) reduces to the loss function commonly used in video captioning tasks:

$\mathrm{Loss}(\hat{y}_i, s_i, X_i; \Theta) = -\sum_{i=0}^{bs-1} \sum_{t=0}^{L_i-1} \log p(\hat{y}_{i,t} \mid h_{i,t-1}, c_{i,t-1}, s_i, X_i; \Theta). \qquad (14)$

In this loss function, a long sentence has greater loss than a short sentence. Thus, after minimizing the loss, the RNN is inclined to generate relatively short annotations that may be incomplete in semantics or sentence structure. If β = 1, all words in the generated captions are treated equally in the loss function as well as in the process of optimization, which may lead to redundancy or duplicate words in the process of generating captions.

Thus, we have the following optimization problem:

$\Theta^{*} = \arg\min_{\Theta} \left( -\sum_{i=0}^{N-1} \frac{1}{L_i^{\beta}} \sum_{t=0}^{L_i-1} \log p(\hat{y}_{i,t} \mid h_{i,t-1}, c_{i,t-1}, s_i, X_i; \Theta) \right), \qquad (15)$

where N is the size of the training data and Θ is the parameter of our model.
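A minimal TensorFlow sketch of the sentence-length-modulated loss of Equation (13) is given below, assuming the per-word log-probabilities of the target caption have already been gathered into a zero-padded tensor; the function name and tensor layout are assumptions.

```python
import tensorflow as tf

def length_modulated_loss(log_probs, lengths, beta=0.7):
    """Sentence-length-modulated negative log-likelihood, Equation (13).

    log_probs: [batch, max_len] log p(ŷ_{i,t} | ...) for each target word, zero-padded.
    lengths:   [batch] caption lengths L_i.
    beta = 0 recovers the unnormalized loss of Equation (14); beta = 1 fully
    normalizes every caption by its length.
    """
    mask = tf.sequence_mask(lengths, maxlen=tf.shape(log_probs)[1],
                            dtype=log_probs.dtype)
    per_sentence = tf.reduce_sum(log_probs * mask, axis=1)   # sum_t log p(ŷ_{i,t} | ...)
    norm = tf.pow(tf.cast(lengths, log_probs.dtype), beta)   # L_i ** beta
    return -tf.reduce_sum(per_sentence / norm)
```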

GNMT, Google's Neural Machine Translation system, employs a similar length-normalization technique in the beam search during test, but not during training (Wu et al., 2016). In contrast, our model abandons beam search in the decoder, and the model parameters are optimized by the sentence-length-modulated loss function (13). Note that beam search makes the decoding process slower.

The overall structure of our model is visualized in Figure 2. Our SDN and visual feature extractors in the encoder component share the same 2D ConvNet and 3D ConvNet in practice.

Figure 2. Overall framework of our model. A 3D ConvNet, a 2D ConvNet, and a Semantic Detection Network (SDN) constitute the encoder component of our model. S-LSTM stands for a semantics-assisted variant of LSTM which takes a semantic feature, a visual feature, and a word embedding as inputs at each step. The word fed as input to the decoder is randomly sampled from human annotations or from the model itself, and is then embedded with the pretrained weights.

4. Experiments

We evaluate our model on two popular video captioning datasets to show the performance of our approach. We compare our results to other existing methods.

4.1. Datasets

4.1.1. YouTube2Text

The YouTube2Text or MSVD dataset (Chen and Dolan, 2011; Guadarrama et al., 2013), published in 2013, contains 1,970 short YouTube video clips with an average length of about 10 s, and each clip has roughly 40 descriptions. We follow the dataset split used in prior studies (Pan et al., 2016; Yu et al., 2016; Gan et al., 2017): 1,200 clips for training, 100 clips for validation, and the remaining clips for testing. We tokenize the captions from the training and validation sets and obtain approximately 14,000 unique words; 12,592 of them are used for prediction, and the remaining words are replaced by <unk>. We add the token <eos> to signal the end of a sentence.

4.1.2. MSR-VTT

MSR-Video to Text (MSR-VTT) (Pan et al., 2016; Xu et al., 2016) is a large-scale video benchmark, first presented in 2016. In its first version, MSR-VTT provided 10k short video segments with 200k descriptions in total; each segment was described by about 20 independent English sentences. In its second version, published in 2017, MSR-VTT provides an additional 3k short clips as a test set, and the video clips of the first version can be used as training and validation sets. Because human annotations for the test set of the second version are lacking, we perform experiments on the first version. We tokenize the captions and obtain 14,071 unique words that appear more than once in the training and validation sets of MSR-VTT 1.0; 13,794 of them are indexed with integer numbers starting at 0, and the rest are substituted by <unk>. The token <eos>, which signifies the end of a sentence, is added to the vocabulary of MSR-VTT.
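The vocabulary construction described above for both datasets can be sketched as follows; the exact tokenizer is not specified beyond what is stated, so the helper below is an illustrative assumption.

```python
from collections import Counter

def build_vocab(captions, min_count=2):
    """Keep words above a frequency threshold; rare words map to <unk>, <eos> ends a caption."""
    counts = Counter(w for cap in captions for w in cap.lower().split())
    words = [w for w, c in counts.most_common() if c >= min_count]
    vocab = {w: i for i, w in enumerate(words)}   # ids start at 0, as in the text
    vocab["<unk>"] = len(vocab)
    vocab["<eos>"] = len(vocab)
    return vocab

def encode(caption, vocab):
    """Map a caption to integer ids, appending the end-of-sentence token."""
    return [vocab.get(w, vocab["<unk>"]) for w in caption.lower().split()] + [vocab["<eos>"]]
```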

4.2. Overall Score

Based on the widely used BLEU, METEOR, ROUGE-L, and CIDEr metrics, we propose an overall score to evaluate the performance of a language model:

$S_{\mathrm{overall}} = \frac{1}{4}\left( \frac{\mathrm{B\text{-}4}}{\mathrm{top1}(\mathrm{B\text{-}4})} + \frac{C}{\mathrm{top1}(C)} + \frac{M}{\mathrm{top1}(M)} + \frac{R}{\mathrm{top1}(R)} \right) \in [0, 1], \qquad (16)$

where B-4 denotes BLEU-4, C denotes CIDEr, M denotes METEOR, R denotes ROUGE-L, and top1(·) denotes the best numeric value of the specific metric. We presume that BLEU-4, CIDEr, METEOR, and ROUGE-L each reflect a particular aspect of the performance of a model. First, we normalize each metric value of a model by the corresponding best value, and then we take the mean of the normalized values as an overall measurement for that model (16). If the results of a model on each metric are close to the best results of all models, the overall score will be close to 1. If and only if a model achieves state-of-the-art performance on all metrics is its overall score 1. If a model is much lower than the state-of-the-art result on each metric, its overall score will be close to 0.
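For clarity, Equation (16) amounts to the following small helper; the numbers in the commented example are placeholders, not results from the paper.

```python
def overall_score(model_metrics, best_metrics):
    """Mean of metric values normalized by the best reported value of each metric, Eq. (16)."""
    keys = ("BLEU-4", "CIDEr", "METEOR", "ROUGE-L")
    return sum(model_metrics[k] / best_metrics[k] for k in keys) / len(keys)

# Illustrative numbers only:
# overall_score({"BLEU-4": 50.0, "CIDEr": 90.0, "METEOR": 33.0, "ROUGE-L": 70.0},
#               {"BLEU-4": 54.0, "CIDEr": 95.0, "METEOR": 36.0, "ROUGE-L": 73.0})
```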

4.3. Training Details

Our visual feature consists of two parts: a static visual feature and a dynamic visual feature. ResNeXt (Xie et al., 2017), pretrained on the ImageNet ILSVRC2012 dataset, is the static visual feature extractor in the encoder of our model, and ECO (Zolfaghari et al., 2018), pretrained on the Kinetics-400 dataset, is the dynamic visual feature extractor. More specifically, 32 frames are extracted evenly from each video clip. For each video, we feed the 32 frames into ResNeXt, take the conv5/block3 outputs, and apply average pooling to them along the time axis; the resulting 2048-dimensional vector is taken as the 2D representation of that video. In addition, we take the 1536-dimensional global-pool feature of ECO as the 3D representation of each video. Global Vectors for Word Representation (GloVe) (Pennington et al., 2014) is used as the pretrained word embedding model in our experiments, and it is kept fixed during training.
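The feature-extraction step just described amounts to the following small sketch; the array shapes follow the text, and the function name is an illustrative assumption.

```python
import numpy as np

def video_representation(resnext_frame_feats, eco_global_feat):
    """Build v_i of Equation (3) from the extracted CNN features.

    resnext_frame_feats: [32, 2048] conv5/block3 features, one row per sampled frame.
    eco_global_feat:     [1536] global-pool feature from ECO.
    Returns a 3584-dimensional vector.
    """
    r = resnext_frame_feats.mean(axis=0)         # average pooling along the time axis
    return np.concatenate([r, eco_global_feat])  # concatenate 2D and 3D representations
```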

We set the initial learning rate to 2 × 10−4 for the YouTube2Text dataset and 4 × 10−4 for the MSR-VTT dataset. In addition, we drop the learning rate by a factor of 0.316 every 20,350 steps for the MSR-VTT dataset. The batch size is set to 64, and the Adam algorithm is applied to optimize the model for both datasets. The hyper-parameter β1 is set to 0.9, β2 to 0.999, and ϵ to 1 × 10−8 for the Adam algorithm. Each model is trained for 50 epochs, during which the sampling probability ϵ of scheduled sampling is set to 0.008 × ep for the ep-th epoch. We fine-tune the hyper-parameters of our model on the validation sets and select the best checkpoint for testing according to the overall score of the evaluation on the validation set.
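The learning-rate and sampling-probability schedules described above can be written compactly as below; reading the 0.316 drop as a multiplicative factor is an interpretation of the text rather than a stated formula.

```python
def sampling_probability(epoch):
    """Scheduled-sampling probability for the epoch-th epoch (0.008 x epoch, 50 epochs total)."""
    return 0.008 * epoch

def learning_rate(step, base_lr=4e-4, decay=0.316, decay_steps=20350):
    """MSR-VTT schedule: multiply the base rate by 0.316 every 20,350 steps.
    YouTube2Text uses a constant 2e-4 (no decay is stated for that dataset)."""
    return base_lr * (decay ** (step // decay_steps))
```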

4.4. Comparison With Existing Models

Empirically, we evaluate our method on the YouTube2Text/MSVD (Guadarrama et al., 2013) and MSR-VTT (Xu et al., 2016) datasets. We report the results of our model along with a number of existing models in Tables 1, 2.

Table 1. Result comparison with existing models on the YouTube2Text dataset.

Table 2. Result comparison with existing models on the MSR-VTT dataset.

4.4.1. Comparison on the YouTube2Text Dataset

Table 1 displays the performance of several models on YouTube2Text. We compare our model with existing methods, including LSTM-E (Pan et al., 2016), h-RNN (Yu et al., 2016), aLSTMs (Gao et al., 2017), SCN (Gan et al., 2017), MTVC (Pasunuru and Bansal, 2017a), ECO (Zolfaghari et al., 2018), SibNet (Liu et al., 2018), POS (Wang et al., 2019a), MARN (Pei et al., 2019), JSRL-VCT (Hou et al., 2019), GRU-EVE (Aafaq et al., 2019), STG-KD (Pan et al., 2020), SAAT (Zheng et al., 2020), and ORG-TRL (Zhang et al., 2020). Our method outperforms all the other methods on all the metrics by a large margin. Note that many of them were published after our initial submission of the present work at the end of May 2019. Specifically, compared with ORG-TRL (Zhang et al., 2020), the previous state-of-the-art model on this dataset, BLEU-4, CIDEr, METEOR, and ROUGE-L are improved relatively by 14.9, 15.2, 7.1, and 4.2%, respectively. Our model has the highest overall score as defined in (16).

4.4.2. Comparison on the MSR-VTT Dataset

Table 2 displays the evaluation results of several video captioning models on the MSR-VTT dataset. In this table, we compare our model with existing models, including MTVC (Pasunuru and Bansal, 2017a), CIDEnt-RL (Pasunuru and Bansal, 2017b), SibNet (Liu et al., 2018), HACA (Wang et al., 2018), TAMoE (Wang et al., 2019b), POS (Wang et al., 2019a), MARN (Pei et al., 2019), JSRL-VCT (Hou et al., 2019), GRU-EVE (Aafaq et al., 2019), STG-KD (Pan et al., 2020), SAAT (Zheng et al., 2020), and ORG-TRL (Zhang et al., 2020). According to the overall score defined in (16), ORG-TRL is the best among the existing models, and our model achieves higher values than it on all metrics. POS and JSRL-VCT achieve a slightly higher CIDEr value and METEOR value than our model, respectively, but their other metric values are clearly lower than ours.

Our model achieves better results on both the YouTube2Text dataset and the MSR-VTT dataset. Note that our model is only trained on a single dataset without an attention mechanism, and it is tested without ensemble or beam search.

5. Model Analysis

In this section, we discuss the utility of the three improvements on our model.

5.1. Analysis on Semantic Features

Semantic features are the output of a multi-label classification task. Mean average precision (mAP) is often used to evaluate the results of multi-label classification tasks (Tsoumakas and Katakis, 2007). Here, we apply it to evaluate the quality of semantic features.
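As a reference point, mAP over tags can be computed with scikit-learn as sketched below; averaging per tag and skipping tags absent from the evaluation split are assumptions, since the paper does not spell out the averaging convention.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def semantic_feature_map(tag_labels, tag_scores):
    """Mean average precision of predicted semantic features.

    tag_labels: [num_videos, K] binary ground-truth tags (ŝ).
    tag_scores: [num_videos, K] predicted tag probabilities (s) from the SDN.
    """
    aps = [average_precision_score(tag_labels[:, k], tag_scores[:, k])
           for k in range(tag_labels.shape[1]) if tag_labels[:, k].any()]
    return float(np.mean(aps))
```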

5.1.1. Semantic Features Predicted With Different Sets of Input Features

Figures 3, 4 demonstrate the quality of semantic features, predicted with different sets of feature maps as inputs, as a function of the number of training epochs. Figure 3 shows that, on the YouTube2Text dataset, the mAP values increase with the number of training epochs. For the same number of training epochs, the quality of the semantic features follows the order ECO-ResNeXt > ResNeXt > ECO, where ECO-ResNeXt, ResNeXt, and ECO denote the models trained with visual features from both ECO and ResNeXt, from ResNeXt only, or from ECO only, respectively. Figure 4 demonstrates that, on the MSR-VTT dataset, the mAP values decline after the models are trained for more than 800 epochs with ResNeXt feature maps or ECO-ResNeXt feature maps as inputs. With ECO feature maps as inputs, the performance of the semantic detection model still increases with the number of training epochs.

Figure 3. The quality of semantic features predicted with different sets of input features, evaluated by mAP on the YouTube2Text dataset. "ResNeXt," "ECO," and "ECO-ResNeXt" denote that the semantic models are trained and the semantic features are predicted with visual features produced by ResNeXt, ECO, or both ECO and ResNeXt, respectively.

Figure 4. The quality of semantic features predicted with different sets of input features evaluated by mAP on the MSR-VTT dataset.

5.1.2. Models Trained With Different Semantic Features

Tables 3, 4 list the performance of our model trained by scheduled multinomial sampling with different semantic features on the YouTube2Text and MSR-VTT datasets, respectively. The results clearly show that a better multi-label classification enables a better video captioning model. Semantic features with higher mAP provide more appropriate potential attributes of a video for the model. Thus, the model is able to generate better video annotations by comprehensively considering semantic features, spatio-temporal features, and contextual information.

Table 3. Results of scheduled sampling methods (multinomial sampling) on the YouTube2Text dataset with different sets of semantic features.

Table 4. Results of scheduled sampling methods (multinomial sampling) on MSR-VTT data with different sets of semantic features.

5.2. Analysis on the Scheduled Sampling

Tables 5, 6 show the comparison among the Teacher Forcing algorithm, scheduled sampling with the argmax strategy, and scheduled sampling with the multinomial strategy on the YouTube2Text and MSR-VTT datasets, respectively. Teacher Forcing uses human annotations to guide the generation of words during training and samples from the output distribution of the model to direct generation during inference. The argmax strategy switches gradually from the Teacher Forcing scheme to sampling the word with the largest probability from the model itself during training. The multinomial strategy is similar to the argmax strategy but samples words randomly from the output distribution of the model at each step. As we can infer from Tables 5, 6, scheduled sampling with the multinomial strategy yields a better performance than the other two methods on the YouTube2Text dataset, and scheduled sampling with the argmax strategy yields the best performance on the MSR-VTT dataset. Our method explores a larger range of the RNN state space and thus is more likely to find a better solution during training.

Table 5. Results of different training strategies on YouTube2Text data with the best semantic features.

Table 6. Results of different training strategies on MSR-VTT data with the best semantic features.

5.3. Analysis on the Length Normalization of the Loss Function

As demonstrated in Table 7, the average length of the human annotations is greater than that of the captions generated by the models with β = 0, 0.7, and 1 in (13). But Figure 5 displays a tendency toward redundancy in the captions generated by the β = 1 model, which deteriorates the overall quality of the model-generated sentences. The average caption length of the model with β = 0.7 is greater than that of the model with β = 0, whereas it is smaller than that of the model with β = 1. The model with β = 0.7 generates relatively long annotations for videos without suffering from redundancy or duplication of words, and we therefore consider it the optimal choice.

Table 7. Average length of the captions in the test set.

Figure 5. Examples of machine-generated captions and human annotations (GT).

6. Conclusion

We suggest three improvements for solving the video captioning task. First, mAP is applied to evaluate the quality of semantic information, and an SDN with adequate computational complexity and appropriate input features is used to extract high-quality semantic features from videos, which contributes to the success of our semantics-assisted model. Second, we employ a scheduled sampling training strategy. Third, a sentence-length-modulated loss function is proposed to keep the model in a balance between language redundancy and conciseness. Our method achieves results superior to the state-of-the-art on the YouTube2Text dataset. The performance of our model is comparable to the state-of-the-art on the MSR-VTT dataset. In the future, we may obtain further improvements in video captioning by integrating spatio-temporal attention mechanisms with visual-semantic features.

Data Availability Statement

The YouTube2Text/MSVD dataset could be obtained from http://www.cs.utexas.edu/users/ml/clamp/videoDescription/ and the MSR-VTT dataset could be obtained from http://ms-multimedia-challenge.com/2016/dataset.

Ethics Statement

Written informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.

Author Contributions

HC designed and performed the experiments. HC, JL and XH analyzed the experimental results and wrote the article. KL and AM analyzed the data and polished the manuscript. All authors contributed to the article and approved the submitted version.

Funding

This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700904, in part by the National Natural Science Foundation of China under Grant 61621136008, in part by the German Research Council (DFG) under Grant TRR-169, and in part by Samsung under contract NO. 20183000089.

Conflict of Interest

KL is employed by Samsung Research China, Beijing (SRC-B), China. The authors declare that this study received funding from Samsung Research China, Beijing (SRC-B). The funder had the following involvement in the study: study design, data analysis and writing of the article.

Acknowledgments

The authors thank Han Liu, Hallbjorn Thor Gudmunsson, and Jing Wen for valuable and insightful discussions.


References

Aafaq, N., Akhtar, N., Liu, W., Gilani, S. Z., and Mian, A. (2019). “Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Long Beach, CA). doi: 10.1109/CVPR.2019.01277


Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., et al. (2018). “Bottom-up and top-down attention for image captioning and visual question answering,” in IEEE Conference on Computer Vision and Pattern Recognition (Salt Lake City, UT), 6077–6086. doi: 10.1109/CVPR.2018.00636


Ba, J., Mnih, V., and Kavukcuoglu, K. (2015). “Multiple object recognition with visual attention,” in 3rd International Conference on Learning Representations (San Diego, CA).


Bahdanau, D., Cho, K., and Bengio, Y. (2015). “Neural machine translation by jointly learning to align and translate,” in 3rd International Conference on Learning Representations (San Diego, CA).


Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015). “Scheduled sampling for sequence prediction with recurrent neural networks,” in Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems (Montreal, QC), 1171–1179.


Chen, D. L., and Dolan, W. B. (2011). “Collecting highly parallel data for paraphrase evaluation,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL-2011) (Portland, OR).


Chen, Y., Wang, S., Zhang, W., and Huang, Q. (2018). “Less is more: picking informative frames for video captioning,” in Computer Vision - 15th European Conference (Munich), 367–384. doi: 10.1007/978-3-030-01261-8_22


Cho, K., van Merrienboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., et al. (2014). “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in EMNLP, (Doha), 1724–1734. doi: 10.3115/v1/D14-1179


Cornia, M., Stefanini, M., Baraldi, L., and Cucchiara, R. (2019). Meshed-memory transformer for image captioning. arXiv preprint arXiv:1912.08226. doi: 10.1109/CVPR42600.2020.01059


Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. doi: 10.18653/v1/n19-1423


Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Darrell, T., et al. (2015). “Long-term recurrent convolutional networks for visual recognition and description,” in IEEE Conference on Computer Vision and Pattern Recognition (Boston, MA), 2625–2634. doi: 10.1109/CVPR.2015.7298878


Elman, J. L. (1990). Finding structure in time. Cogn. Sci. 14, 179–211. doi: 10.1207/s15516709cog1402_1


Gan, Z., Gan, C., He, X., Pu, Y., Tran, K., Gao, J., et al. (2017). “Semantic compositional networks for visual captioning,” in IEEE Conference on Computer Vision and Pattern Recognition (Honolulu, HI), 1141–1150. doi: 10.1109/CVPR.2017.127


Gao, L., Guo, Z., Zhang, H., Xu, X., and Shen, H. T. (2017). Video captioning with attention-based LSTM and semantic consistency. IEEE Trans. Multimedia 19, 2045–2055. doi: 10.1109/TMM.2017.2729019


Goyal, A., Lamb, A., Zhang, Y., Zhang, S., Courville, A. C., and Bengio, Y. (2016). “Professor forcing: a new algorithm for training recurrent networks,” in Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems (Barcelona), 4601–4609.


Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R. J., Darrell, T., et al. (2013). “Youtube2text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition,” in IEEE International Conference on Computer Vision, ICCV (Sydney, NSW), 2712–2719. doi: 10.1109/ICCV.2013.337


Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Comput. 9, 1735–1780. doi: 10.1162/neco.1997.9.8.1735


Hou, J., Wu, X., Zhao, W., Luo, J., and Jia, Y. (2019). “Joint syntax representation learning and visual cue translation for video captioning,” in The IEEE International Conference on Computer Vision (ICCV) (Seoul). doi: 10.1109/ICCV.2019.00901


Huszar, F. (2015). How (not) to train your generative model: scheduled sampling, likelihood, adversary? CoRR abs/1511.05101.


Liu, S., Ren, Z., and Yuan, J. (2018). “SIBNet: sibling convolutional encoder for video captioning,” in ACM Multimedia Conference on Multimedia Conference (Seoul), 1425–1434. doi: 10.1145/3240508.3240667


Mao, J., Xu, W., Yang, Y., Wang, J., and Yuille, A. L. (2014). Explain images with multimodal recurrent neural networks. CoRR abs/1410.1090.


Pan, B., Cai, H., Huang, D.-A., Lee, K.-H., Gaidon, A., Adeli, E., et al. (2020). “Spatio-temporal graph for video captioning with knowledge distillation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Virtual Conference). doi: 10.1109/CVPR42600.2020.01088


Pan, Y., Mei, T., Yao, T., Li, H., and Rui, Y. (2016). “Jointly modeling embedding and translation to bridge video and language,” in IEEE Conference on Computer Vision and Pattern Recognition (Las Vegas, NV), 4594–4602. doi: 10.1109/CVPR.2016.497


Pasunuru, R., and Bansal, M. (2017a). “Multi-task video captioning with video and entailment generation,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Vancouver, BC), 1273-1283. doi: 10.18653/v1/P17-1117


Pasunuru, R., and Bansal, M. (2017b). “Reinforced video captioning with entailment rewards,” in EMNLP (Copenhagen), 979–985. doi: 10.18653/v1/D17-1103


Pei, W., Zhang, J., Wang, X., Ke, L., Shen, X., and Tai, Y.-W. (2019). “Memory-attended recurrent network for video captioning,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Long Beach, CA). doi: 10.1109/CVPR.2019.00854


Pennington, J., Socher, R., and Manning, C. D. (2014). “Glove: global vectors for word representation,” in Empirical Methods in Natural Language Processing (EMNLP) (Doha: Association for Computational Linguistics (ACL)), 1532–1543. doi: 10.3115/v1/D14-1162


Ramanishka, V., Das, A., Zhang, J., and Saenko, K. (2017). “Top-down visual saliency guided by captions,” in IEEE Conference on Computer Vision and Pattern Recognition (Honolulu, HI), 3135–3144. doi: 10.1109/CVPR.2017.334


Rennie, S. J., Marcheret, E., Mroueh, Y., Ross, J., and Goel, V. (2017). “Self-critical sequence training for image captioning,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 (Honolulu, HI), 1179–1195. doi: 10.1109/CVPR.2017.131


Sun, C., Myers, A., Vondrick, C., Murphy, K., and Schmid, C. (2019). “Videobert: a joint model for video and language representation learning,” in IEEE/CVF International Conference on Computer Vision (Seoul: IEEE), 7463–7472. doi: 10.1109/ICCV.2019.00756


Tsoumakas, G., and Katakis, I. (2007). Multi-label classification: an overview. Int. J. Data Warehous. Mining 3, 1–13. doi: 10.4018/jdwm.2007070101


Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R. J., Darrell, T., and Saenko, K. (2015). “Sequence to sequence - video to text,” in ICCV (Santiago), 4534–4542. doi: 10.1109/ICCV.2015.515


Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). “Show and tell: a neural image caption generator,” in IEEE Conference on Computer Vision and Pattern Recognition (Boston, MA), 3156–3164. doi: 10.1109/CVPR.2015.7298935


Wang, B., Ma, L., Zhang, W., Jiang, W., Wang, J., and Liu, W. (2019a). “Controllable video captioning with POS sequence guidance based on gated fusion network,” in The IEEE International Conference on Computer Vision (ICCV) (Seoul). doi: 10.1109/ICCV.2019.00273


Wang, X., Wang, Y., and Wang, W. Y. (2018). “Watch, listen, and describe: globally and locally aligned cross-modal attentions for video captioning,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT (New Orleans, LA), 795–801. doi: 10.18653/v1/N18-2125


Wang, X., Wu, J., Zhang, D., Su, Y., and Wang, W. Y. (2019b). “Learning to compose topic-aware mixture of experts for zero-shot video captioning,” in The Thirty-Third AAAI Conference on Artificial Intelligence, The Thirty-First Innovative Applications of Artificial Intelligence Conference, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence (Honolulu, HI: AAAI Press), 8965–8972. doi: 10.1609/aaai.v33i01.33018965


Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8, 229–256. doi: 10.1007/BF00992696


Williams, R. J., and Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Comput. 1, 270–280. doi: 10.1162/neco.1989.1.2.270


Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., et al. (2016). Google's neural machine translation system: bridging the gap between human and machine translation. Available online at: https://arxiv.org/abs/1609.08144


Xie, S., Girshick, R. B., Dollár, P., Tu, Z., and He, K. (2017). “Aggregated residual transformations for deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (Honolulu, HI), 5987–5995. doi: 10.1109/CVPR.2017.634


Xu, J., Mei, T., Yao, T., and Rui, Y. (2016). “MSR-VTT: a large video description dataset for bridging video and language,” in IEEE Conference on Computer Vision and Pattern Recognition (Las Vegas, NV), 5288–5296. doi: 10.1109/CVPR.2016.571


Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C., Salakhutdinov, R., et al. (2015). “Show, attend and tell: neural image caption generation with visual attention,” in Proceedings of the 32nd International Conference on Machine Learning (Lille), 2048–2057.


Yao, T., Pan, Y., Li, Y., and Mei, T. (2018). “Exploring visual relationship for image captioning,” in Computer Vision - 15th European Conference (Munich), 711–727. doi: 10.1007/978-3-030-01264-9_42


You, Q., Jin, H., Wang, Z., Fang, C., and Luo, J. (2016). “Image captioning with semantic attention,” in IEEE Conference on Computer Vision and Pattern Recognition (Las Vegas, NV), 4651–4659. doi: 10.1109/CVPR.2016.503


Yu, H., Wang, J., Huang, Z., Yang, Y., and Xu, W. (2016). “Video paragraph captioning using hierarchical recurrent neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (Las Vegas, NV), 4584–4593. doi: 10.1109/CVPR.2016.496


Yu, J., Li, J., Yu, Z., and Huang, Q. (2019). Multimodal transformer with multi-view visual representation for image captioning. IEEE Trans. Circ. Syst. Video Technol. doi: 10.1109/TCSVT.2019.2947482. Available online at: https://ieeexplore.ieee.org/document/8869845/


Zhang, Z., Shi, Y., Yuan, C., Li, B., Wang, P., Hu, W., et al. (2020). “Object relational graph with teacher-recommended learning for video captioning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Virtual Conference). doi: 10.1109/CVPR42600.2020.01329


Zheng, Q., Wang, C., and Tao, D. (2020). “Syntax-aware action targeting for video captioning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Virtual Conference). doi: 10.1109/CVPR42600.2020.01311


Zolfaghari, M., Singh, K., and Brox, T. (2018). “ECO: efficient convolutional network for online video understanding,” in Computer Vision - 15th European Conference (Munich), 713–730. doi: 10.1007/978-3-030-01216-8_43


Keywords: video captioning, sentence-length-leveraged loss, semantic assistance, RNN, scheduled sampling

Citation: Chen H, Lin K, Maye A, Li J and Hu X (2020) A Semantics-Assisted Video Captioning Model Trained With Scheduled Sampling. Front. Robot. AI 7:475767. doi: 10.3389/frobt.2020.475767

Received: 31 May 2019; Accepted: 17 August 2020;
Published: 30 September 2020.

Edited by:

Xiaolan Fu, Institute of Psychology (CAS), China

Reviewed by:

Lei Zhang, Microsoft, United States
Cuixia Ma, Chinese Academy of Sciences (CAS), China

Copyright © 2020 Chen, Lin, Maye, Li and Hu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Xiaolin Hu, xlhu@tsinghua.edu.cn
