Deep Learning Course 5, Week 4: Transformers Quiz Notes
- A Transformer Network processes sentences from left to right, one word at a time.
- False
- True
- Transformer Network methodology is taken from:
- GRUs and LSTMs
- Attention Mechanism and RNN style of processing.
- Attention Mechanism and CNN style of processing.
- RNN and LSTMs
- **What are the key inputs to computing the attention value for each word?**
- The key inputs to computing the attention value for each word are called the query, knowledge, and vector.
- The key inputs to computing the attention value for each word are called the query, key, and value.
- The key inputs to computing the attention value for each word are called the quotation, key, and vector.
- The key inputs to computing the attention value for each word are called the quotation, knowledge, and value.
Explanation: The key inputs to computing the attention value for each word are called the query, key, and value.
- Which of the following correctly represents Attention?
- $Attention(Q,K,V)=softmax(\frac{QK^{T}}{\sqrt{d_k}})V$
- $Attention(Q,K,V)=softmax(\frac{QV^{T}}{\sqrt{d_k}})K$
- $Attention(Q,K,V)=min(\frac{QK^{T}}{\sqrt{d_k}})V$
- $Attention(Q,K,V)=min(\frac{QV^{T}}{\sqrt{d_k}})K$
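A minimal NumPy sketch of the first formula above, $Attention(Q,K,V)=softmax(\frac{QK^{T}}{\sqrt{d_k}})V$; the function name, toy shapes, and random inputs are assumptions for illustration only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]                                  # dimension of the keys
    scores = Q @ K.T / np.sqrt(d_k)                    # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax -> attention weights
    return weights @ V                                 # weighted sum of the values

# Toy example: 3 query positions, 4 key/value positions, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 8)), rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)     # -> (3, 8)
```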
- Are the following statements true regarding Query (Q), Key (K) and Value (V)?
Q = interesting questions about the words in a sentence
K = specific representations of words given a Q
V = qualities of words given a Q
- False
- True
Explanation: Q = interesting questions about the words in a sentence, K = qualities of words given a Q, V = specific representations of words given a Q.
- In the Multi-Head Attention formula, $i$ here represents the computed attention weight matrix associated with the $i^{th}$ “word” in a sentence.
- False
- True
Explanation: $i$ here represents the computed attention weight matrix associated with the $i^{th}$ “head” (sequence); see the sketch below.
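A compact NumPy sketch to make the explanation concrete: in Multi-Head Attention the index $i$ runs over heads, each with its own projection matrices, not over word positions. All names and toy dimensions below are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """head_i = Attention(Q W_q[i], K W_k[i], V W_v[i]); i indexes a head, not a word."""
    heads = []
    for i in range(len(W_q)):                              # i runs over heads
        q, k, v = Q @ W_q[i], K @ W_k[i], V @ W_v[i]       # per-head linear projections
        weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # attention weights of head i
        heads.append(weights @ v)
    return np.concatenate(heads, axis=-1) @ W_o            # concatenate heads, project back

# Toy dimensions: sequence length 5, model size 8, 2 heads of size 4
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 8))
W_q = [rng.standard_normal((8, 4)) for _ in range(2)]
W_k = [rng.standard_normal((8, 4)) for _ in range(2)]
W_v = [rng.standard_normal((8, 4)) for _ in range(2)]
W_o = rng.standard_normal((8, 8))
print(multi_head_attention(X, X, X, W_q, W_k, W_v, W_o).shape)  # -> (5, 8)
```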
- The following is the architecture within a Transformer Network (without displaying positional encoding and the output layer(s)).
What is generated from the output of the Decoder’s first block of Multi-Head Attention?
- Q
- K
- V
Explanation: The first block’s output is used to generate the Q matrix for the next Multi-Head Attention block (see the sketch below).
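A minimal sketch of that data flow, using single-head attention and omitting the causal mask for brevity; variable names and shapes are assumed for illustration. It shows the second Multi-Head Attention block taking Q from the decoder's first block and K, V from the encoder output.

```python
import numpy as np

def attention(Q, K, V):                               # single-head scaled dot-product attention
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Assumed toy shapes: decoder length 4, encoder length 6, model size 8
rng = np.random.default_rng(2)
dec_in = rng.standard_normal((4, 8))                  # decoder input (shifted targets + positional enc.)
enc_out = rng.standard_normal((6, 8))                 # output of the encoder stack

block1_out = attention(dec_in, dec_in, dec_in)        # decoder's first MHA block: self-attention
block2_out = attention(block1_out, enc_out, enc_out)  # second MHA block: Q from block 1,
                                                      # K and V from the encoder output
print(block2_out.shape)                               # -> (4, 8)
```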
- The following is the architecture within a Transformer Network (without displaying positional encoding and the output layer(s)).
What is the output layer(s) of the Decoder? (Marked $Y$, pointed to by the independent arrow)
- Softmax layer
- Linear layer
- Linear layer followed by a softmax layer.
- Softmax layer followed by a linear layer.
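In the standard Transformer, the decoder's final output stage is a linear layer followed by a softmax over the vocabulary. A minimal NumPy sketch with assumed toy sizes:

```python
import numpy as np

# Assumed toy sizes: 4 target positions, model size 8, vocabulary size 10
rng = np.random.default_rng(3)
dec_out = rng.standard_normal((4, 8))            # decoder output, one row per target position
W, b = rng.standard_normal((8, 10)), np.zeros(10)

logits = dec_out @ W + b                         # linear layer: project to vocabulary size
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)       # softmax: probabilities over the vocabulary
print(probs.sum(axis=-1))                        # each row sums to 1
```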
- Which of the following statements is true about positional encoding? Select all that apply.
- Positional encoding is important because position and word order are essential in sentence construction of any language.
Explanation: This is a correct answer, but other options are also correct. To review the concept, watch the lecture “Transformer Network.”
- Positional encoding uses a combination of sine and cosine equations.
Explanation: This is a correct answer, but other options are also correct. To review the concept, watch the lecture “Transformer Network” (see the sketch after this question).
- Positional encoding is used in the transformer network and the attention model.
- Positional encoding provides extra information to our model.
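A minimal sketch of the sine/cosine positional encoding referenced above, assuming an even model dimension; the function name and toy sizes are illustrative.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sine on even dimensions, cosine on odd dimensions (d_model assumed even)."""
    pos = np.arange(max_len)[:, None]                    # positions 0 .. max_len-1
    i = np.arange(d_model // 2)[None, :]                 # index of each (sin, cos) dimension pair
    angles = pos / np.power(10000.0, 2 * i / d_model)    # pos / 10000^(2i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

print(positional_encoding(50, 16).shape)                 # -> (50, 16)
```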
- Which of these is a good criterion for a good positional encoding algorithm?
- The algorithm should be able to generalize to longer sentences.
- Distance between any two time-steps should be inconsistent for all sentence lengths.
- It must be nondeterministic.
- It should output a common encoding for each time-step (word’s position in a sentence).