deeplearningbook_055-2

heardlover · published 2026-04-19 16:51:54

Chapter 8. Optimization for Training Deep Models (pages 319-320)

Batch normalization reparametrizes the model to make some units always be standardized by definition, deftly sidestepping both problems. At test time, $\mu$ and $\sigma$ may be replaced by running averages that were collected during training time. This allows the model to be evaluated on a single example, without needing to use definitions of $\mu$ and $\sigma$ that depend on an entire minibatch.

Revisiting the $\hat{y} = x w_1 w_2 \cdots w_l$ example, we see that we can mostly resolve the difficulties in learning this model by normalizing $h_{l-1}$. Suppose that $x$ is drawn from a unit Gaussian. Then $h_{l-1}$ will also come from a Gaussian, because the transformation from $x$ to $h_l$ is linear. However, $h_{l-1}$ will no longer have zero mean and unit variance. After applying batch normalization, we obtain the normalized $\hat{h}_{l-1}$ that restores the zero mean and unit variance properties. For almost any update to the lower layers, $\hat{h}_{l-1}$ will remain a unit Gaussian. The output $\hat{y}$ may then be learned as a simple linear function $\hat{y} = w_l \hat{h}_{l-1}$. Learning in this model is now very simple because the parameters at the lower layers simply do not have an effect in most cases; their output is always renormalized to a unit Gaussian. In some corner cases, the lower layers can have an effect. Changing one of the lower layer weights to 0 can make the output become degenerate, and changing the sign of one of the lower weights can flip the relationship between $\hat{h}_{l-1}$ and $y$. These situations are very rare. Without normalization, nearly every update would have an extreme effect on the statistics of $h_{l-1}$. Batch normalization has thus made this model significantly easier to learn. In this example, the ease of learning of course came at the cost of making the lower layers useless. In our linear example, the lower layers no longer have any harmful effect, but they also no longer have any beneficial effect. This is because we have normalized out the first and second order statistics, which is all that a linear network can influence. In a deep neural network with nonlinear activation functions, the lower layers can perform nonlinear transformations of the data, so they remain useful. Batch normalization acts to standardize only the mean and variance of each unit in order to stabilize learning, but allows the relationships between units and the nonlinear statistics of a single unit to change.

Because the final layer of the network is able to learn a linear transformation, we may actually wish to remove all linear relationships between units within a layer. Indeed, this is the approach taken by Desjardins et al. (2015), who provided the inspiration for batch normalization. Unfortunately, eliminating all linear interactions is much more expensive than standardizing the mean and standard deviation of each individual unit, and so far batch normalization remains the most practical approach.

Normalizing the mean and standard deviation of a unit can reduce the expressive power of the neural network containing that unit. In order to maintain the expressive power of the network, it is common to replace the batch of hidden unit activations $H$ with $\gamma H' + \beta$ rather than simply the normalized $H'$. The variables $\gamma$ and $\beta$ are learned parameters that allow the new variable to have any mean and standard deviation. At first glance, this may seem useless: why did we set the mean to 0, and then introduce a parameter that allows it to be set back to any arbitrary value $\beta$? The answer is that the new parametrization can represent the same family of functions of the input as the old parametrization, but the new parametrization has different learning dynamics. In the old parametrization, the mean of $H$ was determined by a complicated interaction between the parameters in the layers below $H$. In the new parametrization, the mean of $\gamma H' + \beta$ is determined solely by $\beta$. The new parametrization is much easier to learn with gradient descent.

Most neural network layers take the form of $\phi(XW + b)$ where $\phi$ is some fixed nonlinear activation function such as the rectified linear transformation. It is natural to wonder whether we should apply batch normalization to the input $X$, or to the transformed value $XW + b$. Ioffe and Szegedy (2015) recommend the latter. More specifically, $XW + b$ should be replaced by a normalized version of $XW$. The bias term should be omitted because it becomes redundant with the $\beta$ parameter applied by the batch normalization reparametrization. The input to a layer is usually the output of a nonlinear activation function such as the rectified linear function in a previous layer. The statistics of the input are thus more non-Gaussian and less amenable to standardization by linear operations.

Batch normalization reparametrizes the model to make some units always be standardized by definition, deftly sidestepping both problems.
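- Fixed expressions: "by definition" means "as a consequence of how something is defined".
- Sentence analysis: a simple sentence. "Batch normalization" is the subject, "reparametrizes" is the predicate, and "the model" is the object; "to make..." is an adverbial of purpose, and "deftly sidestepping both problems" is a participial phrase expressing an accompanying result.
- Paraphrase: Batch normalization reparametrizes the model so that some units are standardized by definition at all times, which neatly avoids both problems.
- Word analysis:
  - reparametrizes: verb, formed from "re-" (again) + "parametrize" (to express in terms of parameters, also spelled "parameterize"); meaning: to parametrize anew.
    - Mnemonic: "re-" means "again" and "parametrize" means "to express with parameters", so together they mean "to parametrize again".
    - Similar words: parameterize (to express in terms of parameters).
    - Pronunciation:
      - Syllables: re + pa + ram + e + trize /ˌriːpəˈræmətraɪz/, primary stress on the third syllable
      - Rule: re → /riː/, with the long vowel /iː/.
      - Rule: pa → /pə/, with the reduced vowel /ə/.
      - Rule: ram → /ræm/, where "a" is the short vowel /æ/.
      - Rule: e → /ə/, a reduced vowel.
      - Rule: trize → /traɪz/, where "i" is the diphthong /aɪ/.
  - deftly: adverb, derived from "deft" (skillful); meaning: skillfully; neatly.
    - Mnemonic: "deft" + "-ly" forms the adverb; associate "deft" with "def" (as in "defense"): defending with skill.
    - Similar words: deft (skillful).
    - Pronunciation:
      - Syllables: deft + ly /ˈdeftli/, stress on the first syllable
      - Rule: deft → /deft/, where "e" is the short vowel /e/.
      - Rule: ly → /li/.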

At test time, µ and σ may be replaced by running averages that were collected during training time.
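- Fixed expressions: "at test time" means "when the model is being tested"; "be replaced by" means "have something else substituted in its place".
- Sentence analysis: a complex sentence. "At test time" is an adverbial of time; the main clause "µ and σ may be replaced by running averages" is in the passive voice; "that were collected during training time" is a relative clause modifying "running averages".
- Paraphrase: At test time, µ and σ can be swapped for the running averages gathered during training.
- Word analysis:
  - running: adjective, derived from the present participle of "run"; meaning: continuous; ongoing (a "running average" is one that is continuously updated).
    - Mnemonic: "run" means "to move fast"; picture "running" as "still in motion", hence "continuous".
    - Similar words: run (to move fast on foot).
    - Pronunciation:
      - Syllables: run + ning /ˈrʌnɪŋ/, stress on the first syllable
      - Rule: run → /rʌn/, where "u" is the short vowel /ʌ/.
      - Rule: ning → /ɪŋ/; the doubled "n" is sounded only once, in /rʌn/.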

This allows the model to be evaluated on a single example, without needing to use definitions of µ and σ that depend on an entire minibatch.
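- Fixed expressions: "allow sb/sth to do sth" means "permit someone or something to do something"; "depend on" means "be determined by; rely on".
- Sentence analysis: "This" is the subject, "allows" is the predicate, "the model" is the object, and "to be evaluated on a single example" is the object complement; "without needing to use..." is an accompanying adverbial, inside which "that depend on an entire minibatch" is a relative clause modifying "definitions".
- Paraphrase: This lets the model be evaluated on a single example, with no need for definitions of µ and σ that rely on a whole minibatch.
- Word analysis:
  - evaluated: past participle of the verb "evaluate" (to assess); meaning: assessed.
    - Mnemonic: "e-" (out) + "value" + "-ate" (verb suffix): to work out the value of something.
    - Similar words: evaluate (to assess).
    - Pronunciation:
      - Syllables: e + val + u + at + ed /ɪˈvæljueɪtɪd/, stress on the second syllable
      - Rule: e → /ɪ/, a short vowel.
      - Rule: val → /væl/, where "a" is the short vowel /æ/.
      - Rule: u → /ju/.
      - Rule: at → /eɪt/.
      - Rule: ed → /ɪd/, because "-ed" after /t/ is pronounced /ɪd/.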
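
The test-time behavior described in the last two sentences can be sketched in a few lines of Python. This is only an illustrative sketch for a single scalar unit: the class name `BatchNorm1D`, the exponential-moving-average update, and the values of `momentum` and `eps` are assumptions made here, not details given in the passage.

```python
import math
import random


class BatchNorm1D:
    """Batch normalization for one unit (hypothetical helper, scalar activations)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum   # assumed weight given to the newest minibatch
        self.eps = eps             # assumed constant that avoids division by zero
        self.running_mean = 0.0
        self.running_var = 1.0

    def forward_train(self, batch):
        # Training: normalize with the statistics of the current minibatch.
        mu = sum(batch) / len(batch)
        var = sum((h - mu) ** 2 for h in batch) / len(batch)
        # Keep running averages so that they can stand in for mu and sigma later.
        m = self.momentum
        self.running_mean = (1 - m) * self.running_mean + m * mu
        self.running_var = (1 - m) * self.running_var + m * var
        return [(h - mu) / math.sqrt(var + self.eps) for h in batch]

    def forward_test(self, h):
        # Test: one example is enough, because no minibatch statistics are needed.
        return (h - self.running_mean) / math.sqrt(self.running_var + self.eps)


rng = random.Random(0)
bn = BatchNorm1D()
for _ in range(500):
    bn.forward_train([rng.gauss(3.0, 2.0) for _ in range(64)])

# A single example equal to the true mean should map to roughly zero.
single_output = bn.forward_test(3.0)
```

After enough minibatches, `running_mean` and `running_var` settle near the true mean and variance of the activations, which is what makes single-example evaluation possible.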

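Revisiting the $\hat{y} = x w_1 w_2 \cdots w_l$ example, we see that we can mostly resolve the difficulties in learning this model by normalizing $h_{l-1}$.
- Fixed expressions: "resolve the difficulties" means "overcome the difficulties".
- Sentence analysis: a complex sentence. "Revisiting the $\hat{y} = x w_1 w_2 \cdots w_l$ example" is a present participle phrase used as an adverbial; in the main clause "we see...", "that we can mostly resolve the difficulties..." is an object clause.
- Paraphrase: Looking again at the $\hat{y} = x w_1 w_2 \cdots w_l$ example, we find that normalizing $h_{l-1}$ removes most of the difficulty of learning this model.
- Word analysis:
  - revisiting: present participle, formed from "re-" (again) + "visit"; meaning: looking at again; visiting again.
    - Mnemonic: "re-" means "again" and "visit" means "to go and see", so together they mean "to visit again".
    - Similar words: visit (to go and see).
    - Pronunciation:
      - Syllables: re + vis + it + ing /ˌriːˈvɪzɪtɪŋ/, primary stress on the second syllable
      - Rule: re → /riː/, with the long vowel /iː/.
      - Rule: vis → /vɪz/, where "i" is the short vowel /ɪ/.
      - Rule: it → /ɪt/.
      - Rule: ing → /ɪŋ/.
  - resolve: verb, from Latin "resolvere" (to loosen; to settle); meaning: to settle; to decide.
    - Mnemonic: "re-" (again) + "solve": to solve once more is to resolve.
    - Similar words: solve (to find an answer to).
    - Pronunciation:
      - Syllables: re + solve /rɪˈzɒlv/, stress on the second syllable
      - Rule: re → /rɪ/, with the short vowel /ɪ/.
      - Rule: solve → /zɒlv/, where "o" is the short vowel /ɒ/.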

Suppose that x is drawn from a unit Gaussian.
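- Fixed expressions: "be drawn from" means "be sampled from".
- Sentence analysis: "Suppose that..." is an imperative, and "x is drawn from a unit Gaussian" is its object clause.
- Paraphrase: Assume that x is sampled from a unit Gaussian distribution.
- Word analysis:
  - Gaussian: noun here, named after the mathematician Gauss; meaning: Gaussian distribution; normal distribution.
    - Mnemonic: the distribution carries Gauss's name, hence "Gaussian".
    - Similar words: none.
    - Pronunciation:
      - Syllables: Gaus + si + an /ˈɡaʊsiən/, stress on the first syllable
      - Rule: Gaus → /ɡaʊs/, where "au" is the diphthong /aʊ/.
      - Rule: si → /si/.
      - Rule: an → /ən/.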

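Then $h_{l-1}$ will also come from a Gaussian, because the transformation from $x$ to $h_l$ is linear.
- Fixed expressions: "come from" means "originate in".
- Sentence analysis: a complex sentence. "Then $h_{l-1}$ will also come from a Gaussian" is the main clause, and "because the transformation from $x$ to $h_l$ is linear" is an adverbial clause of reason.
- Paraphrase: In that case $h_{l-1}$ is also Gaussian, since the mapping from $x$ to $h_l$ is linear.
- Word analysis:
  - transformation: noun, formed from "transform" (to change) + "-ation" (noun suffix); meaning: change; mapping.
    - Mnemonic: "trans-" (across; change) + "form" (shape) + "-ation" (noun suffix): a change of shape.
    - Similar words: transform (to change).
    - Pronunciation:
      - Syllables: trans + for + ma + tion /ˌtrænsfəˈmeɪʃn/, primary stress on the third syllable
      - Rule: trans → /træns/, where "a" is the short vowel /æ/.
      - Rule: for → /fə/, reduced because the syllable is unstressed (the standalone word "form" is /fɔːm/).
      - Rule: ma → /meɪ/, where "a" is the long vowel /eɪ/.
      - Rule: tion → /ʃn/.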

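However, $h_{l-1}$ will no longer have zero mean and unit variance.
- Fixed expressions: "no longer" means "not any more".
- Sentence analysis: a simple sentence with subject, verb, and object.
- Paraphrase: But $h_{l-1}$ will have lost its zero mean and unit variance.
- Word analysis:
  - variance: noun, derived from "vary" (to change); meaning: variance (in statistics); difference.
    - Mnemonic: "vary" means "to change" and "-ance" is a noun suffix; where things change there is "difference" and "variance".
    - Similar words: vary (to change).
    - Pronunciation:
      - Syllables: var + i + ance /ˈveəriəns/, stress on the first syllable
      - Rule: var → /veər/, where "a" is the diphthong /eə/.
      - Rule: i → /i/.
      - Rule: ance → /əns/.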

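After applying batch normalization, we obtain the normalized $\hat{h}_{l-1}$ that restores the zero mean and unit variance properties.
- Fixed expressions: "apply ... to ..." means "put ... to use on ...".
- Sentence analysis: a complex sentence. "After applying batch normalization" is an adverbial of time; the main clause is "we obtain the normalized $\hat{h}_{l-1}$"; "that restores the zero mean and unit variance properties" is a relative clause modifying "$\hat{h}_{l-1}$".
- Paraphrase: Once batch normalization is applied, we get the normalized $\hat{h}_{l-1}$, which has zero mean and unit variance again.
- Word analysis:
  - normalized: past participle, formed from "normalize" (to standardize) + "-ed"; meaning: standardized.
    - Mnemonic: "normal" + "-ize" (verb suffix) + "-ed" (participle suffix): made normal, that is, normalized.
    - Similar words: normalize (to standardize).
    - Pronunciation:
      - Syllables: nor + mal + ized /ˈnɔːməlaɪzd/, stress on the first syllable
      - Rule: nor → /nɔː/, where "o" is the long vowel /ɔː/.
      - Rule: mal → /məl/, where "a" is the reduced vowel /ə/.
      - Rule: ized → /aɪzd/, where "i" is the diphthong /aɪ/ and "-ed" after /z/ is /d/.
  - restores: verb, from Latin "restaurare" (to repair; to renew); meaning: brings back; repairs.
    - Mnemonic: "re-" (again) + "store": to put back in store is to restore.
    - Similar words: restore (to bring back).
    - Pronunciation:
      - Syllables: re + stores /rɪˈstɔːz/, stress on the second syllable
      - Rule: re → /rɪ/, with the short vowel /ɪ/.
      - Rule: stores → /stɔːz/, where "o" is the long vowel /ɔː/ and the final "s" is /z/.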

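For almost any update to the lower layers, $\hat{h}_{l-1}$ will remain a unit Gaussian.
- Fixed expressions: "for" here means "with respect to".
- Sentence analysis: a simple sentence of the form subject + linking verb ("remain") + complement.
- Paraphrase: Whatever update is made to the lower layers, with few exceptions, $\hat{h}_{l-1}$ stays a unit Gaussian.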

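The output $\hat{y}$ may then be learned as a simple linear function $\hat{y} = w_l \hat{h}_{l-1}$.
- Fixed expressions: "be learned as" means "be fitted in the form of".
- Sentence analysis: a simple sentence. "The output $\hat{y}$" is the subject, "may be learned" is the predicate, and "as a simple linear function $\hat{y} = w_l \hat{h}_{l-1}$" is an adverbial of manner.
- Paraphrase: The output $\hat{y}$ can then be learned as the simple linear function $\hat{y} = w_l \hat{h}_{l-1}$.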

Learning in this model is now very simple because the parameters at the lower layers simply do not have an effect in most cases; their output is always renormalized to a unit Gaussian.
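- Fixed expressions: "have an effect" means "make a difference"; "renormalize to" means "normalize again so as to obtain".
- Sentence analysis: "Learning in this model is now very simple" is the main clause; "because the parameters at the lower layers simply do not have an effect in most cases" is an adverbial clause of reason; the part after the semicolon is a coordinate clause.
- Paraphrase: Learning in this model is now very easy, because in most cases the lower-layer parameters make no difference at all; their output is always normalized back to a unit Gaussian.
- Word analysis:
  - renormalized: past participle, formed from "re-" (again) + "normalize" (to standardize) + "-ed"; meaning: normalized again.
    - Mnemonic: "re-" means "again" and "normalize" means "to standardize", so together they mean "to normalize again".
    - Similar words: normalize (to standardize).
    - Pronunciation:
      - Syllables: re + nor + mal + ized /ˌriːˈnɔːməlaɪzd/, primary stress on the second syllable
      - Rule: re → /riː/, with the long vowel /iː/.
      - Rule: nor → /nɔː/, where "o" is the long vowel /ɔː/.
      - Rule: mal → /məl/, where "a" is the reduced vowel /ə/.
      - Rule: ized → /aɪzd/, where "i" is the diphthong /aɪ/ and "-ed" after /z/ is /d/.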
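
The argument of the last few sentences can be checked numerically. The sketch below follows the linear example: a unit-Gaussian input is multiplied by a chain of lower-layer weights, and the result is standardized. The specific weight values, the update factor of 1.3, and the helper names `standardize` and `hidden` are assumptions chosen only for illustration.

```python
import math
import random
import statistics

rng = random.Random(0)


def standardize(values, eps=1e-8):
    """Subtract the batch mean and divide by the batch standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / math.sqrt(sigma ** 2 + eps) for v in values]


def hidden(xs, weights):
    """h_{l-1} = x * w_1 * ... * w_{l-1} for every x in the batch."""
    scale = math.prod(weights)
    return [x * scale for x in xs]


xs = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # x drawn from a unit Gaussian
weights = [1.5, 0.7, 2.0, 1.2]                    # hypothetical lower-layer weights

h = hidden(xs, weights)
h_hat = standardize(h)

# An update to the lower layers changes the scale of h considerably ...
updated = [w * 1.3 for w in weights]
h_after = hidden(xs, updated)
# ... but the standardized values are essentially unchanged.
h_hat_after = standardize(h_after)

std_before = statistics.pstdev(h)
std_after = statistics.pstdev(h_after)
max_shift = max(abs(a - b) for a, b in zip(h_hat, h_hat_after))
```

Here `std_after` is several times `std_before`, while `max_shift` stays negligible, which mirrors the statement that the lower-layer parameters mostly have no effect once their output is renormalized.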

In some corner cases, the lower layers can have an effect.
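- Fixed expressions: "in some corner cases" means "in certain rare, special situations".
- Sentence analysis: a simple sentence with subject, verb, and object.
- Paraphrase: In certain special situations, the lower layers can make a difference.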

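Changing one of the lower layer weights to 0 can make the output become degenerate, and changing the sign of one of the lower weights can flip the relationship between $\hat{h}_{l-1}$ and $y$.
- Fixed expressions: "make ... become ..." means "cause ... to turn into ..."; "flip the relationship" means "reverse the relationship".
- Sentence analysis: a compound sentence in which "and" joins two simple clauses; the subject of each clause is a gerund phrase.
- Paraphrase: Setting one lower-layer weight to 0 can make the output degenerate, and reversing the sign of one lower-layer weight can reverse the relationship between $\hat{h}_{l-1}$ and $y$.
- Word analysis:
  - degenerate: adjective here (after "become"), from Latin "degenerare" (to decline from one's kind); meaning: collapsed to a trivial case; deteriorated. The same spelling is also a verb meaning "to deteriorate".
    - Mnemonic: "de-" (down) + "generate" (to produce): to produce downward, that is, to decline.
    - Similar words: generate (to produce).
    - Pronunciation:
      - Syllables: de + gen + er + ate /dɪˈdʒenərət/ (adjective; the verb is /dɪˈdʒenəreɪt/), stress on the second syllable
      - Rule: de → /dɪ/, with the short vowel /ɪ/.
      - Rule: gen → /dʒen/, where "g" is /dʒ/.
      - Rule: er → /ər/.
      - Rule: ate → /ət/ in the adjective, /eɪt/ in the verb.
  - flip: verb, probably imitative in origin; meaning: to turn over; to reverse.
    - Mnemonic: the sound of "flip" resembles something being turned over quickly.
    - Similar words: flop (to fail; to fall heavily).
    - Pronunciation:
      - Syllables: flip /flɪp/, a single syllable
      - Rule: fl → /fl/.
      - Rule: i → /ɪ/, a short vowel.
      - Rule: p → /p/.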
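
The two corner cases named in this sentence can be reproduced with a small experiment. This is a sketch under assumed values: the weights, the batch size, and the helper names `standardize` and `normalized_hidden` are hypothetical, and a small `eps` is assumed in the denominator so that a zero-variance batch does not divide by zero.

```python
import math
import random
import statistics

rng = random.Random(1)


def standardize(values, eps=1e-8):
    """Standardize a batch; eps keeps the division finite when the variance is 0."""
    mu = statistics.fmean(values)
    var = statistics.pvariance(values)
    return [(v - mu) / math.sqrt(var + eps) for v in values]


def normalized_hidden(xs, weights):
    """Standardized version of x * w_1 * ... * w_{l-1}."""
    scale = math.prod(weights)
    return standardize([x * scale for x in xs])


xs = [rng.gauss(0.0, 1.0) for _ in range(200)]

base = normalized_hidden(xs, [1.5, 0.7, 2.0])
zeroed = normalized_hidden(xs, [1.5, 0.0, 2.0])     # one weight set to 0
flipped = normalized_hidden(xs, [1.5, -0.7, 2.0])   # one weight changes sign

# With a zero weight every activation is 0, so the normalized output carries no information.
is_degenerate = all(v == 0.0 for v in zeroed)
# With a sign change the normalized output is the mirror image of the original.
is_mirrored = all(math.isclose(f, -b, abs_tol=1e-9) for f, b in zip(flipped, base))
```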

These situations are very rare.
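- Sentence analysis: a simple sentence of the form subject + linking verb + complement.
- Paraphrase: Such situations occur very seldom.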

Without normalization, nearly every update would have an extreme effect on the statistics of $h_{l-1}$.
- Fixed expressions: "have an effect on" means "influence".
- Sentence analysis: a sentence in the conditional (hypothetical) mood. "Without normalization" acts as the condition, and the main clause is "nearly every update would have an extreme effect...".
- Paraphrase: If no normalization were applied, almost every update would change the statistics of $h_{l-1}$ drastically.
- Word analysis:
  - normalization: noun, formed from "normal" + "-ization" (noun suffix meaning "the act of making ..."); meaning: standardization; the act of making normal.
    - Mnemonic: turn "normal" into a noun of action: bringing something to a normal state is normalization.
    - Similar words: normalize (to make normal; to standardize).
    - Pronunciation:
      - Syllables: nor + mal + i + za + tion /ˌnɔːrmələˈzeɪʃn/, primary stress on the fourth syllable
      - Rule: nor → /nɔːr/, where "o" is the long vowel /ɔː/ and "r" is pronounced.
      - Rule: mal → /məl/, where "a" is the reduced vowel /ə/.
      - Rule: i → /ə/, a reduced vowel.
      - Rule: za → /zeɪ/, where "a" is the long vowel /eɪ/.
      - Rule: tion → /ʃn/.
  - extreme: adjective, from Latin "extremus" (outermost); meaning: very great; at the far end.
    - Mnemonic: "ex-" (out) + "treme" (think of "the end"): out at the very end, hence extreme.
    - Similar words: extremist (a person with extreme views), exterior (outer).
    - Pronunciation:
      - Syllables: ex + treme /ɪkˈstriːm/, stress on the second syllable
      - Rule: ex → /ɪks/, where "e" is the short vowel /ɪ/ and "x" is /ks/; the /s/ attaches to the following syllable.
      - Rule: treme → /triːm/, where the first "e" is the long vowel /iː/ (as in "tree") and the final "e" is silent.

Batch normalization has thus made this model significantly easier to learn.
- Sentence analysis: a "make sth. + adj." structure. "Batch normalization" is the subject, "has made" is the predicate, "this model" is the object, and "significantly easier to learn" is the object complement.
- Paraphrase: Batch normalization has therefore made this model noticeably easier to learn.
- Word analysis:
  - significantly: adverb, formed from "significant" (important; notable) + "-ly" (adverb suffix); meaning: notably; importantly.
    - Mnemonic: it is "significant" plus the adverb suffix "-ly"; remember "significant" and the adverb follows.
    - Similar words: significance (importance; meaning), insignificantly (in a trivial way).
    - Pronunciation:
      - Syllables: sig + ni + fi + cant + ly /sɪɡˈnɪfɪkəntli/, stress on the second syllable
      - Rule: sig → /sɪɡ/, where "i" is the short vowel /ɪ/.
      - Rule: ni → /nɪ/, a short vowel.
      - Rule: fi → /fɪ/, a short vowel.
      - Rule: cant → /kənt/, where "c" is /k/ and "a" is the reduced vowel /ə/.
      - Rule: ly → /li/.

In this example, the ease of learning of course came at the cost of making the lower layers useless.
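- Fixed expressions: "come at the cost of" means "be obtained by giving up ...".
- Sentence analysis: a simple sentence. "the ease of learning" is the subject and "came" is the predicate.
- Paraphrase: In this example, the easier learning was of course paid for by making the lower layers useless.
- Word analysis:
  - ease: noun, from Old French "aise"; meaning: lack of difficulty; comfort.
    - Mnemonic: think of "easy" with the "y" removed to form the noun.
    - Similar words: cease (to stop), tease (to make fun of).
    - Pronunciation:
      - Syllables: ease /iːz/, a single syllable
      - Rule: ea → /iː/, a long vowel.
      - Rule: se → /z/.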

In our linear example, the lower layers no longer have any harmful effect, but they also no longer have any beneficial effect.
- Fixed expressions: "no longer" means "not any more".
- Sentence analysis: a compound sentence in which "but" joins two coordinate clauses.
- Paraphrase: In our linear example, the lower layers can do no more harm, but they can do no more good either.
- Word analysis:
  - harmful: adjective, formed from "harm" + "-ful" (adjective suffix meaning "full of"); meaning: causing harm.
    - Mnemonic: "harm" plus "-ful" is "full of harm", hence harmful.
    - Similar words: harmless (causing no harm), harmonious (in agreement).
    - Pronunciation:
      - Syllables: harm + ful /ˈhɑːrmfl/, stress on the first syllable
      - Rule: harm → /hɑːrm/, where "a" is the long vowel /ɑː/ and "r" is pronounced.
      - Rule: ful → /fl/.
  - beneficial: adjective, from Latin "beneficium" (a kindness); meaning: helpful; advantageous.
    - Mnemonic: "bene-" (good) + "fic" (to do) + "-ial" (adjective suffix): doing good, hence beneficial.
    - Similar words: benefit (advantage), beneficiary (one who receives a benefit).
    - Pronunciation:
      - Syllables: ben + e + fi + cial /ˌbenɪˈfɪʃl/, primary stress on the third syllable
      - Rule: ben → /ben/, where "e" is the short vowel /e/.
      - Rule: e → /ɪ/, a short vowel.
      - Rule: fi → /fɪ/, a short vowel.
      - Rule: cial → /ʃl/.

This is because we have normalized out the first and second order statistics, which is all that a linear network can influence.
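- Sentence analysis: subject + linking verb + complement. "This" is the subject, "is" is the linking verb, and "because..." introduces the predicative clause; "which..." is a non-restrictive relative clause referring back to "the first and second order statistics".
- Paraphrase: The reason is that we have normalized away the first-order and second-order statistics, and those are the only things a linear network is able to influence.
- Word analysis:
  - normalized: past participle here (after "have"), formed from "normal" + "-ize" (verb suffix meaning "to make ..."); meaning: made normal; standardized.
    - Mnemonic: "normal" plus "-ize" means "to bring to a normal state".
    - Similar words: normalize (to make normal; to standardize).
    - Pronunciation:
      - Syllables: nor + mal + ized /ˈnɔːrməlaɪzd/, stress on the first syllable
      - Rule: nor → /nɔːr/, where "o" is the long vowel /ɔː/ and "r" is pronounced.
      - Rule: mal → /məl/, where "a" is the reduced vowel /ə/.
      - Rule: ized → /aɪzd/, where "i" is the diphthong /aɪ/ and "-ed" after /z/ is /d/.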

In a deep neural network with nonlinear activation functions, the lower layers can perform nonlinear transformations of the data, so they remain useful.
- Sentence analysis: "In a deep neural network with nonlinear activation functions" is an adverbial, "the lower layers can perform..." is the main clause, and "so..." introduces a clause of result.
- Paraphrase: In a deep neural network that uses nonlinear activation functions, the lower layers can transform the data nonlinearly, so they are still useful.
- Word analysis:
  - nonlinear: adjective, formed from "non-" (negative prefix) + "linear"; meaning: not linear.
    - Mnemonic: "non-" negates and "linear" means "in a straight line", so together they mean "not linear".
    - Similar words: linear (in a straight line), bilinear (linear in each of two arguments).
    - Pronunciation:
      - Syllables: non + lin + e + ar /ˌnɑːnˈlɪniər/, primary stress on the second syllable
      - Rule: non → /nɑːn/, where "o" is the long vowel /ɑː/.
      - Rule: lin → /lɪn/, a short vowel.
      - Rule: e → /i/.
      - Rule: ar → /ər/, with "r" pronounced.
  - activation: noun, formed from "activate" + "-ion" (noun suffix); meaning: the act of making active.
    - Mnemonic: drop the "e" of "activate" and add "-ion" to form the noun.
    - Similar words: activate (to make active), deactivation (the act of making inactive).
    - Pronunciation:
      - Syllables: ac + ti + va + tion /ˌæktɪˈveɪʃn/, primary stress on the third syllable
      - Rule: ac → /æk/, where "a" is the short vowel /æ/ and "c" is /k/.
      - Rule: ti → /tɪ/, a short vowel.
      - Rule: va → /veɪ/, where "a" is the long vowel /eɪ/.
      - Rule: tion → /ʃn/.
  - transformations: plural noun, formed from "transform" (to change) + "-ation" (noun suffix); meaning: changes; mappings.
    - Mnemonic: "transform" plus "-ation" gives the noun.
    - Similar words: transform (to change), transformation (the singular form).
    - Pronunciation:
      - Syllables: trans + for + ma + tions /ˌtrænsfərˈmeɪʃnz/, primary stress on the third syllable
      - Rule: trans → /træns/, where "a" is the short vowel /æ/.
      - Rule: for → /fər/, reduced because the syllable is unstressed.
      - Rule: ma → /meɪ/, where "a" is the long vowel /eɪ/.
      - Rule: tions → /ʃnz/, where the final "s" is /z/.

Batch normalization acts to standardize only the mean and variance of each unit in order to stabilize learning, but allows the relationships between units and the nonlinear statistics of a single unit to change.
- Fixed expressions: "act to" means "serve to; function so as to"; "in order to" means "for the purpose of".
- Sentence analysis: one subject ("Batch normalization") with two predicates ("acts..." and "allows...") joined by "but".
- Paraphrase: Batch normalization standardizes only the mean and variance of each unit, in order to stabilize learning, while leaving the relationships between units and the nonlinear statistics of each single unit free to change.
- Word analysis:
  - standardize: verb, formed from "standard" + "-ize" (verb suffix meaning "to make ..."); meaning: to bring to a standard.
    - Mnemonic: "standard" plus "-ize" means "to make something meet a standard".
    - Similar words: standard (a norm), standardization (the act of standardizing).
    - Pronunciation:
      - Syllables: stan + dard + ize /ˈstændərdaɪz/, stress on the first syllable
      - Rule: stan → /stæn/, where "a" is the short vowel /æ/.
      - Rule: dard → /dərd/, where "a" is the reduced vowel /ə/ and "r" is pronounced.
      - Rule: ize → /aɪz/, where "i" is the diphthong /aɪ/.
  - stabilize: verb, formed from "stable" + "-ize" (verb suffix meaning "to make ..."); meaning: to make steady.
    - Mnemonic: "stable" plus "-ize" means "to make stable".
    - Similar words: stable (steady), stability (steadiness).
    - Pronunciation:
      - Syllables: sta + bi + lize /ˈsteɪbəlaɪz/, stress on the first syllable
      - Rule: sta → /steɪ/, where "a" is the long vowel /eɪ/.
      - Rule: bi → /bə/, a reduced vowel.
      - Rule: lize → /laɪz/, where "i" is the diphthong /aɪ/.

Because the final layer of the network is able to learn a linear transformation, we may actually wish to remove all linear relationships between units within a layer.
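- Sentence analysis: a complex sentence. "Because..." introduces an adverbial clause of reason, and "we may actually wish to..." is the main clause.
- Paraphrase: Since the last layer of the network can learn a linear transformation, we might in fact want to remove every linear relationship between the units inside a layer.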

Indeed, this is the approach taken by Desjardins et al. (2015), who provided the inspiration for batch normalization.
- Sentence analysis: a complex sentence. "who..." introduces a relative clause modifying "Desjardins et al. (2015)".
- Paraphrase: In fact, this is the method used by Desjardins et al. (2015), whose work inspired batch normalization.
- Word analysis:
  - approach: noun, from Late Latin "appropiare" (to come nearer); meaning: method; way of dealing with something.
    - Mnemonic: "ap-" (toward) + "proach" (near): the way one comes near to a problem, hence a method.
    - Similar words: reproach (to blame), appropriate (suitable; to take for one's own use).
    - Pronunciation:
      - Syllables: ap + proach /əˈproʊtʃ/, stress on the second syllable
      - Rule: ap → /ə/, where "a" is the reduced vowel /ə/; the doubled "p" is sounded once, in the next syllable.
      - Rule: proach → /proʊtʃ/, where "oa" is the long vowel /oʊ/ and "ch" is /tʃ/.
  - inspiration: noun, from Latin "inspirare" (to breathe in; to inspire); meaning: a source of ideas; stimulus.
    - Mnemonic: "in-" (into) + "spire" (to breathe) + "-ation" (noun suffix): breathing ideas in.
    - Similar words: inspire (to stimulate), aspiration (ambition).
    - Pronunciation:
      - Syllables: in + spi + ra + tion /ˌɪnspəˈreɪʃn/, primary stress on the third syllable
      - Rule: in → /ɪn/, where "i" is the short vowel /ɪ/.
      - Rule: spi → /spə/, a reduced vowel.
      - Rule: ra → /reɪ/, where "a" is the long vowel /eɪ/.
      - Rule: tion → /ʃn/.

Unfortunately, eliminating all linear interactions is much more expensive than standardizing the mean and standard deviation of each individual unit, and so far batch normalization remains the most practical approach.
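- Fixed expressions: "so far" means "up to now".
- Sentence analysis: a compound sentence in which "and" joins two clauses. In the first clause the gerund phrase "eliminating all linear interactions" is the subject, and it is compared (after "than") with the gerund phrase "standardizing the mean and standard deviation of each individual unit"; the second clause states that batch normalization is currently the most practical method.
- Paraphrase: Unfortunately, removing all linear interactions costs far more than standardizing the mean and standard deviation of each unit separately, and up to now batch normalization is still the most practical method.
- Word analysis:
  - eliminating: gerund, from Latin "eliminare" (to turn out of doors); meaning: removing; getting rid of.
    - Mnemonic: "e-" (out) + "limin" (threshold) + "-ate" (verb suffix): to drive over the threshold, hence to remove.
    - Similar words: elimination (removal), limitation (restriction).
    - Pronunciation:
      - Syllables: e + lim + i + nat + ing /ɪˈlɪmɪneɪtɪŋ/, stress on the second syllable.
      - Rule: e → /ɪ/, a short vowel.
      - Rule: lim → /lɪm/, where "i" is a short vowel.
      - Rule: i → /ɪ/, a short vowel.
      - Rule: nat → /neɪt/, where "a" is a long vowel.
      - Rule: ing → /ɪŋ/, ending in the velar nasal.
  - standardizing: gerund, derived from "standard"; meaning: bringing to a standard.
    - Mnemonic: "standard" + "-ize" (to make ...): to make standard.
    - Similar words: standardization (the act of standardizing), normalize (to make normal).
    - Pronunciation:
      - Syllables: stan + dard + iz + ing /ˈstændədaɪzɪŋ/, stress on the first syllable.
      - Rule: stan → /stæn/, where "a" is a short vowel.
      - Rule: dard → /dəd/, where "a" is the reduced vowel /ə/.
      - Rule: iz → /aɪz/, where "i" is a diphthong.
      - Rule: ing → /ɪŋ/, ending in the velar nasal.
  - practical: adjective, from Greek "praktikos" (fit for action); meaning: workable; useful in practice.
    - Mnemonic: "pract" (to do) + "-ical" (adjective suffix): concerned with doing, hence practical.
    - Similar words: practice (the act of doing), practitioner (a person who practices a profession).
    - Pronunciation:
      - Syllables: prac + ti + cal /ˈpræktɪkl/, stress on the first syllable.
      - Rule: prac → /præk/, where "a" is a short vowel.
      - Rule: ti → /tɪ/, a short vowel.
      - Rule: cal → /kl/, sounding like "cl".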
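
The difference between standardizing each unit and removing linear relationships between units can be illustrated with two correlated units. In the sketch below, the two units, their means and scales, and the helper names `standardize` and `correlation` are hypothetical. The code only shows that per-unit standardization leaves the correlation in place; it does not implement the more expensive approach of removing all linear interactions.

```python
import math
import random
import statistics

rng = random.Random(2)


def standardize(values, eps=1e-8):
    """Per-unit standardization: zero mean and unit variance over the batch."""
    mu = statistics.fmean(values)
    var = statistics.pvariance(values)
    return [(v - mu) / math.sqrt(var + eps) for v in values]


def correlation(a, b):
    """Pearson correlation of two equally long lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return cov / norm


# Two hypothetical units in the same layer that are linearly related.
unit_a = [rng.gauss(5.0, 3.0) for _ in range(2000)]
unit_b = [0.8 * a + rng.gauss(0.0, 1.0) for a in unit_a]

corr_before = correlation(unit_a, unit_b)
corr_after = correlation(standardize(unit_a), standardize(unit_b))
```

Both correlations are essentially equal and clearly nonzero: standardization fixes each unit's mean and variance but keeps the linear relationship between the units.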

Normalizing the mean and standard deviation of a unit can reduce the expressive power of the neural network containing that unit.
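- Sentence analysis: a simple sentence. The gerund phrase "Normalizing the mean and standard deviation of a unit" is the subject, "can reduce" is the predicate, and "the expressive power of the neural network containing that unit" is the object.
- Paraphrase: Normalizing a unit's mean and standard deviation can lower the expressive power of the neural network that contains the unit.
- Word analysis:
  - normalizing: gerund, derived from "normal"; meaning: making normal; standardizing.
    - Mnemonic: "normal" + "-ize" (to make ...): to make normal.
    - Similar words: normalization (the act of normalizing), abnormal (not normal).
    - Pronunciation:
      - Syllables: nor + mal + iz + ing /ˈnɔːməlaɪzɪŋ/, stress on the first syllable.
      - Rule: nor → /nɔː/, where "o" is a long vowel.
      - Rule: mal → /məl/, where "a" is a reduced vowel.
      - Rule: iz → /aɪz/, where "i" is a diphthong.
      - Rule: ing → /ɪŋ/, ending in the velar nasal.
  - expressive: adjective, derived from "express"; meaning: able to express; full of expression.
    - Mnemonic: "express" + "-ive" (adjective suffix): having the ability to express.
    - Similar words: expression (the act of expressing), impressive (making a strong impression).
    - Pronunciation:
      - Syllables: ex + pres + sive /ɪkˈspresɪv/, stress on the second syllable.
      - Rule: ex → /ɪks/, where "e" is a short vowel; the /s/ attaches to the following syllable.
      - Rule: pres → /pres/, where "e" is a short vowel.
      - Rule: sive → /sɪv/, sounding like "siv".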

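In order to maintain the expressive power of the network, it is common to replace the batch of hidden unit activations $H$ with $\gamma H' + \beta$ rather than simply the normalized $H'$.
- Fixed expressions: "in order to" means "for the purpose of"; "rather than" means "instead of".
- Sentence analysis: "In order to maintain the expressive power of the network" is an adverbial of purpose; "it" is a formal subject, and the real subject is "to replace the batch of hidden unit activations $H$ with $\gamma H' + \beta$ rather than simply the normalized $H'$".
- Paraphrase: To keep the network's expressive power, the batch of hidden unit activations $H$ is usually replaced by $\gamma H' + \beta$, not just by the normalized $H'$.
- Word analysis:
  - maintain: verb, from Latin "manu tenere" (to hold in the hand); meaning: to keep; to preserve.
    - Mnemonic: "main" (hand) + "tain" (to hold): to hold in the hand, hence to keep.
    - Similar words: maintenance (upkeep), sustain (to support; to keep going).
    - Pronunciation:
      - Syllables: main + tain /meɪnˈteɪn/, stress on the second syllable.
      - Rule: main → /meɪn/, where "ai" is a long vowel.
      - Rule: tain → /teɪn/, where "ai" is a long vowel.
  - replace: verb, formed from "re-" (again; back) + "place" (to put); meaning: to substitute; to take the place of.
    - Mnemonic: "re-" (again) + "place" (to put): to put something in again, hence to substitute.
    - Similar words: replacement (a substitute), displace (to move out of position).
    - Pronunciation:
      - Syllables: re + place /rɪˈpleɪs/, stress on the second syllable.
      - Rule: re → /rɪ/, a short vowel.
      - Rule: place → /pleɪs/, where "a" is a long vowel.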

The variables γ and β are learned parameters that allow the new variable to have any mean and standard deviation.
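- Sentence analysis: a complex sentence. "that allow the new variable to have any mean and standard deviation" is a relative clause modifying the antecedent "parameters".
- Paraphrase: γ and β are parameters obtained by learning, and they let the new variable take any mean and any standard deviation.
- Word analysis:
  - variables: plural noun, from Latin "variabilis" (changeable); meaning: quantities that can vary.
    - Mnemonic: "vari" (to change) + "-able" (capable of) + "-s" (plural): things that can change.
    - Similar words: variation (change), variety (diversity; kind).
    - Pronunciation:
      - Syllables: var + i + a + bles /ˈveəriəblz/, stress on the first syllable.
      - Rule: var → /veər/, where "a" sounds like "air".
      - Rule: i → /i/.
      - Rule: a → /ə/, a reduced vowel.
      - Rule: bles → /blz/.
  - parameters: plural noun, from Greek "para-" (beside) + "metron" (measure); meaning: adjustable quantities that define a system.
    - Mnemonic: "para-" (beside) + "meter" (measure) + "-s" (plural): quantities kept alongside for measuring.
    - Similar words: perimeter (the length around a shape), diameter (the width of a circle).
    - Pronunciation:
      - Syllables: pa + ram + e + ters /pəˈræmɪtəz/, stress on the second syllable.
      - Rule: pa → /pə/, a reduced vowel.
      - Rule: ram → /ræm/, where "a" is a short vowel.
      - Rule: e → /ɪ/, a short vowel.
      - Rule: ters → /təz/.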

At first glance, this may seem useless—why did we set the mean to 0, and then introduce a parameter that allows it to be set back to any arbitrary value β? The answer is that the new parametrization can represent the same family of functions of the input as the old parametrization, but the new parametrization has different learning dynamics.
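- Fixed expressions: "at first glance" means "on a first, quick look".
- Sentence analysis: a question followed by an explanatory answer. In the question, "that allows it to be set back to any arbitrary value β" is a relative clause modifying "parameter"; in the answer, "that the new parametrization can represent the same family of functions of the input as the old parametrization" is a predicative clause.
- Paraphrase: On a first look this may appear pointless: why set the mean to 0 and then add a parameter that lets it return to any value β? The answer is that the new parametrization can express the same family of functions of the input as the old one, but it behaves differently during learning.
- Word analysis:
  - arbitrary: adjective, from Latin "arbitrarius" (depending on the decision of an arbiter); meaning: chosen freely; not fixed by any rule.
    - Mnemonic: "arbitr" (judge) + "-ary" (adjective suffix): decided at will, as by a judge.
    - Similar words: arbitrate (to settle a dispute), arbitration (the settling of a dispute).
    - Pronunciation:
      - Syllables: ar + bi + tra + ry /ˈɑːbɪtrəri/, stress on the first syllable.
      - Rule: ar → /ɑː/, where "a" is a long vowel.
      - Rule: bi → /bɪ/, a short vowel.
      - Rule: tra → /trə/, a reduced vowel.
      - Rule: ry → /ri/.
  - parametrization: noun, formed from "parameter" + "-ize" (to make ...) + "-ation" (noun suffix); meaning: a way of expressing something in terms of parameters.
    - Mnemonic: "parameter" + "-ize" + "-ation": the act or result of expressing with parameters.
    - Similar words: parameterize (to express in terms of parameters), standardization (the act of standardizing).
    - Pronunciation:
      - Syllables: pa + ram + e + tri + za + tion /pəˌræmɪtraɪˈzeɪʃn/, primary stress on the fifth syllable (the second from the end).
      - Rule: pa → /pə/, a reduced vowel.
      - Rule: ram → /ræm/, where "a" is a short vowel.
      - Rule: e → /ɪ/, a short vowel.
      - Rule: tri → /traɪ/, where "i" is a diphthong.
      - Rule: za → /zeɪ/, sounding like "zay".
      - Rule: tion → /ʃn/, sounding like "shn".
  - dynamics: noun, from Greek "dynamis" (power); meaning: the way a system changes over time; the branch of mechanics concerned with forces.
    - Mnemonic: "dynam" (power) + "-ics" (suffix for a field of study): the study of forces.
    - Similar words: dynamic (changing; energetic), kinetics (the study of motion and rates).
    - Pronunciation:
      - Syllables: dy + nam + ics /daɪˈnæmɪks/, stress on the second syllable.
      - Rule: dy → /daɪ/, where "y" is a diphthong.
      - Rule: nam → /næm/, where "a" is a short vowel.
      - Rule: ics → /ɪks/.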
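
The claim that the two parametrizations cover the same family of functions can be made concrete with a one-line worked equation. It assumes, as in the earlier sentences, that µ and σ denote the minibatch mean and standard deviation of each unit, and that H′ denotes the standardized activations; the passage itself does not spell out the formula for H′, so the first equality is an assumption stated here.

```latex
H' = \frac{H - \mu}{\sigma}
\quad\Longrightarrow\quad
\gamma H' + \beta = H
\quad \text{when } \gamma = \sigma \text{ and } \beta = \mu .
```

So the original activations remain reachable by a suitable choice of γ and β, while γ = 1 and β = 0 give the plain normalized H′.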

In the old parametrization, the mean of H was determined by a complicated interaction between the parameters in the layers below H.
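- Sentence analysis: a simple sentence. "the mean of H" is the subject, "was determined by" is the passive predicate, and "a complicated interaction between the parameters in the layers below H" names the agent.
- Paraphrase: Under the old parametrization, the mean of H resulted from an intricate interplay among the parameters of the layers beneath H.
- Word analysis:
  - complicated: adjective, derived from "complicate" (to make complex); meaning: complex; hard to untangle.
    - Mnemonic: "com-" (together) + "plic" (to fold) + "-ate" (verb suffix) + "-ed" (adjective suffix): folded together, hence complicated.
    - Similar words: complicate (to make complex), implicate (to involve).
    - Pronunciation:
      - Syllables: com + pli + cat + ed /ˈkɒmplɪkeɪtɪd/, stress on the first syllable.
      - Rule: com → /kɒm/, where "o" is a short vowel.
      - Rule: pli → /plɪ/, a short vowel.
      - Rule: cat → /keɪt/, where "a" is a long vowel.
      - Rule: ed → /ɪd/.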

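In the new parametrization, the mean of $\gamma H' + \beta$ is determined solely by $\beta$.
- Sentence analysis: a simple sentence. "the mean of $\gamma H' + \beta$" is the subject, "is determined by" is the passive predicate, and "$\beta$" names the agent.
- Paraphrase: Under the new parametrization, the mean of $\gamma H' + \beta$ depends on $\beta$ alone.
- Word analysis:
  - solely: adverb, derived from "sole" (only); meaning: only; exclusively.
    - Mnemonic: "sole" (only) + "-ly" (adverb suffix): in an "only" way, hence exclusively.
    - Similar words: sole (only), solo (performed alone).
    - Pronunciation:
      - Syllables: sole + ly /ˈsəʊlli/, stress on the first syllable.
      - Rule: sole → /səʊl/, where "o" is a long vowel and the final "e" is silent.
      - Rule: ly → /li/.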
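
That the mean of the new variable is set by β alone, whatever the lower layers produce, can be checked with a short experiment. This is a sketch under assumed values: `gamma`, `beta`, the two artificial batches, and the helper name `batch_norm` are hypothetical.

```python
import math
import random
import statistics

rng = random.Random(3)


def batch_norm(values, gamma, beta, eps=1e-8):
    """Return gamma * H' + beta, where H' is the standardized batch."""
    mu = statistics.fmean(values)
    var = statistics.pvariance(values)
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta for v in values]


gamma, beta = 2.0, -0.5

# Two very different outputs of the lower layers (hypothetical scale and shift).
h_small = [0.1 * rng.gauss(0.0, 1.0) + 4.0 for _ in range(1000)]
h_large = [50.0 * rng.gauss(0.0, 1.0) - 7.0 for _ in range(1000)]

out_small = batch_norm(h_small, gamma, beta)
out_large = batch_norm(h_large, gamma, beta)

# In both cases the mean is beta and the standard deviation is gamma.
mean_small, std_small = statistics.fmean(out_small), statistics.pstdev(out_small)
mean_large, std_large = statistics.fmean(out_large), statistics.pstdev(out_large)
```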

The new parametrization is much easier to learn with gradient descent.
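- Fixed expressions: "gradient descent" is the name of an optimization method that repeatedly steps in the direction of the negative gradient.
- Sentence analysis: a simple sentence. "The new parametrization" is the subject, "is" is the linking verb, and "much easier to learn with gradient descent" is the complement.
- Paraphrase: With gradient descent, the new parametrization is far easier to learn.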

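Most neural network layers take the form of $\phi(XW + b)$ where $\phi$ is some fixed nonlinear activation function such as the rectified linear transformation.
- Sentence analysis: a complex sentence. "where $\phi$ is some fixed nonlinear activation function such as the rectified linear transformation" is a relative clause modifying the antecedent "$\phi(XW + b)$".
- Paraphrase: Most neural network layers have the form $\phi(XW + b)$, in which $\phi$ is a fixed nonlinear activation function, for example the rectified linear transformation.
- Word analysis:
  - nonlinear: adjective, formed from "non-" (not) + "linear"; meaning: not linear.
    - Mnemonic: "non-" (not) + "linear": not linear.
    - Similar words: linear (in a straight line), linearly (in a linear way).
    - Pronunciation:
      - Syllables: non + lin + e + ar /ˌnɒnˈlɪniə(r)/, primary stress on the second syllable.
      - Rule: non → /nɒn/, where "o" is a short vowel.
      - Rule: lin → /lɪn/, a short vowel.
      - Rule: e + ar → /iə(r)/, sounding like "ear".
  - activation: noun, derived from "activate"; meaning: the act of making active.
    - Mnemonic: "activate" + "-ion" (noun suffix).
    - Similar words: activate (to make active), active (lively; in operation).
    - Pronunciation:
      - Syllables: ac + ti + va + tion /ˌæktɪˈveɪʃn/, primary stress on the third syllable (the second from the end).
      - Rule: ac → /æk/, where "a" is a short vowel.
      - Rule: ti → /tɪ/, a short vowel.
      - Rule: va → /veɪ/, sounding like "vay".
      - Rule: tion → /ʃn/, sounding like "shn".
  - rectified: adjective, derived from "rectify" (to correct; to convert alternating current to direct current); meaning: corrected; passed through a rectifier.
    - Mnemonic: "rect" (straight) + "-ify" (to make ...) + "-ed" (adjective suffix): made straight.
    - Similar words: rectify (to correct), direct (straight; immediate).
    - Pronunciation:
      - Syllables: rec + ti + fied /ˈrektɪfaɪd/, stress on the first syllable.
      - Rule: rec → /rek/, where "e" is a short vowel.
      - Rule: ti → /tɪ/, a short vowel.
      - Rule: fied → /faɪd/, where "ie" is the diphthong /aɪ/ and the final "d" is /d/.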

It is natural to wonder whether we should apply batch normalization to the input X, or to the transformed value X W + b. Ioffe and Szegedy (2015) recommend the latter.
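- Sentence analysis: in the first sentence, "it" is a formal (dummy) subject; the real subject is "to wonder whether we should apply batch normalization to the input X, or to the transformed value X W + b". The second sentence is a simple sentence.
- Translation (Chinese): 很自然会想知道我们是应该对输入 X 应用批量归一化，还是对变换后的值 X W + b 应用。Ioffe 和 Szegedy（2015）推荐后者。
- Word analysis:
  - recommend: verb, from Latin "re-" (again) + "commendare" (to entrust); meaning: to recommend; to advise.
    - Mnemonic: "re-" (again) + "commend" (to praise) → to praise again → to recommend.
    - Look-alike words: recommend / recommendation, commend (to praise).
    - Pronunciation:
      - Syllables: re + com + mend /ˌrekəˈmend/, primary stress on the last syllable, secondary stress on the first.
      - Rule: re → /re/; the "e" is a short vowel.
      - Rule: com → /kə/; the unstressed "o" reduces to schwa.
      - Rule: mend → /mend/, as in the word "mend".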

More specifically, X W + b should be replaced by a normalized version of X W.
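- Sentence analysis: a simple sentence. "X W + b" is the subject, "should be replaced by" is the passive predicate, and "a normalized version of X W" is the agent of the action.
- Translation (Chinese): 更具体地说，X W + b 应该被 X W 的归一化版本所取代。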

The bias term should be omitted because it becomes redundant with the β parameter applied by the batch normalization reparametrization.
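- Sentence analysis: a complex sentence. "because it becomes redundant with the β parameter applied by the batch normalization reparametrization" is an adverbial clause of reason.
- Translation (Chinese): 偏置项应该被省略，因为它与批量归一化重新参数化所应用的 β 参数变得冗余。
- Word analysis:
  - omitted: past participle of the verb "omit" (used here in the passive "should be omitted"), from Latin "omittere" (to let go, to leave out); meaning: left out; skipped.
    - Mnemonic: "o-" (away) + "mit" (send) + "-ed" (participle suffix) → sent away and left alone → left out.
    - Look-alike words: omit / omission, commit (to commit a crime; to pledge).
    - Pronunciation:
      - Syllables: o + mit + ted /əˈmɪtɪd/, stress on the second syllable.
      - Rule: o → /ə/; an unstressed schwa.
      - Rule: mit → /mɪt/; the "i" is a short vowel.
      - Rule: ted → /ɪd/; the "t" is doubled in spelling before "-ed".
  - redundant: adjective, from Latin "redundare" (to overflow); meaning: superfluous; redundant.
    - Mnemonic: "red-" (again) + "und" (wave) + "-ant" (adjective suffix) → waves spilling over again → more than is needed.
    - Look-alike words: redundant / redundancy, abundant (plentiful).
    - Pronunciation:
      - Syllables: re + dun + dant /rɪˈdʌndənt/, stress on the second syllable.
      - Rule: re → /rɪ/; a short vowel.
      - Rule: dun → /dʌn/; the "u" is a short vowel.
      - Rule: dant → /dənt/; the "a" reduces to schwa.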
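
The redundancy of the bias can be seen directly: adding a constant b to XW shifts the minibatch mean by the same constant, so subtracting the mean cancels it. Below is a minimal sketch for a single unit, using only the standard library. The helper name `standardize`, the value of `eps`, and all numbers are assumptions made for illustration.

```python
import random
import statistics

random.seed(1)

def standardize(column, eps=1e-5):
    """Normalize one unit's values over the minibatch to zero mean, unit variance."""
    mu = statistics.fmean(column)
    var = statistics.pvariance(column, mu)
    return [(v - mu) / (var + eps) ** 0.5 for v in column]

# Hypothetical pre-activations X W for one unit over a minibatch of 8 examples.
xw = [random.gauss(0.0, 1.0) for _ in range(8)]
b = 2.5  # a bias that a plain layer would add before the nonlinearity

with_bias = standardize([v + b for v in xw])
without_bias = standardize(xw)

# The two normalized results agree up to floating-point rounding.
max_gap = max(abs(p - q) for p, q in zip(with_bias, without_bias))
print(max_gap)

# The learned shift beta now plays the role that b used to play.
gamma, beta = 0.8, 0.3
bn_out = [gamma * v + beta for v in without_bias]
```

Because `b` has no effect on the normalized values, its gradient would be zero, and any offset the layer needs has to come from `beta` instead.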

The input to a layer is usually the output of a nonlinear activation function such as the rectified linear function in a previous layer.
- Sentence analysis: a simple sentence. "The input to a layer" is the subject, "is" is the linking verb, and "the output of a nonlinear activation function such as the rectified linear function in a previous layer" is the predicative.
- Translation (Chinese): 一层的输入通常是前一层中某个非线性激活函数（如整流线性函数）的输出。

The statistics of the input are thus more non-Gaussian and less amenable to standardization by linear operations.
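- Fixed collocation: "be amenable to" means "able to be acted on by; responsive to; open to".
- Sentence analysis: a subject + linking verb + predicative sentence. "The statistics of the input" is the subject, "are" is the linking verb, and "more non-Gaussian and less amenable to standardization by linear operations" is the predicative. The sentence describes the statistical character of the input.
- Translation (Chinese): 因此，输入的统计量更加非高斯，也更不适合通过线性运算进行标准化。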
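- Word analysis:
  - non-Gaussian: adjective; "non-" is a negating prefix and "Gaussian" comes from the name of the mathematician Gauss; meaning: not Gaussian, that is, not following a normal distribution.
    - Mnemonic: "non-" negates, and "Gaussian" recalls the name Gauss; together: not Gaussian.
    - Look-alike words: Gaussian.
    - Pronunciation:
      - Syllables: non + Gaus + si + an /ˌnɒnˈɡaʊsiən/, primary stress on the second syllable.
      - Rule: non → /nɒn/; the "o" is the short vowel /ɒ/.
      - Rule: Gaus → /ɡaʊs/; "au" is pronounced /aʊ/.
      - Rule: si → /si/.
      - Rule: an → /ən/; the "a" reduces to schwa.
  - amenable: adjective, from Anglo-French "amener" (to lead, to bring to); meaning: able to be acted on by; responsive to; open to.
    - Mnemonic: picture "a" (one) + "men" + "able" → something a man is able to accept → open to.
    - Look-alike words: amenity (a useful facility); similar in spelling only, with a different origin.
    - Pronunciation:
      - Syllables: a + me + na + ble /əˈmiːnəbl/, stress on the second syllable.
      - Rule: a → /ə/; an unstressed schwa.
      - Rule: me → /miː/; a long /iː/.
      - Rule: na → /nə/; the "a" reduces to schwa.
      - Rule: ble → /bl/.
  - standardization: noun, from "standard" + "-ization" (noun suffix meaning "the process of making"); meaning: the process of making something standard.
    - Mnemonic: "standard" + "-ization" → standardization.
    - Look-alike words: standardize (to make standard).
    - Pronunciation:
      - Syllables: stan + dar + di + za + tion /ˌstændədaɪˈzeɪʃn/, primary stress on the fourth syllable.
      - Rule: stan → /stæn/; the "a" is the short vowel /æ/.
      - Rule: dar → /də/; the unstressed "ar" reduces to schwa in this British transcription.
      - Rule: di → /daɪ/; the "i" is pronounced /aɪ/.
      - Rule: za → /zeɪ/; the "a" is pronounced /eɪ/.
      - Rule: tion → /ʃn/.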
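
The sentence above can be illustrated with a quick simulation. This is a rough sketch under the assumption that the pre-activations are unit Gaussian and the previous layer uses the rectified linear function. Skewness serves here as one simple measure of non-Gaussianity, and the helper name `skewness` is hypothetical.

```python
import random
import statistics

random.seed(0)

def skewness(values):
    """Third standardized moment; close to 0 for a symmetric (e.g. Gaussian) sample."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mu) / sd) ** 3 for v in values])

# Gaussian pre-activations and the rectified output that the next layer sees.
z = [random.gauss(0.0, 1.0) for _ in range(20000)]
h = [max(0.0, v) for v in z]

skew_z = skewness(z)
skew_h = skewness(h)
frac_zero = sum(1 for v in h if v == 0.0) / len(h)

# Standardizing h is a linear operation: it fixes the mean and variance
# but leaves the shape of the distribution (here, its skewness) unchanged.
mu_h, sd_h = statistics.fmean(h), statistics.pstdev(h)
h_std = [(v - mu_h) / sd_h for v in h]
skew_h_std = skewness(h_std)

print(skew_z, skew_h, skew_h_std, frac_zero)
```

About half of the rectified values are exactly zero and the rest form a long right tail, so the sample stays strongly skewed even after its mean and variance have been standardized.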
