Skip Deep LSTM (a trick for LSTMs)
Network depth is critical to model performance. In LSTM applications, however, stacking more than three layers usually makes training difficult because of vanishing or exploding gradients.
Drawing on the idea behind GNMT (Google's neural machine translation system), we therefore build a deep LSTM with dense skip connections (Skip Deep LSTM). Following GNMT, the first layer is a bidirectional LSTM (BiLSTM); a depth of 5-7 layers works best in our experiments.
On image captioning the training loss is lower than with a conventionally stacked LSTM, and on time-series forecasting tasks (e.g., power-generation forecasting) the design likewise outperforms a standard LSTM.
The core code is as follows (implemented with TF2 / tf.keras):
# Deep LSTM with dense skip connections. The skip pattern and number of layers can be
# tuned per experiment; 5-7 layers currently work best.
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Permute, add, multiply

# `se2` is the output of the preceding embedding/encoder layer in the pipeline
# (shape: batch x timesteps x features).
bi1 = Bidirectional(LSTM(64, return_sequences=True))(se2)  # 64 units per direction -> 128-dim output
bi2 = LSTM(128, return_sequences=True)(bi1)
bi3 = LSTM(128, return_sequences=True)(bi2)
res1 = add([bi1, bi3])            # skip connection over layers 1 and 3
bi4 = LSTM(128, return_sequences=True)(res1)
res2 = add([bi2, bi4, bi1])       # dense skip over layers 1, 2, 4
bi5 = LSTM(128, return_sequences=True)(res2)
res3 = add([bi3, bi5, bi2, bi1])  # dense skip over layers 1, 2, 3, 5
bi6 = LSTM(128, return_sequences=True)(res3)
res4 = add([bi4, bi6, bi3, bi1])  # dense skip over layers 1, 3, 4, 6
se3 = LSTM(256)(res4)             # final LSTM collapses the sequence to a 256-dim vector
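All tensors passed to add() must have the same shape, which is why the first layer uses 64 units per direction: its bidirectional output is 128-dimensional and lines up with the LSTM(128) layers. A quick shape check (a standalone sketch with an assumed dummy input, not part of the original code):
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Bidirectional

# Assumed dummy input: batch of 2, 36 time steps, 8 features.
x = tf.random.normal((2, 36, 8))
print(Bidirectional(LSTM(64, return_sequences=True))(x).shape)  # (2, 36, 128)
print(LSTM(128, return_sequences=True)(x).shape)                # (2, 36, 128) -> add() is valid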
# Fused LSTM: a BiLSTM branch and a unidirectional LSTM branch with matching widths
# (16 units per direction = 32-dim, same as LSTM(32)), merged by element-wise add().
# `add1` and `se2` are outputs of earlier layers in the pipeline.
bi1_1 = Bidirectional(LSTM(16, return_sequences=True))(add1)
bi1_2 = Bidirectional(LSTM(16, return_sequences=True))(bi1_1)
bi1_3 = Bidirectional(LSTM(16, return_sequences=True))(bi1_2)
# attention_mul = attention_3d_block(bi1)   # optional time-step attention (see below)
bi2_1 = LSTM(32, return_sequences=True)(se2)
bi2_2 = LSTM(32, return_sequences=True)(bi2_1)
bi2_3 = LSTM(32, return_sequences=True)(bi2_2)
res1 = add([bi1_1, bi2_1, bi1_3, bi2_3])    # cross-branch skip: layers 1 and 3 of both branches
bi1_4 = Bidirectional(LSTM(16, return_sequences=True))(res1)
bi2_4 = LSTM(32, return_sequences=True)(res1)
res2 = add([bi1_1, bi2_1, bi1_2, bi2_2])    # cross-branch skip: layers 1 and 2 of both branches
bi1_5 = Bidirectional(LSTM(16, return_sequences=True))(res2)
bi2_5 = LSTM(32, return_sequences=True)(res2)
res3 = add([bi1_1, bi2_1, bi1_2, bi2_2, bi1_3, bi2_3])  # merges layers 1-3 of both branches
# Note: bi1_4/bi2_4 and bi1_5/bi2_5 are built but not merged into res3 in this variant.
# se3 = LSTM(256)(res3)                     # unidirectional alternative for the final layer
se3 = Bidirectional(LSTM(128))(res3)
decoder2 = Dense(256, activation='relu')(se3)
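The snippet above leaves `add1` and `se2` undefined; in the original pipeline they come from earlier layers. A minimal sketch of one way to supply them and close the fused model, reduced to one recurrent layer per branch for brevity (the two-input wiring, shapes, and regression head are assumptions for illustration, not part of the original code):
from tensorflow.keras.layers import Input, LSTM, Bidirectional, Dense, add
from tensorflow.keras.models import Model

# Hypothetical inputs with matching time-step counts so the two branches can be added.
in_a = Input(shape=(36, 8)); add1 = in_a    # feeds the BiLSTM branch
in_b = Input(shape=(36, 8)); se2 = in_b     # feeds the unidirectional branch
b1 = Bidirectional(LSTM(16, return_sequences=True))(add1)  # 32-dim output
b2 = LSTM(32, return_sequences=True)(se2)                  # 32-dim output
merged = add([b1, b2])
out = Dense(1)(Bidirectional(LSTM(128))(merged))           # assumed single-value forecast head
model = Model([in_a, in_b], out)
model.compile(optimizer='adam', loss='mse')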
Embedding a time-step attention mechanism
# Attention over time steps.
def attention_3d_block(inputs):
    # inputs: (batch, timesteps, features)
    a = Permute((2, 1))(inputs)  # swap to (batch, features, timesteps)
    # The number of Dense units must equal the time-step dimension: the maximum caption
    # length in image captioning, or the number of input time steps in time-series forecasting.
    a = Dense(36, activation='tanh')(a)
    a_probs = Permute((2, 1), name='attention_vec')(a)  # back to (batch, timesteps, features)
    # multiply() replaces the old Keras 1 merge(..., mode='mul') API.
    output_attention_mul = multiply([inputs, a_probs], name='attention_mul')
    return output_attention_mul
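A quick shape check of the block (a standalone sketch with an assumed dummy tensor; the 36-step length must match the Dense(36) inside the block):
import tensorflow as tf
x = tf.random.normal((2, 36, 128))   # (batch, timesteps, features); timesteps must equal 36 here
y = attention_3d_block(x)
print(y.shape)                       # (2, 36, 128): element-wise attention weights applied to the inputs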
# For most problems, inserting time-step attention between the first and second LSTM layers
# works best; try several positions experimentally and keep the one that performs best.
bi1 = Bidirectional(LSTM(64, return_sequences=True))(se2)
attention_mul = attention_3d_block(bi1)      # attention applied to the first layer's output
bi2 = LSTM(128, return_sequences=True)(attention_mul)
bi3 = LSTM(128, return_sequences=True)(bi2)
res1 = add([bi1, bi3])
bi4 = LSTM(128, return_sequences=True)(res1)
res2 = add([bi2, bi4, bi1])
bi5 = LSTM(128, return_sequences=True)(res2)
res3 = add([bi3, bi5, bi2, bi1])
bi6 = LSTM(128, return_sequences=True)(res3)
res4 = add([bi4, bi6, bi3, bi1])
se3 = LSTM(256)(res4)
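Finally, a compact end-to-end sketch that wires an input layer, the time-step attention block, a shortened skip stack (two skip connections instead of four, for brevity), and a regression head into a trainable model. The shapes, the single-value target, and the synthetic data are assumptions for illustration only:
import numpy as np
from tensorflow.keras.layers import Input, LSTM, Bidirectional, Dense, add
from tensorflow.keras.models import Model

inp = Input(shape=(36, 8))                     # assumed: 36 time steps, 8 features
se2 = inp
bi1 = Bidirectional(LSTM(64, return_sequences=True))(se2)
att = attention_3d_block(bi1)                  # time-step attention between layers 1 and 2
bi2 = LSTM(128, return_sequences=True)(att)
bi3 = LSTM(128, return_sequences=True)(bi2)
res1 = add([bi1, bi3])
bi4 = LSTM(128, return_sequences=True)(res1)
res2 = add([bi2, bi4, bi1])
out = Dense(1)(LSTM(256)(res2))                # assumed one-step-ahead forecast head
model = Model(inp, out)
model.compile(optimizer='adam', loss='mse')
model.fit(np.random.rand(64, 36, 8), np.random.rand(64, 1), epochs=1, verbose=0)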
 
 
