This tutorial trains a sequence-to-sequence (seq2seq) model for Spanish-to-English translation. It is an advanced example that assumes some familiarity with sequence-to-sequence models.
After training the model, entering a Spanish sentence returns the corresponding English translation; for example, "¿todavia estan en casa?" returns "are you still at home?"
The translation quality is reasonable for a toy example, but the generated attention plot is perhaps more interesting. It shows which parts of the input sentence the model attends to while translating:
Note: this example takes approximately 10 minutes to run on a single P100 GPU.
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import unicodedata
import re
import numpy as np
import os
import io
import time
We'll use a language dataset provided by http://www.manythings.org/anki/. This dataset contains language translation pairs in the following format:
May I borrow this book? ¿Puedo tomar prestado este libro?
There are a variety of languages available, but we'll use the English-Spanish dataset. For convenience, we host a copy of this dataset on Google Cloud, but you can also download your own copy. After downloading the dataset, here are the steps we'll take to prepare the data:
1. Add a start and end token to each sentence.
2. Clean the sentences by removing special characters.
3. Create a word index and reverse word index (dictionaries mapping from word to id and id to word).
4. Pad each sentence to a maximum length.
# Download the file
path_to_zip = tf.keras.utils.get_file(
    'spa-eng.zip', origin='http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip',
    extract=True)

path_to_file = os.path.dirname(path_to_zip)+"/spa-eng/spa.txt"
Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
2646016/2638744 [==============================] - 0s 0us/step
2654208/2638744 [==============================] - 0s 0us/step
# Converts the unicode file to ascii
def unicode_to_ascii(s):
  return ''.join(c for c in unicodedata.normalize('NFD', s)
                 if unicodedata.category(c) != 'Mn')


def preprocess_sentence(w):
  w = unicode_to_ascii(w.lower().strip())

  # creating a space between a word and the punctuation following it
  # eg: "he is a boy." => "he is a boy ."
  # Reference: https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
  w = re.sub(r"([?.!,¿])", r" \1 ", w)
  w = re.sub(r'[" "]+', " ", w)

  # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
  w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

  w = w.rstrip().strip()

  # adding a start and an end token to the sentence
  # so that the model knows when to start and stop predicting.
  w = '<start> ' + w + ' <end>'
  return w
en_sentence = u"May I borrow this book?"
sp_sentence = u"¿Puedo tomar prestado este libro?"
print(preprocess_sentence(en_sentence))
print(preprocess_sentence(sp_sentence).encode('utf-8'))
<start> may i borrow this book ? <end>
<start> ¿ puedo tomar prestado este libro ? <end>
# 1. Remove the accents
# 2. Clean the sentences
# 3. Return word pairs in the format: [ENGLISH, SPANISH]
def create_dataset(path, num_examples):
  lines = io.open(path, encoding='UTF-8').read().strip().split('\n')

  word_pairs = [[preprocess_sentence(w) for w in l.split('\t')] for l in lines[:num_examples]]

  return zip(*word_pairs)
en, sp = create_dataset(path_to_file, None)
print(en[-1])
print(sp[-1])
<start> if you want to sound like a native speaker , you must be willing to practice saying the same sentence over and over in the same way that banjo players practice the same phrase over and over until they can play it correctly and at the desired tempo . <end>
<start> si quieres sonar como un hablante nativo , debes estar dispuesto a practicar diciendo la misma frase una y otra vez de la misma manera en que un musico de banjo practica el mismo fraseo una y otra vez hasta que lo puedan tocar correctamente y en el tiempo esperado . <end>
def max_length(tensor):
  return max(len(t) for t in tensor)
def tokenize(lang):
  lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(
      filters='')
  lang_tokenizer.fit_on_texts(lang)

  tensor = lang_tokenizer.texts_to_sequences(lang)

  tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,
                                                         padding='post')

  return tensor, lang_tokenizer
def load_dataset(path, num_examples=None):
  # creating cleaned input, output pairs
  targ_lang, inp_lang = create_dataset(path, num_examples)

  input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
  target_tensor, targ_lang_tokenizer = tokenize(targ_lang)

  return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer
Training on the complete dataset of more than 100,000 sentences takes a long time. To train faster, we can limit the dataset to 30,000 sentences (of course, translation quality degrades with less data):
# Try experimenting with the size of that dataset
num_examples = 30000
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(path_to_file, num_examples)

# Calculate max_length of the target tensors
max_length_targ, max_length_inp = max_length(target_tensor), max_length(input_tensor)
# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

# Show length
len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val)
(24000, 24000, 6000, 6000)
def convert(lang, tensor):
  for t in tensor:
    if t != 0:
      print("%d ----> %s" % (t, lang.index_word[t]))
print("Input Language; index to word mapping")
convert(inp_lang, input_tensor_train[0])
print()
print("Target Language; index to word mapping")
convert(targ_lang, target_tensor_train[0])
Input Language; index to word mapping
1 ----> <start>
8 ----> no
38 ----> puedo
804 ----> confiar
20 ----> en
1000 ----> vosotras
3 ----> .
2 ----> <end>

Target Language; index to word mapping
1 ----> <start>
4 ----> i
25 ----> can
12 ----> t
345 ----> trust
6 ----> you
3 ----> .
2 ----> <end>
Create a tf.data dataset
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
example_input_batch, example_target_batch = next(iter(dataset))
example_input_batch.shape, example_target_batch.shape
(TensorShape([64, 16]), TensorShape([64, 11]))
We'll implement an encoder-decoder model with attention, which you can read about in the TensorFlow Neural Machine Translation (seq2seq) tutorial. This example uses a more recent set of APIs and implements the attention equations from that seq2seq tutorial. The diagram below shows that each input word is assigned a weight by the attention mechanism, which the decoder then uses to predict the next word in the sentence.
The input is passed through the encoder model, which gives us the encoder output of shape (batch_size, max_length, hidden_size) and the encoder hidden state of shape (batch_size, hidden_size).
Here are the equations that are implemented:
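The equation images from the original tutorial are not reproduced in this text; as a sketch in standard notation (where W_1, W_2 and v_a are the learned weights of the FC layers in the pseudo-code below), the Bahdanau-style additive attention computed by the code in this section can be written as:

$$\mathrm{score}_{ts} = v_a^{\top} \tanh\left(W_1 \bar{h}_s + W_2 h_t\right)$$
$$\alpha_{ts} = \frac{\exp(\mathrm{score}_{ts})}{\sum_{s'} \exp(\mathrm{score}_{ts'})}$$
$$c_t = \sum_{s} \alpha_{ts}\, \bar{h}_s$$

Here $\bar{h}_s$ is the encoder output at source position $s$ (EO below), $h_t$ is the decoder hidden state at step $t$ (H below), $\alpha_{ts}$ are the attention weights, and $c_t$ is the context vector.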
This tutorial uses Bahdanau attention. Let's decide on notation before writing the simplified form:
FC = fully connected (dense) layer
EO = encoder output
H = decoder hidden state
X = input to the decoder
And the pseudo-code:
score = FC(tanh(FC(EO) + FC(H)))
attention weights = softmax(score, axis=1). Softmax is applied on the last axis by default, but here we want to apply it on the first axis, since the shape of score is (batch_size, max_length, 1) and max_length is the length of our input. Since we are trying to assign a weight to each input word, softmax should be applied on that axis (see the short sketch after this list).
context vector = sum(attention weights * EO, axis=1). Same reason as above for choosing axis 1.
embedding output = the input to the decoder X passed through an embedding layer.
merged vector = concat(embedding output, context vector). This merged vector is then given to the GRU.
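As a minimal, self-contained sketch of that axis choice (toy shapes only, not the model built below): with a score tensor of shape (batch_size, max_length, 1), softmax over axis=1 normalizes across the input positions of each sentence, whereas the default last axis has size 1 and would just produce all ones.

import tensorflow as tf

# Toy score tensor: batch of 2 sentences, 5 input positions, 1 score per position.
score = tf.random.normal((2, 5, 1))

# Softmax over the max_length axis: one weight per input word, summing to 1 per sentence.
attention_weights = tf.nn.softmax(score, axis=1)
print(tf.reduce_sum(attention_weights, axis=1))  # ~[[1.], [1.]]

# Softmax over the default (last) axis of size 1 makes every weight exactly 1, which is useless here.
print(tf.nn.softmax(score, axis=-1)[0, :3, 0])   # [1., 1., 1.]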
The shapes of all the vectors at each step are specified in the comments in the code:
class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')

  def call(self, x, hidden):
    x = self.embedding(x)
    output, state = self.gru(x, initial_state=hidden)
    return output, state

  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units))
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)

# sample input
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
print('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print('Encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))
Encoder output shape: (batch size, sequence length, units) (64, 16, 1024)
Encoder Hidden state shape: (batch size, units) (64, 1024)
class BahdanauAttention(tf.keras.Model):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    # query (hidden) shape == (batch_size, hidden size)
    # hidden_with_time_axis shape == (batch_size, 1, hidden size)
    # we are doing this to perform addition to calculate the score
    hidden_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because we apply self.V to the tanh output,
    # whose shape is (batch_size, max_length, units)
    score = self.V(tf.nn.tanh(
        self.W1(values) + self.W2(hidden_with_time_axis)))

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights
attention_layer = BahdanauAttention(10)
attention_result, attention_weights = attention_layer(sample_hidden, sample_output)

print("Attention result shape: (batch size, units) {}".format(attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))
Attention result shape: (batch size, units) (64, 1024)
Attention weights shape: (batch_size, sequence_length, 1) (64, 16, 1)
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

sample_decoder_output, _, _ = decoder(tf.random.uniform((64, 1)),
                                      sample_hidden, sample_output)

print('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))
Decoder output shape: (batch_size, vocab size) (64, 4935)
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
  # mask out the loss contributed by padding tokens (id 0)
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_mean(loss_)
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)
@tf.function
def train_step(inp, targ, enc_hidden):
  loss = 0

  with tf.GradientTape() as tape:
    enc_output, enc_hidden = encoder(inp, enc_hidden)

    dec_hidden = enc_hidden

    dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)

    # Teacher forcing - feeding the target as the next input
    for t in range(1, targ.shape[1]):
      # passing enc_output to the decoder
      predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)

      loss += loss_function(targ[:, t], predictions)

      # using teacher forcing
      dec_input = tf.expand_dims(targ[:, t], 1)

  batch_loss = (loss / int(targ.shape[1]))

  variables = encoder.trainable_variables + decoder.trainable_variables

  gradients = tape.gradient(loss, variables)

  optimizer.apply_gradients(zip(gradients, variables))

  return batch_loss
EPOCHS = 10

for epoch in range(EPOCHS):
  start = time.time()

  enc_hidden = encoder.initialize_hidden_state()
  total_loss = 0

  for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
    batch_loss = train_step(inp, targ, enc_hidden)
    total_loss += batch_loss

    if batch % 100 == 0:
      print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                   batch,
                                                   batch_loss.numpy()))
  # saving (checkpoint) the model every 2 epochs
  if (epoch + 1) % 2 == 0:
    checkpoint.save(file_prefix=checkpoint_prefix)

  print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                      total_loss / steps_per_epoch))
  print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
......
Epoch 10 Batch 0 Loss 0.1219
Epoch 10 Batch 100 Loss 0.1374
Epoch 10 Batch 200 Loss 0.1084
Epoch 10 Batch 300 Loss 0.0994
Epoch 10 Loss 0.1088
Time taken for 1 epoch 29.2324090004 sec
The evaluate function below is similar to the training loop, except we don't use teacher forcing here: the input to the decoder at each time step is its previous prediction, along with the hidden state and the encoder output. We stop predicting when the model predicts the end token, and we store the attention weights for every time step.
Note: the encoder output is calculated only once for one input.
def evaluate(sentence):
  attention_plot = np.zeros((max_length_targ, max_length_inp))

  sentence = preprocess_sentence(sentence)

  inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
  inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                         maxlen=max_length_inp,
                                                         padding='post')
  inputs = tf.convert_to_tensor(inputs)

  result = ''

  hidden = [tf.zeros((1, units))]
  enc_out, enc_hidden = encoder(inputs, hidden)

  dec_hidden = enc_hidden
  dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)

  for t in range(max_length_targ):
    predictions, dec_hidden, attention_weights = decoder(dec_input,
                                                         dec_hidden,
                                                         enc_out)

    # storing the attention weights to plot later on
    attention_weights = tf.reshape(attention_weights, (-1, ))
    attention_plot[t] = attention_weights.numpy()

    predicted_id = tf.argmax(predictions[0]).numpy()

    result += targ_lang.index_word[predicted_id] + ' '

    if targ_lang.index_word[predicted_id] == '<end>':
      return result, sentence, attention_plot

    # the predicted ID is fed back into the model
    dec_input = tf.expand_dims([predicted_id], 0)

  return result, sentence, attention_plot
# function for plotting the attention weights
def plot_attention(attention, sentence, predicted_sentence):
  fig = plt.figure(figsize=(10, 10))
  ax = fig.add_subplot(1, 1, 1)
  ax.matshow(attention, cmap='viridis')

  fontdict = {'fontsize': 14}

  ax.set_xticklabels([''] + sentence, fontdict=fontdict, rotation=90)
  ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)

  plt.show()
def translate(sentence):
  result, sentence, attention_plot = evaluate(sentence)

  print('Input: %s' % (sentence))
  print('Predicted translation: {}'.format(result))

  attention_plot = attention_plot[:len(result.split(' ')), :len(sentence.split(' '))]
  plot_attention(attention_plot, sentence.split(' '), result.split(' '))
# restoring the latest checkpoint in checkpoint_dir
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))
translate(u'hace mucho frio aqui.')
Input: <start> hace mucho frio aqui . <end>
Predicted translation: it s very cold here . <end>
translate(u'esta es mi vida.')
Input: <start> esta es mi vida . <end>
Predicted translation: this is my life . <end>
translate(u'¿todavia estan en casa?')
Input: <start> ¿ todavia estan en casa ? <end>
Predicted translation: are you still at home ? <end>
# wrong translation
translate(u'trata de averiguarlo.')
Input: <start> trata de averiguarlo . <end>
Predicted translation: try to figure it out . <end>
Latest version: https://www.mashangxue123.com/tensorflow/tf2-tutorials-text-nmt_with_attention.html English version: https://tensorflow.google.cn/beta/tutorials/text/nmt_with_attention Translation suggestions (PR): https://github.com/mashangxue/tensorflow2-zh/edit/master/r2/tutorials/text/nmt_with_attention.md