thumt package¶
Submodules¶
thumt.binmt module¶
-
class
thumt.binmt.
BiRNNsearch
(config)¶ Bases:
thumt.nmt.model
The bidirectional RNNsearch model used for semi-supervised training.
-
build
()¶ Building the computational graph.
-
get_addition_grads
(cost0, cost1, now_s)¶ Updating the total cost of bidirectional NMT.
-
get_inputs_batch
(xp, yp, xm, ym)¶ Getting a batch for semi-supervised training.
Parameters: - xp (numpy array) – the indexed source sentences in parallel corpus
- yp (numpy array) – the indexed target sentences in parallel corpus
- xm (numpy array) – the indexed source sentences in monolingual corpus
- ym (numpy array) – the indexed target sentences in monolingual corpus
-
is_valid
(sents, eos, unk)¶ Validating the sentences.
Parameters: - sents (theano variable) – the indexed sentences
- eos (int) – the index of end-of-sentence symbol
- unk (int) – the index of unknown word symbol
-
sample
(x, length, n_samples=1)¶ Sampling with source-to-target network.
Parameters: - x (numpy array) – the indexed source sentence
- length (int) – the length limit of samples
- n_samples (int) – number of samples
Returns: a numpy array, the indexed sample results
-
sample_inv
(x, length, n_samples=1)¶ Sampling with target-to-source network.
Parameters: - x (numpy array) – the indexed target sentence
- length (int) – the length limit of samples
- n_samples (int) – number of samples
Returns: a numpy array, the indexed sample results
-
translate
(x, beam_size=10)¶ Beam search with source-to-target network.
Parameters: - x (numpy array) – the indexed source sentence
- beam_size (int) – beam size
Returns: a numpy array, the indexed translation result
-
translate_inv
(x, beam_size=10)¶ Beam search with target-to-source network.
Parameters: - x (numpy array) – the indexed target sentence
- beam_size (int) – beam size
Returns: a numpy array, the indexed translation result
-
thumt.data module¶
-
class
thumt.data.
DataCollection
(config, train=True)¶ Bases:
object
The data manager. It also keeps track of the training status.
Parameters: - config (dict) – the configuration
- train (bool) – Only set to true on training. If true, the vocabulary and corpus will be loaded, and the training status will be recorded.
-
encode_vocab
(encoding='utf-8')¶ Change the encoding of the vocabulary.
-
index_word_target
(index)¶ Get the target word given index.
Parameters: index (int) – the word index Returns: string, the corresponding word.
-
last_improved
(last=False)¶ Parameters: last (bool) – if True, a result equal to the previous best counts as an improvement; if False, it does not. Returns: int, the number of iterations passed since the latest improvement
-
load_data
()¶ Load training corpus.
-
load_data_mono
()¶ Load monolingual training corpus. Only used in semi-supervised training.
-
load_status
(path)¶ Load the training status from file.
Parameters: path (string) – the path to a file
-
load_vocab
()¶ Load the vocabulary.
-
next
()¶ Get the next batch of training corpus
Returns: x, y are 2-D numpy arrays, each row contains an indexed source/target sentence
-
next_mono
()¶ Get the next batch of monolingual training corpus. Only used in semi-supervised training.
Returns: x, y are 2-D numpy arrays, each row contains an indexed source/target sentence
-
print_sentence
(sentence, vocab, index_eos)¶ Get the text form of a sentence represented by an index vector.
Parameters: - sentence (numpy array) – indexed sentence. size:(length, 1)
- vocab (list) – vocabulary
- index_eos (int) – the index of the end-of-sentence symbol
Returns: string, the text form of the sentence
-
print_source
(sentence)¶ Print a source sentence represented by an index vector.
Parameters: sentence (numpy array) – indexed sentence. size:(length, 1) Returns: string, the text form of the source sentence
-
print_target
(sentence)¶ Print a target sentence represented by an index vector.
Parameters: sentence (numpy array) – indexed sentence. size:(length, 1) Returns: string, the text form of the target sentence
-
save_status
(path)¶ Save the training status to file.
Parameters: path (string) – the path to a file
-
toindex
(sentence, ivocab, index_unk, index_eos)¶ Transform a sentence text to indexed sentence.
Parameters: - sentence (string) – sentence text
- ivocab (dict) – the vocabulary to index
- index_unk (int) – the index of unknown word symbol
- index_eos (int) – the index of end-of-sentence symbol
Returns: numpy array, the indexed sentence
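The mapping described by toindex can be illustrated with a minimal sketch. This is not the library's implementation: the function name to_index_sketch is hypothetical, whitespace tokenization and appending the end-of-sentence index are assumptions, and a plain list is returned instead of a numpy array to keep the example dependency-free.

```python
def to_index_sketch(sentence, ivocab, index_unk, index_eos):
    """Map a sentence to word indices (illustrative only)."""
    # Assumption: tokens are separated by whitespace.
    tokens = sentence.strip().split()
    # Words missing from the vocabulary fall back to the unknown-word index.
    indices = [ivocab.get(token, index_unk) for token in tokens]
    # Assumption: the end-of-sentence index is appended once at the end.
    indices.append(index_eos)
    return indices


ivocab = {"the": 2, "cat": 3}
print(to_index_sketch("the cat sleeps", ivocab, index_unk=1, index_eos=0))
# [2, 3, 1, 0]
```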
-
toindex_source
(sentence)¶ Transform a source language word list to index list.
Parameters: sentence (string) – sentence text Returns: numpy array, the indexed source sentence
-
toindex_target
(sentence)¶ Transform a target language word list to index list.
Parameters: sentence (string) – sentence text Returns: numpy array, the indexed target sentence
-
thumt.data.
getbatch
(lx, ly, config)¶ Get a batch for training.
Parameters: - lx (numpy array) – 2-D numpy arrays, each row contains an indexed source sentence
- ly (numpy array) – 2-D numpy arrays, each row contains an indexed target sentence
- config (dict) – the configuration
thumt.layer module¶
-
class
thumt.layer.
FeedForwardLayer
(name, dim_in, dim_out, active=<theano.tensor.elemwise.Elemwise object>, offset=False)¶ Bases:
thumt.layer.Layer
A single-layer feed-forward neural network.
Parameters: - dim_in (int) – the dimension of input vectors
- dim_out (int) – the dimension of output vectors
- active (function) – the activation function
- offset (bool) – if true, this layer will contain bias
-
forward
(state_in)¶ Build the computational graph.
Parameters: state_in (theano variable) – the input state
-
class
thumt.layer.
GatedRecurrentLayer
(name, dim_in, dim, active=<theano.tensor.elemwise.Elemwise object>, verbose=False)¶ Bases:
thumt.layer.Layer
The gated recurrent layer used to encode the source sentences.
Parameters: - dim_in (int) – the number of input units
- dim (int) – the number of hidden state units
- active (function) – the activation function
- verbose (bool) – only set to True on visualization
-
forward
(emb_in, length, state_init=None, batch_size=1, mask=None)¶ Build the computational graph which computes the hidden states.
Parameters: - emb_in (theano variable) – the input word embeddings
- length (theano variable) – the length of the input
- batch_size (int) – the batch size
- mask (theano variable) – indicate the length of each sequence in one batch
-
forward_step
(state_before, state_in, gate_in, reset_in, mask=None)¶ Build the one-step computational graph which computes the next hidden state.
Parameters: - state_before (theano variable) – The previous hidden state
- state_in (theano variable) – the input state
- gate_in (theano variable) – the input to update gate
- reset_in (theano variable) – the input to reset gate
- mask (theano variable) – indicate the length of each sequence in one batch
-
class
thumt.layer.
GatedRecurrentLayer_attention
(name, dim_in, dim_c, dim, dim_class, active=<theano.tensor.elemwise.Elemwise object>, maxout=2, verbose=False)¶ Bases:
thumt.layer.Layer
The gated recurrent layer with attention mechanism, used as decoder.
Parameters: - dim_in (int) – the number of input units
- dim_c (int) – the number of context units
- dim (int) – the number of hidden units
- dim_class (int) – the size of the target vocabulary
- active (function) – the activation function
- maxout (int) – the number of maxout parts
- verbose (bool) – only set to True on visualization
-
decode_next
(c, state, emb)¶ Get the next hidden state. Used in beam search and sampling.
Parameters: - c (theano variable) – the current context reading
- state (theano variable) – the last hidden state
- emb (theano variable) – the embedding of the last generated word
-
decode_probs
(context, state, emb)¶ Get the probability of the next word. Used in beam search and sampling.
Parameters: - context (theano variable) – the context vectors
- state (theano variable) – the last hidden state
- emb (theano variable) – the embedding of the last generated word
-
forward
(emb_in, length, context, state_init, batch_size=1, mask=None, cmask=None)¶ Build the computational graph which computes the hidden states.
Parameters: - emb_in (theano variable) – the input word embeddings
- length (theano variable) – the length of the input
- context (theano variable) – the context vectors
- state_init (theano variable) – the initial states
- batch_size (int) – the batch size
- mask (theano variable) – indicate the length of each sequence in one batch
- cmask (theano variable) – indicate the length of each context sequence in one batch
-
forward_step
(state_before, state_in, gate_in, reset_in, context, att_c, mask=None, cmask=None)¶ Build the one-step computational graph which calculates the next hidden state.
Parameters: - state_before (theano variable) – The previous hidden state
- state_in (theano variable) – the input state
- gate_in (theano variable) – the input to update gate
- reset_in (theano variable) – the input to reset gate
- mask (theano variable) – indicate the length of each sequence in one batch
- cmask (theano variable) – indicate the length of each context sequence in one batch
- context (theano variable) – the context vectors
- att_c (theano variable) – the attention vector from context
-
class
thumt.layer.
Layer
¶ Bases:
object
The parent class of neural network layers.
-
class
thumt.layer.
LayerFactory
¶ Bases:
object
The factory to build and monitor all neural network layers.
-
class
thumt.layer.
LookupTable
(name, num, dim_embed, offset=True)¶ Bases:
thumt.layer.Layer
A lookup table layer which stores the word embeddings.
Parameters: - num (int) – the number of words
- dim_embed (int) – the dimension of the word embedding
- offset (bool) – if true, this layer will contain bias
-
forward
(index)¶ Build the computational graph.
Parameters: index (theano variable) – the input state (word indices)
thumt.mrt_utils module¶
-
thumt.mrt_utils.
calBleu
(x, ref_dict, lens, ngram)¶ Calculate BLEU score with single reference
Parameters: - x (list) – the indexed hypothesis sentence
- ref_dict (dict) – the n-gram count generated by getRefDict()
- lens (int) – the length of the reference
- ngram (int) – maximum length of counted n-grams
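A sentence-level BLEU score against a single reference can be sketched as follows. The sketch counts reference n-grams in the spirit of getRefDict() and then computes clipped n-gram precisions with a brevity penalty. The function names and the add-one smoothing are assumptions made for illustration; the smoothing actually used by calBleu is not documented here.

```python
import math
from collections import Counter


def ngram_counts(words, ngram):
    """Count all n-grams of length 1..ngram in a list of word indices."""
    counts = Counter()
    for order in range(1, ngram + 1):
        for start in range(len(words) - order + 1):
            counts[tuple(words[start:start + order])] += 1
    return counts


def sentence_bleu_sketch(hyp, ref, ngram=4):
    """Smoothed sentence-level BLEU with one reference (illustrative only)."""
    hyp_counts = ngram_counts(hyp, ngram)
    ref_counts = ngram_counts(ref, ngram)
    log_precision = 0.0
    for order in range(1, ngram + 1):
        # Clip each hypothesis n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram])
                      for gram, count in hyp_counts.items()
                      if len(gram) == order)
        total = max(len(hyp) - order + 1, 0)
        # Assumption: add-one smoothing keeps short sentences from scoring zero.
        log_precision += math.log((matched + 1.0) / (total + 1.0))
    # Brevity penalty for hypotheses shorter than the reference.
    if len(hyp) >= len(ref):
        penalty = 1.0
    else:
        penalty = math.exp(1.0 - len(ref) / max(len(hyp), 1))
    return penalty * math.exp(log_precision / ngram)


print(sentence_bleu_sketch([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0
```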
-
thumt.mrt_utils.
cutSen
(x, config)¶ Cut the part after the end-of-sentence symbol
Parameters: - x (list) – indexed sentence
- config (dict) – the configuration
-
thumt.mrt_utils.
getMRTBatch
(x, xmask, y, ymask, config, model, data)¶ Get a batch for MRT training
Parameters: - x (numpy array) – the indexed source sentence
- xmask (numpy array) – indicate the length of each sequence in source sequence
- y (numpy array) – the indexed target sentence
- ymask (numpy array) – indicate the length of each sequence in target sequence
- config (dict) – the configuration
- model (Model) – the NMT model
- data (DataCollection) – the data manager
-
thumt.mrt_utils.
getRefDict
(words, ngram)¶ Get the count of n-grams in the reference
Parameters: - words (list) – indexed sentence
- ngram (int) – maximum length of counted n-grams
-
thumt.mrt_utils.
getUnique
(samples, y, config, model, data)¶ Remove repeated sentences from sampling results. Then calculate BLEU score for each sentence.
Parameters: - y (numpy array) – the indexed target sentence
- config (dict) – the configuration
- model (Model) – the NMT model
- data (DataCollection) – the data manager
-
thumt.mrt_utils.
getYM
(y, config)¶ Get masks which indicate the length of target sentences
Parameters: - y (list) – the indexed sentences
- config (dict) – the configuration
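A length mask of the kind getYM describes can be sketched with NumPy. The layout (one sentence per row), the explicit index_eos argument, and the function name are assumptions; the library reads its settings from config and may use a different array layout.

```python
import numpy as np


def length_mask_sketch(y, index_eos):
    """Build a 0/1 mask covering each row up to and including its first EOS."""
    y = np.asarray(y)
    mask = np.zeros(y.shape, dtype="float32")
    for row, sentence in enumerate(y):
        hits = np.flatnonzero(sentence == index_eos)
        # Assumption: a row without an EOS symbol counts as full length.
        length = hits[0] + 1 if hits.size else sentence.shape[0]
        mask[row, :length] = 1.0
    return mask


# Two padded sentences; index 0 is the (assumed) end-of-sentence symbol.
print(length_mask_sketch([[5, 0, 0], [4, 6, 0]], index_eos=0))
```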
thumt.nmt module¶
-
class
thumt.nmt.
RNNsearch
(config, name='')¶ Bases:
thumt.nmt.model
The attention-based NMT model
-
build
(verbose=False)¶ Build the computational graph.
Parameters: verbose (bool) – only set to True on visualization
-
decode_sample
(state_init, c, length, n_samples)¶ Build the decoder graph for sampling.
Parameters: - state_init (theano variables) – the initial state of decoder
- c (theano variables) – the context vectors
- length (int) – the limitation of sample length
- n_samples (int) – the number of samples
-
encode
(x)¶ Encode source sentence to context vector.
-
get_attention
(x, xmask, y, ymask)¶ Get the attention weight of parallel sentences.
-
get_context_and_init
(x)¶ Encode source sentence to context vectors and get the initial decoder hidden state.
-
get_cost
(x, xmask, y, ymask)¶ Get the negative log-likelihood of parallel sentences.
-
get_init
(c)¶ Get the initial decoder hidden state with context vector.
-
get_layer
(x, xmask, y, ymask)¶ Get the hidden states essential for visualization
-
get_next
(c, state, emb)¶ Get the next hidden state.
-
get_probs
(c, state, emb)¶ Get the probability of the next target word.
-
get_sample
(x, length, n_samples)¶ Get sampling results.
-
get_trg_embedding
(y)¶ Get the embedding of target sentence.
-
sampling_step
(state, prev, context)¶ Build the computational graph which samples the next word.
Parameters: - state (theano variables) – the previous hidden state
- prev (theano variables) – the last generated word
- context (theano variables) – the context vectors.
-
-
class
thumt.nmt.
model
¶ Bases:
object
The parent class of NMT models
-
load
(path, decode=False)¶ Load the model from an npz file. The checkpoint model is loaded if it exists; otherwise a new model is initialized (MLE) or the given model is loaded (MRT or semi-supervised training).
Parameters: - path (string) – the path to a file
- decode (bool) – Set to True only on decoding
-
sample
(x, length, n_samples=1)¶ Sample with probability.
Parameters: - x (numpy array) – the indexed source sentence
- length (int) – the length limit of samples
- n_samples (int) – number of samples
Returns: a numpy array, the indexed sample results
-
save
(path, data=None, mapping=None)¶ Save the model in npz format.
Parameters: - path (string) – the path to a file
- data (DataCollection) – the data manager, will save the vocabulary into the model if set.
- mapping (dict) – the mapping file used in UNKreplace, will save it to the model if set
-
translate
(x, beam_size=10, return_array=False)¶ Decode with beam search.
Parameters: - x (numpy array) – the indexed source sentence
- beam_size (int) – beam size
Returns: a numpy array, the indexed translation result
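Beam search itself can be illustrated independently of the model. In the sketch below, next_log_probs is a hypothetical callback standing in for the decoder: it maps a prefix of word indices to a dict of log-probabilities for the next word. The max_len limit and the absence of length normalization are assumptions, so this shows the general technique rather than what translate does internally.

```python
import math


def beam_search_sketch(next_log_probs, index_eos, beam_size=10, max_len=20):
    """Return the best-scoring hypothesis found by beam search."""
    beams = [([], 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, log_prob in next_log_probs(prefix).items():
                candidates.append((prefix + [word], score + log_prob))
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == index_eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    # Fall back to unfinished hypotheses if none reached end-of-sentence.
    pool = finished or beams
    return max(pool, key=lambda item: item[1])[0]


def toy_model(prefix):
    # Hypothetical distribution: word 1 is preferred first, then EOS (index 0).
    if not prefix:
        return {1: math.log(0.6), 2: math.log(0.4)}
    return {0: math.log(0.9), 1: math.log(0.1)}


print(beam_search_sketch(toy_model, index_eos=0, beam_size=2))  # [1, 0]
```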
-
thumt.preprocess module¶
-
thumt.preprocess.
preprocess
(num, vocab_f, ivocab_f, input_f, output_f, data_vocab='cPickle', data_corpus='json', withdict=False, fromgr=False)¶ Count the most frequent words in the training corpus. Then, generate the vocabulary file and index the corpus.
Parameters: - vocab_f (string) – the path to vocabulary file
- ivocab_f (string) – the path to vocabulary to index file
- input_f (string) – the path to corpus (text) file
- output_f (string) – the path to indexed corpus file
- withdict (bool) – if set to True, vocabulary will be loaded from existing file instead.
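The vocabulary-building step can be sketched on in-memory text. The function name, the special symbols "<eos>" and "<unk>", and their placement at the first indices are assumptions for illustration; the real preprocess function reads and writes files and its symbol conventions are not documented here.

```python
from collections import Counter


def build_vocab_sketch(lines, num, specials=("<eos>", "<unk>")):
    """Keep the num most frequent words and map each word to an index."""
    counts = Counter(word for line in lines for word in line.split())
    # Assumption: special symbols occupy the first indices.
    vocab = list(specials) + [word for word, _ in counts.most_common(num)]
    ivocab = {word: index for index, word in enumerate(vocab)}
    return vocab, ivocab


vocab, ivocab = build_vocab_sketch(["a b a", "a c b"], num=2)
print(vocab)   # ['<eos>', '<unk>', 'a', 'b']
print(ivocab)  # {'<eos>': 0, '<unk>': 1, 'a': 2, 'b': 3}
```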
-
thumt.preprocess.
shuffle
(src, trg, src_shuf, trg_shuf, data_corpus='json')¶ Randomly shuffle the parallel corpus.
Parameters: - src (string) – the path to indexed source corpus file
- trg (string) – the path to indexed target corpus file
- src_shuf (string) – the path to shuffled indexed source corpus file
- trg_shuf (string) – the path to shuffled indexed target corpus file
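When shuffling a parallel corpus, both sides must be reordered with the same permutation so that sentence pairs stay aligned. A minimal sketch on in-memory lists follows; the function name and the seed argument are hypothetical, and the real function operates on files.

```python
import random


def shuffle_parallel_sketch(src_sents, trg_sents, seed=None):
    """Shuffle two aligned lists with one shared permutation."""
    if len(src_sents) != len(trg_sents):
        raise ValueError("source and target must have the same length")
    order = list(range(len(src_sents)))
    # A single permutation keeps every source/target pair together.
    random.Random(seed).shuffle(order)
    return [src_sents[i] for i in order], [trg_sents[i] for i in order]


src = ["s0", "s1", "s2"]
trg = ["t0", "t1", "t2"]
print(shuffle_parallel_sketch(src, trg, seed=0))
```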
-
thumt.preprocess.
shuffle_mono
(src, src_shuf, data_corpus='json')¶ Randomly shuffle the monolingual corpus.
Parameters: - src (string) – the path to indexed corpus file
- src_shuf (string) – the path to shuffled indexed corpus file
thumt.tools module¶
-
thumt.tools.
bleu
(hypo_c, refs_c, n)¶ Calculate BLEU score given translation and references.
Parameters: - hypo_c (string) – the translations
- refs_c (list) – the list of references
- n (int) – maximum length of counted n-grams
-
thumt.tools.
bleu_file
(hypo, refs)¶ Calculate the BLEU score given translation files and reference files.
Parameters: - hypo (string) – the path to translation file
- refs (list) – the list of path to reference files
-
thumt.tools.
clip
(grads, threshold, square=True, params=None)¶ Build the computational graph that clips the gradient if the norm of the gradient exceeds the threshold.
Parameters: - grads (theano variable) – the gradient to be clipped
- threshold (float) – the threshold of the norm of the gradient
Returns: theano variable. The clipped gradient.
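Norm-based gradient clipping can be illustrated with NumPy. This sketch works on a list of arrays rather than a Theano graph and ignores the undocumented square and params arguments, so it shows the general technique under those assumptions, not the library's code.

```python
import numpy as np


def clip_by_norm_sketch(grads, threshold):
    """Rescale gradients when their joint L2 norm exceeds threshold."""
    # Joint L2 norm over all gradient arrays.
    norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if norm <= threshold:
        return [g.copy() for g in grads]
    # One shared factor brings the joint norm down to the threshold.
    scale = threshold / norm
    return [g * scale for g in grads]


grads = [np.array([3.0, 0.0]), np.array([4.0])]  # joint norm is 5
clipped = clip_by_norm_sketch(grads, threshold=1.0)
print(clipped)
```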
-
thumt.tools.
cut_sentence
(sentence, index_eos)¶ Cut the sentence after the end-of-sentence symbol.
Parameters: - sentence (numpy array) – the indexed sentence
- index_eos (int) – the index of end-of-sentence symbol
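Cutting an indexed sentence at the end-of-sentence symbol can be sketched in a few lines. Whether the symbol itself is kept is not stated above; keeping it is an assumption of this sketch, as is the function name.

```python
def cut_sentence_sketch(sentence, index_eos):
    """Truncate an indexed sentence at its first end-of-sentence symbol."""
    sentence = list(sentence)
    if index_eos in sentence:
        # Assumption: the end-of-sentence symbol itself is kept.
        return sentence[:sentence.index(index_eos) + 1]
    # No end-of-sentence symbol: return the sentence unchanged.
    return sentence


print(cut_sentence_sketch([5, 7, 0, 3, 0], index_eos=0))  # [5, 7, 0]
```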
-
thumt.tools.
dot3d
(input, weight)¶ Build the computational graph of 3-d matrix multiply operation.
Parameters: - input (theano variable) – the input variable
- weight (theano variable) – the weight parameter
-
thumt.tools.
duplicate
(input, times)¶ Repeat a 2-D tensor the given number of times along axis 1.
Parameters: input (theano variable) – the input variable
-
thumt.tools.
get_ref_files
(ref)¶ Get the list of reference files by prefix. Suppose nist02.en0, nist02.en1, nist02.en2, nist02.en3 are references and nist02.en does not exist, then get_ref_files("nist02.en") = ["nist02.en0", "nist02.en1", "nist02.en2", "nist02.en3"]
Parameters: ref (string) – the prefix of reference files
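The prefix lookup described above can be sketched as follows. Two points are assumptions not confirmed by the description: that an existing file named exactly like the prefix is returned on its own, and that numbered references are contiguous starting from suffix 0.

```python
import os


def get_ref_files_sketch(ref):
    """Collect reference files that share the prefix ref (illustrative only)."""
    # Assumption: if the prefix names an existing file, it is the only reference.
    if os.path.exists(ref):
        return [ref]
    files = []
    index = 0
    # Assumption: numbered references are contiguous, starting at suffix 0.
    while os.path.exists(ref + str(index)):
        files.append(ref + str(index))
        index += 1
    return files
```

With files nist02.en0 to nist02.en3 on disk and no nist02.en, the sketch returns the four numbered paths in order.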
-
thumt.tools.
init_bias
(size, name, scale=0.01, shared=True)¶ Initialize bias parameter in neural networks.
Parameters: - size (tuple or list) – the size of the parameter
- name (string) – the name of the parameter
- scale (float) – the scale of the parameter
-
thumt.tools.
init_weight
(size, name, scale=0.01, shared=True)¶ Randomly initialize weight parameter in neural networks.
Parameters: - size (tuple or list) – the size of the parameter
- name (string) – the name of the parameter
- scale (float) – the scale of the parameter
-
thumt.tools.
init_zeros
(size, shared=True)¶ Initialize a zero matrix.
Parameters: size (tuple or list) – the size of the matrix
-
thumt.tools.
maxout
(input, max_num=2)¶ Build the computational graph of maxout operation.
Parameters: - input (theano variable) – the input variable
- max_num (int) – the number of maxout parts
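A maxout operation takes the maximum over groups of units. The NumPy sketch below assumes the groups are formed from consecutive units on the last axis; the grouping used by the library's Theano graph may differ.

```python
import numpy as np


def maxout_sketch(x, max_num=2):
    """Take the maximum over consecutive groups of max_num units."""
    if x.shape[-1] % max_num != 0:
        raise ValueError("last dimension must be divisible by max_num")
    # Assumption: units are grouped consecutively, e.g. (0, 1), (2, 3), ...
    grouped = x.reshape(x.shape[:-1] + (x.shape[-1] // max_num, max_num))
    return grouped.max(axis=-1)


print(maxout_sketch(np.array([[1.0, 5.0, 2.0, 0.0]])))  # [[5. 2.]]
```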
-
thumt.tools.
merge_dict
(d1, d2)¶ Merge two dicts. The count of each item is the maximum count in two dicts.
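Taking the maximum count per item is what multi-reference BLEU needs when clipping hypothesis n-gram counts. A minimal sketch of such a merge, with a hypothetical function name:

```python
def merge_dict_sketch(d1, d2):
    """Merge two count dicts, keeping the larger count for each key."""
    merged = dict(d1)
    for key, count in d2.items():
        merged[key] = max(merged[key], count) if key in merged else count
    return merged


ref_a = {"the": 2, "cat": 1}
ref_b = {"the": 1, "dog": 3}
print(merge_dict_sketch(ref_a, ref_b))  # {'the': 2, 'cat': 1, 'dog': 3}
```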
-
thumt.tools.
padzero
(input)¶ Build the computational graph that pads zeros to the left of the input.
Parameters: input (theano variable) – the input variable
-
thumt.tools.
print_time
(time)¶ Parameters: time (float) – the number of seconds Returns: string, the text format of time
-
thumt.tools.
sentence2dict
(sentence, n)¶ Count the number of n-grams in a sentence.
Parameters: - sentence (string) – sentence text
- n (int) – maximum length of counted n-grams
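Counting n-grams up to a maximum length can be sketched with collections.Counter. Whitespace tokenization and keying each n-gram by its space-joined tokens are assumptions; the key format used by sentence2dict is not documented here.

```python
from collections import Counter


def sentence2dict_sketch(sentence, n):
    """Count every n-gram of length 1..n in a sentence."""
    tokens = sentence.split()
    counts = Counter()
    for order in range(1, n + 1):
        for start in range(len(tokens) - order + 1):
            # Assumption: an n-gram is keyed by its tokens joined with a space.
            counts[" ".join(tokens[start:start + order])] += 1
    return dict(counts)


print(sentence2dict_sketch("a b a", 2))
# {'a': 2, 'b': 1, 'a b': 1, 'b a': 1}
```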
-
thumt.tools.
shift_one
(input)¶ Add a zero vector to the left side of input and remove the rightmost vector.
Parameters: input (theano variable) – the input variable
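A shift of this kind is commonly used so that a decoder at step t sees the embedding of the word at step t-1. The NumPy sketch below assumes the "left side" is the start of axis 0; the axis used by the library is not stated above.

```python
import numpy as np


def shift_one_sketch(x):
    """Prepend a zero vector along axis 0 and drop the last vector."""
    zeros = np.zeros_like(x[:1])
    return np.concatenate([zeros, x[:-1]], axis=0)


x = np.array([[1, 2], [3, 4], [5, 6]])
print(shift_one_sketch(x))
# [[0 0]
#  [1 2]
#  [3 4]]
```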
-
thumt.tools.
softmax
(energy, axis=1)¶ The softmax operation.
Parameters: energy (theano variable) – the energy value for each class
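For reference, a softmax along a chosen axis can be written in NumPy as below. Subtracting the maximum before exponentiating is a standard numerical-stability step and an addition of this sketch; it is not confirmed to be part of the library's version.

```python
import numpy as np


def softmax_sketch(energy, axis=1):
    """Normalize energies into probabilities along the given axis."""
    # Subtracting the maximum does not change the result but avoids overflow.
    shifted = energy - energy.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


print(softmax_sketch(np.array([[0.0, 0.0], [1.0, 3.0]])))
```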
-
thumt.tools.
softmax3d
(input)¶ Build the computational graph of the softmax operation.
Parameters: input (theano variable) – the input variable
thumt.lrp module¶
-
class
thumt.lrp.
BackEncoderVal
(word_num, config)¶ Bases:
object
The class which stores the intermediate variables produced by backward encoder in NMT model
Parameters: - word_num (int) – the length of source sentence
- config (dict) – the configuration file
-
readData
(filename)¶ Load intermediate variables produced by backward encoder in NMT model
Parameters: filename (string) – the file which stores intermediate variables
-
class
thumt.lrp.
DecoderVal
(src_word_num, trg_word_num, config)¶ Bases:
object
The class which stores the intermediate variables produced by decoder in NMT model
Parameters: - src_word_num (int) – the length of source sentence
- trg_word_num (int) – the length of target sentence
- config (dict) – the configuration file
-
readData
(filename)¶ Load intermediate variables produced by decoder in NMT model
Parameters: filename (string) – the file which stores intermediate variables
-
class
thumt.lrp.
EncoderVal
(word_num, config)¶ Bases:
object
The class which stores the intermediate variables produced by forward encoder in NMT model
Parameters: - word_num (int) – the length of source sentence
- config (dict) – the configuration file
-
readData
(filename)¶ Load intermediate variables produced by forward encoder in NMT model
Parameters: filename (string) – the file which stores intermediate variables
-
class
thumt.lrp.
Model
(src_word_num, trg_word_num, config)¶ Bases:
object
The class which calculates the relevance
Parameters: - src_word_num (int) – the length of source sentence
- trg_word_num (int) – the length of target sentence
- config (dict) – the configuration file
-
cal_back_encoder
()¶ Calculate the relevance in backward encoder
Returns: 4-D numpy array, the relevance between backward encoder hidden states and x
-
cal_decoder
(R_x)¶ Calculate the relevance in decoder
Parameters: R_x (theano sharedVariable) – the relevance between inputs and the hidden states of encoder. Returns: R_c_x, R_h_x, R_o_x, R_h_y, R_o_y are numpy arrays: the relevance between context and x, between decoder hidden states and x, between readout and x, between decoder hidden states and y, and between readout and y
-
cal_decoder_step
(decoder_val)¶ Calculate the weight ratios in decoder
Parameters: decoder_val (class) – the class which stores the intermediate variables in decoder Returns: R_h_h, R_h_x, R_h_y, R_outenergy_2_h, R_outenergy_2_x, R_outenergy_2_y_before are theano variables, weight ratios in decoder.
-
cal_encoder
()¶ Calculate the relevance in forward encoder
Returns: 4-D numpy array, the relevance between forward encoder hidden states and x
-
cal_encoder_step
(encoder_val)¶ Calculate the weight ratios in encoder
Parameters: encoder_val (class) – the class which stores the intermediate variables in encoder Returns: R_h_x, R_h_h are theano variables, weight ratios in encoder
-
readData
(param_filename, val_filename)¶ Load the parameters of NMT models and intermediate variables
Parameters: - param_filename (string) – the file which stores the parameters of NMT models
- val_filename (string) – the file which stores intermediate variables
-
thumt.lrp.
init_idx
(size, name, shared=True)¶ Initialize the word indices in a sentence
-
thumt.lrp.
init_weight
(size, name, shared=True)¶ Initialize a weight matrix
-
thumt.lrp.
normalize
(R, f)¶ Normalize the relevance vector and write the result to the output file
Parameters: - R (4-D numpy array) – the relevance vector to be normalized
- f (file) – output file
-
thumt.lrp.
save
(att, f)¶ Write attentions to the output file
Parameters: - att (2-D numpy array) – attention information
- f (file) – output file