[2020 HUAWEI CLOUD AI Bootcamp] How to Automatically Caption Images with Deep Learning
Automatically captioning images has long been one of the more challenging problems in AI: we first need computer vision (CV) to understand what is in the image, and then NLP to turn that understanding into text.
This article is organized into the following parts:
Prepare the images and their corresponding captions for training the model
Design and train the caption-generation model
Evaluate the model and test it
Image and caption dataset
We use the Flickr8K dataset, which contains 8,000 images; each image comes with 5 captions.
1000268201_693b08cb0e.jpg#0  A child in a pink dress is climbing up a set of stairs in an entry way .
1000268201_693b08cb0e.jpg#1  A girl going into a wooden building .
1000268201_693b08cb0e.jpg#2  A little girl climbing into a wooden playhouse .
1000268201_693b08cb0e.jpg#3  A little girl climbing the stairs to her playhouse .
1000268201_693b08cb0e.jpg#4  A little girl in a pink dress going into a wooden cabin .
First we use a pre-trained model to extract image features; here we use VGG16.
from os import listdir
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.models import Model

# extract features from each photo in the directory
def extract_features(directory):
    # load the model
    model = VGG16()
    # re-structure the model to output the second-to-last (fc2) layer
    model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
    # summarize
    print(model.summary())
    # extract features from each photo
    features = dict()
    for name in listdir(directory):
        # load an image from file
        filename = directory + '/' + name
        image = load_img(filename, target_size=(224, 224))
        # convert the image pixels to a numpy array
        image = img_to_array(image)
        # reshape data for the model
        image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
        # prepare the image for the VGG model
        image = preprocess_input(image)
        # get features
        feature = model.predict(image, verbose=0)
        # get image id
        image_id = name.split('.')[0]
        # store feature
        features[image_id] = feature
        print('>%s' % name)
    return features
Note: if downloading the pre-trained weights takes too long, it is recommended to upload them manually to ~/.keras/models/.
On an ordinary machine this step takes a few hours, but on ModelArts it only takes ten-odd minutes. Finally, we save the extracted features to features.pkl.
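A minimal sketch of calling extract_features and saving the result; the directory name Flickr8k_Dataset is an assumption, not from the original post:

from pickle import dump

# extract features from all images and persist them for later training runs
features = extract_features('Flickr8k_Dataset')  # assumed directory name
print('Extracted Features: %d' % len(features))
dump(features, open('features.pkl', 'wb'))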
Prepare the captions
# extract descriptions for images
def load_descriptions(doc):
    mapping = dict()
    # process lines
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        if len(line) < 2:
            continue
        # take the first token as the image id, the rest as the description
        image_id, image_desc = tokens[0], tokens[1:]
        # remove filename from image id
        image_id = image_id.split('.')[0]
        # convert description tokens back to string
        image_desc = ' '.join(image_desc)
        # create the list if needed
        if image_id not in mapping:
            mapping[image_id] = list()
        # store description
        mapping[image_id].append(image_desc)
    return mapping
Next, clean up the captions:
a. Remove the image filename extension from the image id
b. Convert all text to lowercase and remove punctuation, numbers and so on (a minimal sketch of this step follows the list)
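A minimal sketch of this cleaning step, applied in place to the mapping returned by load_descriptions above; the function name clean_descriptions is my own:

import string

# clean the caption text in place: lowercase, strip punctuation,
# drop one-character tokens and tokens containing numbers
def clean_descriptions(descriptions):
    table = str.maketrans('', '', string.punctuation)
    for key, desc_list in descriptions.items():
        for i in range(len(desc_list)):
            desc = desc_list[i].split()
            # convert to lowercase
            desc = [word.lower() for word in desc]
            # remove punctuation from each token
            desc = [w.translate(table) for w in desc]
            # keep only alphabetic tokens longer than one character
            desc = [word for word in desc if len(word) > 1 and word.isalpha()]
            # store back as a single string
            desc_list[i] = ' '.join(desc)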
The cleaned captions look like this:
1000268201_693b08cb0e child in pink dress is climbing up set of stairs in an entry way
1000268201_693b08cb0e girl going into wooden building
1000268201_693b08cb0e little girl climbing into wooden playhouse
1000268201_693b08cb0e little girl climbing the stairs to her playhouse
1000268201_693b08cb0e little girl in pink dress going into wooden cabin
...
Build the model
The lists of training and validation images are stored in Flickr_8k.trainImages.txt and Flickr_8k.devImages.txt respectively; the image identifiers are parsed from them like this:
# eg: 1000268201_693b08cb0e
identifier = line.split('.')[0]
dataset.append(identifier)
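A hedged sketch of wrapping this snippet into a complete loader for one of the split files; the function name load_set is my own:

# load a pre-defined list of photo identifiers, e.g. from Flickr_8k.trainImages.txt
def load_set(filename):
    doc = open(filename, 'r').read()
    dataset = list()
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # eg: 1000268201_693b08cb0e.jpg -> 1000268201_693b08cb0e
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)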
The cleaned captions for the images in each split are then loaded and wrapped with startseq/endseq markers:

# eg. 1000268201_693b08cb0e little girl in pink dress going into wooden cabin
# split line by white space
tokens = line.split()
# split id from description
image_id, image_desc = tokens[0], tokens[1:]
# skip images not in the set
if image_id in dataset:
    # create list
    if image_id not in descriptions:
        descriptions[image_id] = list()
    # wrap description in startseq/endseq tokens
    desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
    # store
    descriptions[image_id].append(desc)
Next we encode the descriptions as integers so they can be used as model input.
The key step is mapping each word to an integer, which we do with Keras's Tokenizer.
from keras.preprocessing.text import Tokenizer

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    # to_lines flattens the description dict into a list of caption strings
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer
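Below is a short, hedged sketch of the to_lines helper used above, of deriving vocab_size and max_length (which create_sequences below relies on), and of saving the tokenizer for the test step; the variable name train_descriptions is my assumption, not from the original post:

from pickle import dump

# flatten the description dict into a list of caption strings
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        all_desc.extend(descriptions[key])
    return all_desc

# 'train_descriptions' is assumed to be the dict of wrapped captions built above
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

# length of the longest caption, used to pad all input sequences
max_length = max(len(d.split()) for d in to_lines(train_descriptions))
print('Description Length: %d' % max_length)

# save the tokenizer so the test script can reload it later
dump(tokenizer, open('tokenizer.pkl', 'wb'))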
For each caption, the photo together with the words generated so far is used as input to predict the next word.
For example:
"girl going into wooden building"会生成6对input-output X1 X2 (text sequence) y (word) photo startseq, girl photo startseq, girl going photo startseq, girl, going, into photo startseq, girl, going, into wooden photo startseq, girl, going, into, wooden buidling photo startseq, girl, going, into, wooden, building, endseq
The encoded input sequences are fed into an Embedding layer, and the output word is one-hot encoded.
from numpy import array
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, descriptions, photos, vocab_size):
    X1, X2, y = list(), list(), list()
    # walk through each image identifier
    for key, desc_list in descriptions.items():
        # walk through each description for the image
        for desc in desc_list:
            # encode the sequence
            seq = tokenizer.texts_to_sequences([desc])[0]
            # split one sequence into multiple X,y pairs
            for i in range(1, len(seq)):
                # split into input and output pair
                in_seq, out_seq = seq[:i], seq[i]
                # pad input sequence
                in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                # encode output sequence
                out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                # store
                X1.append(photos[key][0])
                X2.append(in_seq)
                y.append(out_seq)
    return array(X1), array(X2), array(y)
Define the model
The model consists of three parts:
1. Photo feature extractor
from keras.layers import Input, Dense, Dropout, Embedding, LSTM, add

# feature extractor model (VGG16 fc2 features, 4096-d)
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)
2. Sequence processor
# sequence model
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)
3. Decoder
# decoder model
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)
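The three parts are then merged into a single model and compiled; a minimal sketch following the layers defined above:

from keras.models import Model

# tie the image branch and the sequence branch together
model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
print(model.summary())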
Because the model is fairly complex and not very memory-friendly, it needs a lot of RAM; try to pick an instance with at least 64 GB.
Start training. On ModelArts each epoch takes about 10 minutes, and we save a checkpoint after every epoch.
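A sketch of training with a per-epoch checkpoint; the array names X1train/X2train/ytrain and their validation counterparts are assumptions for the outputs of create_sequences above:

from keras.callbacks import ModelCheckpoint

# save a checkpoint whenever the validation loss improves
filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1,
                             save_best_only=True, mode='min')

# X1*/X2*/y* are assumed to come from create_sequences on the train/dev splits
model.fit([X1train, X2train], ytrain, epochs=20, verbose=2,
          callbacks=[checkpoint],
          validation_data=([X1test, X2test], ytest))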
The model overfits quickly, so I stopped training at epoch 7.
Evaluate the model
First we need to map the integer codes back to words.
# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None
Then we generate a description for an image:
from numpy import argmax
from keras.preprocessing.sequence import pad_sequences

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    # seed the generation process
    in_text = 'startseq'
    # iterate over the whole length of the sequence
    for i in range(max_length):
        # integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # predict next word
        yhat = model.predict([photo, sequence], verbose=0)
        # convert probability to integer
        yhat = argmax(yhat)
        # map integer to word
        word = word_for_id(yhat, tokenizer)
        # stop if we cannot map the word
        if word is None:
            break
        # append as input for generating the next word
        in_text += ' ' + word
        # stop if we predict the end of the sequence
        if word == 'endseq':
            break
    return in_text
We evaluate the model with the BLEU score, which measures how similar the generated captions are to the reference captions. The closer the score is to 1.0 the better; 0 is the worst.
from nltk.translate.bleu_score import corpus_bleu

# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
    actual, predicted = list(), list()
    # step over the whole set
    for key, desc_list in descriptions.items():
        # generate description
        yhat = generate_desc(model, tokenizer, photos[key], max_length)
        # store actual and predicted
        references = [d.split() for d in desc_list]
        actual.append(references)
        predicted.append(yhat.split())
    # calculate BLEU score
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
The number after BLEU refers to the n-gram order:

BLEU-1: 0.547446
BLEU-2: 0.279654
BLEU-3: 0.186053
BLEU-4: 0.083339

Finally, let's test the model on a new photo.
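Note that extract_features in the test snippet below takes a single image file rather than a directory; a minimal sketch of such a variant:

from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.models import Model

# extract VGG16 fc2 features for a single photo file
def extract_features(filename):
    model = VGG16()
    model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
    image = load_img(filename, target_size=(224, 224))
    image = img_to_array(image)
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    image = preprocess_input(image)
    return model.predict(image, verbose=0)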
from pickle import load
from keras.models import load_model

# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))
# pre-define the max sequence length (from training)
max_length = 34
# load the model
model = load_model('model-ep005-loss3.548-val_loss3.859.h5')
# load and prepare the photograph
photo = extract_features('lennyhydrofoil.jpg')
# generate description
description = generate_desc(model, tokenizer, photo, max_length)
print(description)
The result looks pretty good!