【2020 Huawei Cloud AI Bootcamp】How to automatically generate image captions with deep learning
【Abstract】 Generating captions for images has long been a challenging problem in AI: computer vision is needed to understand what is in the image, and NLP to turn that understanding into text. This article walks through preparing the images and captions, designing and training a caption-generation model, and evaluating and testing it on the Flickr8K dataset.
Generating captions for images has long been one of the more challenging problems in AI: first we need computer vision (CV) to understand what is in the image, and then NLP to turn that understanding into text.
This article is organized into the following parts:
Preparing the images and their captions to train the model
Designing and training the caption-generation model
Evaluating the trained model and testing it
The image and caption dataset
Here we use the Flickr8K dataset, which contains 8,000 images, each with 5 captions:
1000268201_693b08cb0e.jpg#0	A child in a pink dress is climbing up a set of stairs in an entry way .
1000268201_693b08cb0e.jpg#1	A girl going into a wooden building .
1000268201_693b08cb0e.jpg#2	A little girl climbing into a wooden playhouse .
1000268201_693b08cb0e.jpg#3	A little girl climbing the stairs to her playhouse .
1000268201_693b08cb0e.jpg#4	A little girl in a pink dress going into a wooden cabin .
First we use a pre-trained model to extract the image features; here we use VGG16.
from os import listdir
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.models import Model

# extract features from each photo in the directory
def extract_features(directory):
    # load the model
    model = VGG16()
    # re-structure the model: drop the classification layer, keep the 4096-d fc2 output
    model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
    # summarize
    print(model.summary())
    # extract features from each photo
    features = dict()
    for name in listdir(directory):
        # load an image from file
        filename = directory + '/' + name
        image = load_img(filename, target_size=(224, 224))
        # convert the image pixels to a numpy array
        image = img_to_array(image)
        # reshape data for the model
        image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
        # prepare the image for the VGG model
        image = preprocess_input(image)
        # get features
        feature = model.predict(image, verbose=0)
        # get image id
        image_id = name.split('.')[0]
        # store feature
        features[image_id] = feature
        print('>%s' % name)
    return features
Note: if downloading the pre-trained weights is too slow, it is best to upload the weight file manually to ~/.keras/models/.
On an ordinary machine this step takes a few hours, but on ModelArts it only takes a bit over ten minutes. Finally we store the extracted features in features.pkl.
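A sketch of running the extraction and saving the result; the dataset directory name is an assumption, adjust it to wherever your images are:
from pickle import dump

# extract features from all photos in the dataset directory
directory = 'Flicker8k_Dataset'  # assumed location of the 8,000 images
features = extract_features(directory)
print('Extracted Features: %d' % len(features))
# save the features for reuse during training
dump(features, open('features.pkl', 'wb'))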
Preparing the captions
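The captions live in Flickr8k.token.txt, one "image#n&lt;TAB&gt;caption" entry per line. A small helper to read that file into memory, which the parsing code below assumes (a sketch):
# load the whole caption file into a single string
def load_doc(filename):
    with open(filename, 'r') as file:
        text = file.read()
    return text

doc = load_doc('Flickr8k.token.txt')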
# extract descriptions for images
def load_descriptions(doc):
    mapping = dict()
    # process lines
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        if len(line) < 2:
            continue
        # take the first token as the image id, the rest as the description
        image_id, image_desc = tokens[0], tokens[1:]
        # remove filename from image id
        image_id = image_id.split('.')[0]
        # convert description tokens back to string
        image_desc = ' '.join(image_desc)
        # create the list if needed
        if image_id not in mapping:
            mapping[image_id] = list()
        # store description
        mapping[image_id].append(image_desc)
    return mapping
Next we clean up the captions:
a. remove the image filename extension from the id
b. convert all text to lowercase and strip punctuation, numbers, and so on (a minimal routine is sketched below)
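A minimal cleaning sketch along these lines; the exact filtering (dropping single-character words and non-alphabetic tokens) is an assumption that matches the sample output shown next:
import string

# clean the descriptions in place: lowercase, strip punctuation, drop short/non-alpha tokens
def clean_descriptions(descriptions):
    # translation table that removes all punctuation
    table = str.maketrans('', '', string.punctuation)
    for key, desc_list in descriptions.items():
        for i in range(len(desc_list)):
            tokens = desc_list[i].split()
            # convert to lowercase
            tokens = [word.lower() for word in tokens]
            # remove punctuation from each token
            tokens = [w.translate(table) for w in tokens]
            # drop single-character tokens (e.g. 'a') and tokens containing numbers
            tokens = [word for word in tokens if len(word) > 1]
            tokens = [word for word in tokens if word.isalpha()]
            desc_list[i] = ' '.join(tokens)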
The cleaned result looks like this:
1000268201_693b08cb0e child in pink dress is climbing up set of stairs in an entry way
1000268201_693b08cb0e girl going into wooden building
1000268201_693b08cb0e little girl climbing into wooden playhouse
1000268201_693b08cb0e little girl climbing the stairs to her playhouse
1000268201_693b08cb0e little girl in pink dress going into wooden cabin
...
Building the model
The training and validation image ids are listed in Flickr_8k.trainImages.txt and Flickr_8k.devImages.txt respectively. The code below loads an id list and then the cleaned descriptions for that set, wrapping each caption in startseq/endseq tokens (here wrapped into two helper functions, load_set and load_clean_descriptions, so they can be reused for each split):
# load a pre-defined list of photo identifiers (train or dev split)
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    for line in doc.split('\n'):
        if len(line) < 1:
            continue
        # eg: 1000268201_693b08cb0e
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)

# load clean descriptions for the photos in the given set
def load_clean_descriptions(filename, dataset):
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        if len(line) < 1:
            continue
        # eg: 1000268201_693b08cb0e little girl in pink dress going into wooden cabin
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in start/end tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions
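Loading the training split might then look like this (a sketch; 'descriptions.txt' is assumed to be the file the cleaned captions were saved to, and features.pkl is the file saved earlier):
from pickle import load

# photo identifiers of the training split
train = load_set('Flickr_8k.trainImages.txt')
print('Dataset: %d' % len(train))
# cleaned descriptions, wrapped in startseq/endseq
train_descriptions = load_clean_descriptions('descriptions.txt', train)
# photo features extracted earlier with VGG16
all_features = load(open('features.pkl', 'rb'))
train_features = {k: all_features[k] for k in train}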
Next we encode the descriptions as integers so they can be fed to the model.
The key step is mapping every word to an integer, which we do with Keras's Tokenizer.
from keras.preprocessing.text import Tokenizer

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer
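create_tokenizer relies on a small to_lines helper that flattens the description dictionary into a plain list of caption strings. A sketch of that helper, plus deriving the vocabulary size and the maximum caption length (the +1 accounts for index 0, which Keras reserves for padding):
# flatten the description dictionary into a list of caption strings
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        all_desc.extend(descriptions[key])
    return all_desc

# fit the tokenizer on the training captions
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

# the longest caption determines the padded input length
max_length = max(len(d.split()) for d in to_lines(train_descriptions))
print('Description Length: %d' % max_length)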
The model is given the photo features plus the words generated so far, and it predicts the next word in the caption.
For example, "girl going into wooden building" generates 6 input-output pairs:
X1      X2 (text sequence)                                 y (word)
photo   startseq                                           girl
photo   startseq, girl                                     going
photo   startseq, girl, going                              into
photo   startseq, girl, going, into                        wooden
photo   startseq, girl, going, into, wooden                building
photo   startseq, girl, going, into, wooden, building      endseq
The encoded input sequences are fed into an embedding layer, and the target output word is one-hot encoded.
from numpy import array
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, descriptions, photos, vocab_size):
    X1, X2, y = list(), list(), list()
    # walk through each image identifier
    for key, desc_list in descriptions.items():
        # walk through each description for the image
        for desc in desc_list:
            # encode the sequence
            seq = tokenizer.texts_to_sequences([desc])[0]
            # split one sequence into multiple X,y pairs
            for i in range(1, len(seq)):
                # split into input and output pair
                in_seq, out_seq = seq[:i], seq[i]
                # pad input sequence
                in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                # encode output sequence
                out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                # store
                X1.append(photos[key][0])
                X2.append(in_seq)
                y.append(out_seq)
    return array(X1), array(X2), array(y)
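For example, preparing the training arrays (assuming train_descriptions and train_features were loaded as above):
X1train, X2train, ytrain = create_sequences(tokenizer, max_length,
                                            train_descriptions, train_features, vocab_size)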
Defining the model
The model consists of three parts:
1. Photo feature extractor
from keras.models import Model
from keras.layers import Input, Dense, Dropout, Embedding, LSTM, add

# feature extractor model: the 4096-d VGG16 feature vector, regularized and projected to 256-d
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)
2. Sequence processor
# sequence model
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)
3. Decoder
# decoder model
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)
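Tying the three parts together into one model and compiling it; a minimal sketch (the optimizer and loss follow the usual setup for a softmax over the vocabulary):
# combine the two inputs [photo features, text sequence] -> [next word]
model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
print(model.summary())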
Because the model is fairly large and memory-hungry, choose a machine flavor with at least 64 GB of RAM.
Now we start training. On ModelArts each epoch takes roughly 10 minutes, and we save a checkpoint after every epoch.
The model overfits quickly, so I stopped training after the 7th epoch.
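A sketch of the training loop with per-epoch checkpointing; X1test/X2test/ytest would be built from Flickr_8k.devImages.txt the same way as the training arrays, and the filename pattern matches the checkpoint loaded later:
from keras.callbacks import ModelCheckpoint

# save a checkpoint whenever validation loss improves
filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1,
                             save_best_only=True, mode='min')

# train; the run was stopped after 7 epochs because of overfitting
model.fit([X1train, X2train], ytrain, epochs=7, verbose=2,
          callbacks=[checkpoint],
          validation_data=([X1test, X2test], ytest))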
Evaluating the model
First we need to map the integer predictions back to words:
# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None
Then we can generate a description for a photo:
from numpy import argmax
from keras.preprocessing.sequence import pad_sequences

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    # seed the generation process
    in_text = 'startseq'
    # iterate over the whole length of the sequence
    for i in range(max_length):
        # integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # predict next word
        yhat = model.predict([photo, sequence], verbose=0)
        # convert probability to integer
        yhat = argmax(yhat)
        # map integer to word
        word = word_for_id(yhat, tokenizer)
        # stop if we cannot map the word
        if word is None:
            break
        # append as input for generating the next word
        in_text += ' ' + word
        # stop if we predict the end of the sequence
        if word == 'endseq':
            break
    return in_text
To evaluate the model we use BLEU scores, which measure how similar the generated captions are to the reference captions.
The closer a score is to 1.0 the better; 0 is the worst.
from nltk.translate.bleu_score import corpus_bleu

# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
    actual, predicted = list(), list()
    # step over the whole set
    for key, desc_list in descriptions.items():
        # generate description
        yhat = generate_desc(model, tokenizer, photos[key], max_length)
        # store actual and predicted
        references = [d.split() for d in desc_list]
        actual.append(references)
        predicted.append(yhat.split())
    # calculate BLEU scores
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
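Calling it on the held-out split might look like this (test_descriptions and test_features are assumed to be loaded the same way as the training split):
evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)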
The number after BLEU indicates the size of the n-grams used in the score:
BLEU-1: 0.547446
BLEU-2: 0.279654
BLEU-3: 0.186053
BLEU-4: 0.083339
Finally, let's test it on a new photo.
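Note that extract_features here needs a single-image variant of the earlier function, one that takes a filename and returns a single feature vector; a sketch with the same VGG16 preprocessing as before:
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.models import Model

# extract the VGG16 fc2 features for a single photo
def extract_features(filename):
    # load the model and drop the final classification layer
    model = VGG16()
    model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
    # load and prepare the photo
    image = load_img(filename, target_size=(224, 224))
    image = img_to_array(image)
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    image = preprocess_input(image)
    # get features
    return model.predict(image, verbose=0)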
from pickle import load
from keras.models import load_model

# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))
# pre-define the max sequence length (from training)
max_length = 34
# load the model
model = load_model('model-ep005-loss3.548-val_loss3.859.h5')
# load and prepare the photograph
photo = extract_features('lennyhydrofoil.jpg')
# generate description
description = generate_desc(model, tokenizer, photo, max_length)
print(description)
The result is quite good!