[Kaggle] Bird Call Identification

Posted by AI浩 on 2021/12/23 00:39:56

Contents

Competition task

Identifying bird calls in soundscape recordings

Files

Data download

Understanding the task

Code

Converting audio to images

Splitting the training and validation sets

Training

Testing


Competition task

Identifying bird calls in soundscape recordings

The challenge in this competition is to identify which birds are calling in long recordings, given training data generated in meaningfully different contexts. This is the exact problem scientists face when trying to automate remote monitoring of bird populations. The competition builds on the previous one by adding soundscapes from new locations, more bird species, richer metadata about the test-set recordings, and soundscapes in the training set.

Files

train_short_audio - The bulk of the training data consists of short recordings of individual bird calls generously uploaded by users of xenocanto.org. These files have been downsampled to 32 kHz where applicable to match the test-set audio and converted to the ogg format. The training data should contain nearly all relevant files; there is no expected benefit to looking for more on xenocanto.org.

train_soundscapes - Audio files comparable to the test set. They are all roughly ten minutes long and in ogg format. The test set also includes soundscapes from the two recording locations represented here.

test_soundscapes - When you submit a notebook, the test_soundscapes directory will be populated with approximately 80 recordings to be used for scoring. These are roughly 10 minutes long and in ogg audio format. The file names include the recording date, which is especially useful for identifying migratory birds.

This folder also contains text files with the name and approximate coordinates of each recording location, as well as a csv listing the dates on which the test-set soundscapes were recorded.

test.csv - Only the first three rows are available for download; the full test.csv is in the hidden test set.

  • row_id: ID code for the row.

  • site: site ID.

  • seconds: the second ending the time window.

  • audio_id: ID code for the audio file.

train_metadata.csv - A wide range of metadata is provided for the training data. The most directly relevant fields are:

  • primary_label: a code for the bird species. You can review detailed information about a bird code by appending it to https://ebird.org/species/, e.g. https://ebird.org/species/amecro for the American Crow.

  • recordist: the user who provided the recording.

  • latitude & longitude: coordinates of where the recording was taken. Some bird species may have local "dialects", so you may want to seek geographic diversity in your training data.

  • date: while some birds can be heard year-round, such as alarm calls, others are limited to a specific season, so you may want to seek temporal diversity in your training data.

  • filename: the name of the associated audio file.

train_soundscape_labels.csv -

  • row_id: ID code for the row.

  • site: site ID.

  • seconds: the second ending the time window.

  • audio_id: ID code for the audio file.

  • birds: a space-delimited list of any bird songs present in the 5-second window. The label nocall means no call occurred.

sample_submission.csv - A properly formed sample submission file. Only the first three rows are public; the rest will be provided to your notebook as part of the hidden test set.

  • row_id

  • birds: a space-delimited list of any bird songs present in the 5-second window. Use the label nocall if no bird calls occur (a short sketch of this format follows below).
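To make the expected format concrete, here is a minimal sketch of what a few submission rows could look like. The two column names, row_id and birds, come from the description above; the row_id values and species codes are invented for illustration only.

import pandas as pd

# Hypothetical rows: each one covers a single 5-second window.
# "birds" is either a space-delimited list of species codes or "nocall".
sample = pd.DataFrame({
    'row_id': ['2782_SSW_5', '2782_SSW_10', '2782_SSW_15'],   # invented IDs
    'birds':  ['amecro', 'nocall', 'amecro norcar'],           # invented labels
})
sample.to_csv('submission.csv', index=False)
print(sample)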

Data download

https://storage.googleapis.com/kaggle-competitions-data/kaggle-v2/25954/2091745/bundle/archive.zip?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1619356084&Signature=OX5U42MLcM%2FpZL%2F6D5PXQ%2Bn5fp%2FZc9%2Bpoba38LWoQvDE4PSesfq%2FEnlQr7RXVQi22GiLeRuPYsY5tYuqiEHzBAR6vhT8d1jJH1qefNEeLJcXyKIrPiPmY2%2FHugeMlQLq3jYIUuXcQFp3s9tHP8roqjnWbOAPveHAaRVozq%2BMq8wit%2BNbvL%2Fg0n9pcamGxluroHvOLbe88IoDrHLO8j2Zpg4Z7p2oku8yR1VrrXjVmZB%2FZVbnZRS5vIh8P5bioXmnK2zuYxD4cJ5MxiBj6BNbJ4WpROH2gryWMfA670mh5VHFy6TjoldPp85keMepjVTzOolh43BlaLcPUbEo7qimcA%3D%3D&response-content-disposition=attachment%3B+filename%3Dbirdclef-2021.zip

Understanding the task

My understanding of the task: this competition is a 397-class classification of bird calls. The given training audio is cut with a 5-second window; each window's time-domain waveform is turned into a spectrogram via a Fourier transform (a mel spectrogram in the code below), and a neural network then classifies the spectrogram.

Note that the birds label is a space-delimited list of the bird calls occurring in each 5-second window, so the training audio has to be cut into one image per 5 seconds.
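As a rough sanity check of that windowing (a minimal sketch, assuming the 32 kHz sample rate used in this competition): each 5-second window contains 160,000 samples, and a roughly 10-minute soundscape therefore yields 120 non-overlapping windows, i.e. one image and, later, one prediction per window.

import numpy as np

SAMPLE_RATE = 32000
SIGNAL_LENGTH = 5                                  # seconds per window
chunk_size = SIGNAL_LENGTH * SAMPLE_RATE           # 160000 samples
sig = np.zeros(10 * 60 * SAMPLE_RATE)              # stand-in for a 10-minute recording
chunks = [sig[i:i + chunk_size]
          for i in range(0, len(sig), chunk_size)
          if len(sig[i:i + chunk_size]) == chunk_size]
print(len(chunks))                                 # 120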

Code

Converting audio to images

The audio-to-image conversion mainly uses librosa; each 5-second chunk is converted into a 224×224 single-channel (grayscale) image.

Install it with pip install librosa or conda install -c conda-forge librosa.
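Before running the full script below, a quick sanity check can confirm that librosa is installed and that a 5-second chunk really does map to a roughly 224×224 mel spectrogram with the hop length used here; the audio path is just a placeholder for any local clip.

import librosa
import numpy as np

sig, sr = librosa.load('some_clip.ogg', sr=32000, duration=5)   # placeholder path
hop_length = int(5 * 32000 / (224 - 1))                          # same formula as below
mel = librosa.feature.melspectrogram(y=sig, sr=sr, n_fft=2048,
                                     hop_length=hop_length,
                                     n_mels=224, fmin=20, fmax=16000)
print(mel.shape)   # expected (224, 224) for a full 5-second chunk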


   
import os
import warnings
warnings.filterwarnings(action='ignore')
import pandas as pd
import librosa
import numpy as np
from sklearn.utils import shuffle
from PIL import Image
from tqdm import tqdm

# Global vars
RANDOM_SEED = 1337
SAMPLE_RATE = 32000
SIGNAL_LENGTH = 5  # seconds
SPEC_SHAPE = (224, 224)  # height x width
FMIN = 20
FMAX = 16000

# Code adapted from:
# https://www.kaggle.com/frlemarchand/bird-song-classification-using-an-efficientnet
# Make sure to check out the entire notebook.

# Load metadata file
train = pd.read_csv('../input/birdclef-2021/train_metadata.csv', )

# Count the number of training samples per species.
# (The original notebook filtered to common species here; this version keeps all of them.)
birds_count = {}
for bird_species, count in zip(train.primary_label.unique(),
                               train.groupby('primary_label')['primary_label'].count().values):
    birds_count[bird_species] = count
most_represented_birds = [key for key, value in birds_count.items()]

TRAIN = train.query('primary_label in @most_represented_birds')
LABELS = sorted(TRAIN.primary_label.unique())

# Let's see how many species and samples we have left
print('NUMBER OF SPECIES IN TRAIN DATA:', len(LABELS))
print('NUMBER OF SAMPLES IN TRAIN DATA:', len(TRAIN))
print('LABELS:', most_represented_birds)

# Shuffle the training data
TRAIN = shuffle(TRAIN, random_state=RANDOM_SEED)

# Define a function that splits an audio file,
# extracts spectrograms and saves them in a working directory
def get_spectrograms(filepath, primary_label, output_dir):
    # Open the file with librosa (limited to the first 15 seconds)
    sig, rate = librosa.load(filepath, sr=SAMPLE_RATE, offset=None, duration=15)

    # Split signal into five-second chunks
    sig_splits = []
    for i in range(0, len(sig), int(SIGNAL_LENGTH * SAMPLE_RATE)):
        split = sig[i:i + int(SIGNAL_LENGTH * SAMPLE_RATE)]
        # End of signal?
        if len(split) < int(SIGNAL_LENGTH * SAMPLE_RATE):
            break
        sig_splits.append(split)

    # Extract mel spectrograms for each audio chunk
    s_cnt = 0
    saved_samples = []
    for chunk in sig_splits:
        hop_length = int(SIGNAL_LENGTH * SAMPLE_RATE / (SPEC_SHAPE[1] - 1))
        mel_spec = librosa.feature.melspectrogram(y=chunk,
                                                  sr=SAMPLE_RATE,
                                                  n_fft=2048,
                                                  hop_length=hop_length,
                                                  n_mels=SPEC_SHAPE[0],
                                                  fmin=FMIN,
                                                  fmax=FMAX)
        mel_spec = librosa.power_to_db(mel_spec, ref=np.max)

        # Normalize
        mel_spec -= mel_spec.min()
        mel_spec /= mel_spec.max()

        # Save as image file
        save_dir = os.path.join(output_dir, primary_label)
        if not os.path.exists(save_dir):
            os.makedirs(save_dir)
        save_path = os.path.join(save_dir, filepath.rsplit(os.sep, 1)[-1].rsplit('.', 1)[0] +
                                 '_' + str(s_cnt) + '.png')
        im = Image.fromarray(mel_spec * 255.0).convert("L")
        im.save(save_path)
        saved_samples.append(save_path)
        s_cnt += 1

    return saved_samples

print('FINAL NUMBER OF AUDIO FILES IN TRAINING DATA:', len(TRAIN))

# Parse audio files and extract training samples
input_dir = '../input/birdclef-2021/train_short_audio/'
output_dir = '../working/melspectrogram_dataset/'
samples = []
with tqdm(total=len(TRAIN)) as pbar:
    for idx, row in TRAIN.iterrows():
        pbar.update(1)
        if row.primary_label in most_represented_birds:
            audio_file_path = os.path.join(input_dir, row.primary_label, row.filename)
            samples += get_spectrograms(audio_file_path, row.primary_label, output_dir)
print(samples)

# Persist the list of generated spectrogram paths for the next step
str_samples = ','.join(samples)
TRAIN_SPECS = shuffle(samples, random_state=RANDOM_SEED)
filename = open('a.txt', 'w')
filename.write(str_samples)
filename.close()

The conversion produces grayscale mel spectrogram images like the example shown below (figure omitted).

 

Splitting the training and validation sets

The dataset is split with train_test_split from sklearn.model_selection, using a 7:3 ratio for training and validation.
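The split itself is a one-liner; a tiny illustration with toy file names (not the real spectrogram list) shows the 7:3 behaviour before it is applied to the full dataset below.

from sklearn.model_selection import train_test_split

files = ['img_{}.png'.format(i) for i in range(10)]              # toy file names
train_files, val_files = train_test_split(files, test_size=0.3, random_state=42)
print(len(train_files), len(val_files))                          # 7 3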


   
import os
import warnings
warnings.filterwarnings(action='ignore')
from sklearn.model_selection import train_test_split
import shutil

# Read back the list of spectrogram paths written in the previous step
filename = open('a.txt', 'r')
str_samples = filename.read()
filename.close()
str_samples = str_samples.replace("\\", "/")
samples = str_samples.split(',')

# 70/30 train/validation split
trainval_files, val_files = train_test_split(samples, test_size=0.3, random_state=42)
train_dir = '../working/train/'
val_dir = '../working/val/'

# Copy each spectrogram into <dir>/<label>/<filename>
def copyfiles(file, dir):
    filelist = file.split('/')
    filename = filelist[-1]
    label = filelist[-2]
    cpfile = dir + "/" + label
    if not os.path.exists(cpfile):
        os.makedirs(cpfile)
    cppath = cpfile + '/' + filename
    shutil.copy(file, cppath)

for file in trainval_files:
    copyfiles(file, train_dir)
for file in val_files:
    copyfiles(file, val_dir)

Training

EfficientNet-B3 is used as the pretrained model, and datasets.ImageFolder loads the dataset. Accuracy reaches roughly 95% after about 20 epochs.
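One detail worth calling out from the training script below is the step learning-rate schedule: exp_lr_scheduler multiplies the initial learning rate by 0.8 every 10 epochs, so the rate decays as in this small sketch of the arithmetic.

init_lr, lr_decay_epoch = 0.001, 10
for epoch in (0, 10, 20, 30):
    lr = init_lr * (0.8 ** (epoch // lr_decay_epoch))
    print(epoch, round(lr, 6))   # 0.001, 0.0008, 0.00064, 0.000512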


   
import torch.optim as optim
import torch
import torch.nn as nn
import torch.nn.parallel
from torch.autograd import Variable
import torch.optim
import torch.utils.data
import torch.utils.data.distributed
import torchvision.transforms as transforms
import torchvision.datasets as datasets
from efficientnet_pytorch import EfficientNet
import os
import time

# Hyperparameters
momentum = 0.9
BATCH_SIZE = 32
class_num = 397
EPOCHS = 500
lr = 0.001
use_gpu = True
net_name = 'efficientnet-b3'
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Data preprocessing
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
])
dataset_train = datasets.ImageFolder('../working/train', transform)
dataset_val = datasets.ImageFolder('../working/val', transform)
# Class-to-index mapping derived from the folder names
print(dataset_train.class_to_idx)
dset_sizes = len(dataset_train)
dset_sizes_val = len(dataset_val)
print("dset_sizes_val Length:", dset_sizes_val)
train_loader = torch.utils.data.DataLoader(dataset_train, batch_size=BATCH_SIZE, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset_val, batch_size=BATCH_SIZE, shuffle=True)

def exp_lr_scheduler(optimizer, epoch, init_lr=0.001, lr_decay_epoch=10):
    """Decay the learning rate by a factor of 0.8 every lr_decay_epoch epochs."""
    lr = init_lr * (0.8 ** (epoch // lr_decay_epoch))
    print('LR is set to {}'.format(lr))
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
    return optimizer

def train_model(model_ft, criterion, optimizer, lr_scheduler, num_epochs=50):
    train_loss = []
    since = time.time()
    best_model_wts = model_ft.state_dict()
    best_acc = 0.0
    model_ft.train(True)
    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)
        optimizer = lr_scheduler(optimizer, epoch)

        running_loss = 0.0
        running_corrects = 0
        count = 0
        for data in train_loader:
            inputs, labels = data
            labels = torch.squeeze(labels.type(torch.LongTensor))
            if use_gpu:
                inputs, labels = Variable(inputs.cuda()), Variable(labels.cuda())
            else:
                inputs, labels = Variable(inputs), Variable(labels)

            outputs = model_ft(inputs)
            loss = criterion(outputs, labels)
            _, preds = torch.max(outputs.data, 1)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            count += 1
            if count % 30 == 0 or outputs.size()[0] < BATCH_SIZE:
                print('Epoch:{}: loss:{:.3f}'.format(epoch, loss.item()))
                train_loss.append(loss.item())

            running_loss += loss.item() * inputs.size(0)
            running_corrects += torch.sum(preds == labels.data)

        epoch_loss = running_loss / dset_sizes
        epoch_acc = running_corrects.double() / dset_sizes
        print('Loss: {:.4f} Acc: {:.4f}'.format(epoch_loss, epoch_acc))

        if epoch_acc > best_acc:
            best_acc = epoch_acc
            best_model_wts = model_ft.state_dict()
            # Save best model
            save_dir = 'model'
            os.makedirs(save_dir, exist_ok=True)
            model_ft.load_state_dict(best_model_wts)
            model_out_path = save_dir + "/" + net_name + '.pth'
            torch.save(model_ft, model_out_path)

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    return train_loss, best_model_wts

model_ft = EfficientNet.from_pretrained('efficientnet-b3')
num_ftrs = model_ft._fc.in_features
model_ft._fc = nn.Linear(num_ftrs, class_num)
criterion = nn.CrossEntropyLoss()
if use_gpu:
    model_ft = model_ft.cuda()
    criterion = criterion.cuda()
optimizer = optim.Adam((model_ft.parameters()), lr=lr)
train_loss, best_model_wts = train_model(model_ft, criterion, optimizer, exp_lr_scheduler, num_epochs=EPOCHS)

Testing

The test audio is also cut into 5-second chunks and converted to images. The images generated here are single-channel, while the images loaded by datasets.ImageFolder during training are 3-channel; I inspected one and found the three channels are identical. Since the model expects 3-channel input, the single-channel test images must be expanded to 3 channels as well. I handled this in the transform by adding transforms.Lambda(lambda x: x.repeat(3, 1, 1)), which turns them into 3-channel tensors; everything else follows the same preprocessing logic as the training set.
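A minimal sketch of that channel expansion (independent of the full inference script below): a single-channel spectrogram tensor is repeated into three identical channels so it matches the 3-channel input the trained model expects.

import torch
import torchvision.transforms as transforms

to_rgb = transforms.Lambda(lambda x: x.repeat(3, 1, 1))
gray = torch.rand(1, 224, 224)        # stand-in for a transformed spectrogram
rgb = to_rgb(gray)
print(gray.shape, rgb.shape)          # torch.Size([1, 224, 224]) torch.Size([3, 224, 224])
print(torch.equal(rgb[0], rgb[1]))    # True -- all three channels are identical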


   
import os
import pandas as pd
import torch
import librosa
import numpy as np

# Global vars
RANDOM_SEED = 1337
SAMPLE_RATE = 32000
SIGNAL_LENGTH = 5  # seconds
SPEC_SHAPE = (224, 224)  # height x width
FMIN = 20
FMAX = 16000

# Load metadata file
train = pd.read_csv('../input/birdclef-2021/train_metadata.csv', )

# Count training samples per species (all species are kept, as in the training step)
birds_count = {}
for bird_species, count in zip(train.primary_label.unique(),
                               train.groupby('primary_label')['primary_label'].count().values):
    birds_count[bird_species] = count
most_represented_birds = [key for key, value in birds_count.items()]

TRAIN = train.query('primary_label in @most_represented_birds')
LABELS = sorted(TRAIN.primary_label.unique())

# Let's see how many species and samples we have left
print('NUMBER OF SPECIES IN TRAIN DATA:', len(LABELS))
print('NUMBER OF SAMPLES IN TRAIN DATA:', len(TRAIN))
print('LABELS:', most_represented_birds)

# First, get a list of soundscape files to process.
# We'll use the test_soundscape directory if it contains "ogg" files
# (which it only does when submitting the notebook),
# otherwise we'll use the train_soundscape folder to make predictions.
def list_files(path):
    return [os.path.join(path, f) for f in os.listdir(path) if f.rsplit('.', 1)[-1] in ['ogg']]

test_audio = list_files('../input/birdclef-2021/test_soundscapes')
if len(test_audio) == 0:
    test_audio = list_files('../input/birdclef-2021/train_soundscapes')
print('{} FILES IN TEST SET.'.format(len(test_audio)))

path = test_audio[0]
data = path.split(os.sep)[-1].rsplit('.', 1)[0].split('_')
print('FILEPATH:', path)
print('ID: {}, SITE: {}, DATE: {}'.format(data[0], data[1], data[2]))

# This is where we will store our results
pred = {'row_id': [], 'birds': []}

model = torch.load("./model/efficientnet-b3.pth")
model.eval()

import torchvision.transforms as transforms
from PIL import Image

transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.repeat(3, 1, 1)),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
])
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Analyze each soundscape recording
# Store results so that we can analyze them later
data = {'row_id': [], 'birds': []}
for path in test_audio:
    path = path.replace("\\", "/")

    # For each file: load it with librosa, split it into 5-second chunks,
    # extract a spectrogram per chunk and run the trained model on it
    # (a location-based post-filter could be added on top of this).
    sig, rate = librosa.load(path, sr=SAMPLE_RATE)

    # Split signal into 5-second chunks
    # Just like we did before (well, this could actually be a separate function)
    sig_splits = []
    for i in range(0, len(sig), int(SIGNAL_LENGTH * SAMPLE_RATE)):
        split = sig[i:i + int(SIGNAL_LENGTH * SAMPLE_RATE)]
        # End of signal?
        if len(split) < int(SIGNAL_LENGTH * SAMPLE_RATE):
            break
        sig_splits.append(split)

    # Get the spectrograms and run inference on each of them
    # This should be the exact same process as we used to
    # generate training samples!
    seconds, scnt = 0, 0
    for chunk in sig_splits:
        # Keep track of the end time of each chunk
        seconds += 5

        # Get the spectrogram
        hop_length = int(SIGNAL_LENGTH * SAMPLE_RATE / (SPEC_SHAPE[1] - 1))
        mel_spec = librosa.feature.melspectrogram(y=chunk,
                                                  sr=SAMPLE_RATE,
                                                  n_fft=2048,
                                                  hop_length=hop_length,
                                                  n_mels=SPEC_SHAPE[0],
                                                  fmin=FMIN,
                                                  fmax=FMAX)
        mel_spec = librosa.power_to_db(mel_spec, ref=np.max)

        # Normalize to match the value range we used during training.
        # That's something you should always double check!
        mel_spec -= mel_spec.min()
        mel_spec /= mel_spec.max()

        im = Image.fromarray(mel_spec * 255.0).convert("L")
        im = transform(im)
        print(im.shape)
        im.unsqueeze_(0)
        # Without this line the input stays on the CPU while the model is on the GPU,
        # and inference raises an error
        im = im.to(device)

        # Predict
        p = model(im)[0]
        print(p.shape)

        # Get highest scoring species
        idx = p.argmax()
        print(idx)
        species = LABELS[idx]
        print(species)
        score = p[idx]
        print(score)

        # Prepare submission entry
        spath = path.split('/')[-1].rsplit('_', 1)[0]
        print(spath)
        data['row_id'].append(path.split('/')[-1].rsplit('_', 1)[0] +
                              '_' + str(seconds))

        # Decide if it's a "nocall" or a species by applying a threshold
        if score > 0.75:
            data['birds'].append(species)
            scnt += 1
        else:
            data['birds'].append('nocall')

    print('SOUNDSCAPE ANALYSIS DONE. FOUND {} BIRDS.'.format(scnt))

# Make a new data frame and look at a few "results"
results = pd.DataFrame(data, columns=['row_id', 'birds'])
results.head()

# Convert our results to csv
results.to_csv("submission.csv", index=False)

 

 

Source: wanghao.blog.csdn.net, author: AI浩. Copyright belongs to the original author; please contact the author before reposting.

Original link: wanghao.blog.csdn.net/article/details/116031884
