python的PyVCF模块过滤突变,并计算杂合度
VCF的全称是Variant Call Format, 一般是指用变异检测工具分析得到的突变的结果格式
python中有一个模块可以过滤VCF - pyvcf。
安装方法:
pip install PyVCF==0.6.8
VCF文件内容的一个示例如下:
##fileformat=VCFv4.1 ##samtoolsVersion=0.1.19-44428cd ##reference=file:///data/hsa/hg19/hg19_genome.fa ##contig=<ID=chr10,length=135534747> ##contig=<ID=chr11,length=135006516> ##contig=<ID=chr12,length=133851895> ##contig=<ID=chr13,length=115169878> ##contig=<ID=chr14,length=107349540> ##contig=<ID=chr15,length=102531392> ##contig=<ID=chr16,length=90354753> ##contig=<ID=chr17,length=81195210> ##contig=<ID=chr18,length=78077248> ##contig=<ID=chr19,length=59128983> ##contig=<ID=chr1,length=249250621> ##contig=<ID=chr20,length=63025520> ##contig=<ID=chr21,length=48129895> ##contig=<ID=chr22,length=51304566> ##contig=<ID=chr2,length=243199373> ##contig=<ID=chr3,length=198022430> ##contig=<ID=chr4,length=191154276> ##contig=<ID=chr5,length=180915260> ##contig=<ID=chr6,length=171115067> ##contig=<ID=chr7,length=159138663> ##contig=<ID=chr8,length=146364022> ##contig=<ID=chr9,length=141213431> ##contig=<ID=chrM,length=16571> ##contig=<ID=chrX,length=155270560> ##contig=<ID=chrY,length=59373566> ... #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 19001_A.sorted.bam 19002_A.sorted.bam 19003_D.sorted.bam chr10 60263 . A C 95 . DP=50;VDB=2.316651e... GT:PL:DP:GQ 0/1:0,0,0:0:3 0/1:0,0,0:0:3 1/1:127,9,0:50:9 chr10 61372 . CA C 22.5 . INDEL;IS=20,0.04... GT:PL:DP:GQ 0/1:0,0,0:0:3 0/1:0,0,0:0:3 0/1:60,0,140:74:61
#开头的是文件的说明或meta信息,
每个突变的位点对应的信息是:
染色体编号,
碱基位置,
在dbSNP中的ID(一般是rs开头,没有则是.),
参考基因组在该位置的碱基,
突变的碱基,
该突变的质量(越高越好),
FILTER则是指是否通过过滤标准,
关于该位点的统计信息,DP是测序深度,等等
每个样本的信息的各列的简称
每个样本内的遗传信息数据
pyvcf模块第一个方便的地方是可以计算MAF和Heterozygosity(杂合度)。
这个杂合度是根据哈迪-温伯格定律计算的, MAF就是次要等位基因频率(Minor Allele Frequency)
比如说有一个位点是有两个等位基因显性A,隐性a。 在1000个人当中,人的染色体是二倍体的,那么A和a的总数是2000, 假如A出现了500次,那么a就是1500, 次要等位基因就是A, 频率为25%, 杂合度则等于 2* 0.25 * 0.75 = 0.375
pyvcf模块第二个方便的地方是可以指定范围,直接提取这个范围内的所有突变。这依赖于pysam模块, 使用方法如下:
import pysam import vcf for record in vcf_reader.fetch('20', 1110695, 1230237): print(record)
关于杂合度的详细参考: http://www.uwyo.edu/dbmcd/molmark/lect04/lect4.html
关于pyvcf的介绍: https://pyvcf.readthedocs.io/en/latest/INTRO.html
pyvcf的详细用法示例如下:
#!/usr/bin/env python # -*- coding: UTF-8 -*- import vcf vcf_reader = vcf.Reader(open("test_variants.vcf", 'r')) vcf_W = vcf.Writer(open("filtered_var.vcf", 'w'), vcf_reader) filtered_f = open("test_output_SNP.txt", 'w') filtered_INDEL = open("test_output_IND.txt" ,'w') filtered_f.write("Chr\tPOS\tRec_ID\tREF\tALT\tMAF\tHET\n") filtered_INDEL.write("Chr\tPOS\tRec_ID\tREF\tALT\tMAF\tHET\n") for record in vcf_reader: if all(s['DP'] >=5 for s in record.samples): if record.INFO['DP'] >= 25: vcf_W.write_record(record) # check this position is a SNP and its AF > 0.5, and then write this line into test_output_SNP.txt if record.is_snp and record.INFO['AF1'] > 0.5: filtered_f.write("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\n".format\ (record.CHROM, record.POS, record.ID, record.REF, record.ALT, 1-record.INFO['AF1'], record.heterozygosity)) # check this position is a SNP and its AF <= 0.5, and then write this line into test_output_SNP.txt elif record.is_snp and record.INFO['AF1'] <= 0.5: filtered_f.write("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\n".format\ (record.CHROM, record.POS, record.ID, record.REF, record.ALT, record.INFO['AF1'], record.heterozygosity)) # check INDEL and AF1 and write INDEL into test_output_IND.txt elif record.is_indel and record.INFO['AF1'] > 0.5: filtered_INDEL.write("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\n".format\ (record.CHROM, record.POS, record.ID, record.REF, record.ALT, 1-record.INFO['AF1'], record.heterozygosity)) # check INDEL and AF1 and write INDEL into test_output_IND.txt elif record.is_indel and record.INFO['AF1'] <= 0.5: filtered_INDEL.write("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\n".format\ (record.CHROM, record.POS, record.ID, record.REF, record.ALT, record.INFO['AF1'], record.heterozygosity)) filtered_f.close() filtered_INDEL.close() print ("Filtration of VCF file finished.")
vcfReader 读取一个vcf文件 vcf_reader = vcf.Reader(open("test_variants.vcf", 'r'))
vcfWriter会写入一个vcf文件,需要提供vcfReader的 header信息 , vcf_W = vcf.Writer(open("filtered_var.vcf", 'w'), vcf_reader) 。
- 点赞
- 收藏
- 关注作者
评论(0)