当前位置:   article > 正文

生物信息数据格式:fastq格式_fastq转fasta

fastq转fasta

格式说明

FASTQ文件中每个序列通常有四行:

1.第一行:必须以“@”开头,后面跟着唯一的序列ID标识符,然后跟着可选的序列描述内容,标识符与描述内容用空格分开;

2.第二行:序列字符(核酸为[AGCTN]+,蛋白为氨基酸字符);

3.第三行:必须以“+”开头,后面跟着可选的ID标识符和可选的描述内容,如果“+”后面有内容,该内容必须与第一行“@”后的内容相同;

4.第四行:碱基质量字符,每个字符对应第二行相应位置碱基或氨基酸的质量,该字符可以按一定规则转换为碱基质量得分,碱基质量得分可以反映该碱基的错误率。这一行的字符数与第二行中的字符数必须相同。

查看fatsq文件

!cat ./data/test1.fq
  • 1
@HWI-ST1223:80:D1FMTACXX:2:1101:1243:2213 1:N:0:AGTCAA
TCTGTGTAGCCNTGGCTGTCCTGGAACTCACTTTGTAGACCAGGCTGGCATGCACCACCACNNNCGGCTCATTTGTCTTTNNTTTTTGTTTTGTTCTGTA
+
BCCFFFFFFHH#4AFHIJJJJJJJJJJJJJJJJJIJIJJJJJGHIJJJJJJJJJJJJJIIJ###--5ABECFFDDEEEEE##,5=@B8?CDD<AD>C:@>
@HWI-ST1223:80:D1FMTACXX:2:1101:1375:2060 1:N:0:AGTCAA
NTGCTGAGCCACGACAAGGATCCCAGAGGGCCNAGCCCTGCATCTTGTATGGACCAGTTACNCATCAAAAGAGACTACTGTAGGCACCATCAATCAGATC
+
#1:DDDD;?CFFHDFEEIGIIIIIIG;DHFGG#)0?BFBDHBFF<FCFEFD;@DD@A=7?E#,,,;=(>3;=;;C>ACCC@CCCCCBBBCCAACCCCCCC
@HWI-ST1223:80:D1FMTACXX:2:1101:1383:2091 1:N:0:AGTCAA
NGTTCGTGTGGAACCTGGCGCTAAACCATTCGTAGACGACCTGCTTCTGGGTCGGGGTTTCGTACGTAGCAGAGCAGCTCCCTCGCTGCGATCTATTGAA
+
#1=DDFDFHHHHHJGJJJJJJJJJJJJJJJIJIGDHIHIGIJJJJJJJIIIGHHFDD3>BDDBDDDDDDDDDDBDCCBDDDDDDDDDDDBBDDDDEEACD
@HWI-ST1223:80:D1FMTACXX:2:1101:1452:2138 1:N:0:AGTCAA
NTCTAGGAGGTCTAGAAAGCCCAGGCCACCGGTACAAACATCAAGGGTGTTACGGATGTGCCGCTCTGAACCTCCAGGACGACTTTGATTTCAACTACAA
+
#4=DFFEFHHHHHJJJJJIJJJJHIIJGJJJJ@GIIJJJJJJIJJJJFGHIIIJJHHHDFFFFDDDDDDDDDDDDCDDDDDDDDDDDCCCEDEDDDDDDD 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16

每个碱基的质量值与错误率之间的关系:

Q = − 10 l o g 10 P Q = -10 log_{10}P Q=10log10P

其中P代表该碱基被测序错误的概率,如果该碱基测序出错的概率为0.001,则Q应该为30,那么30+33=63,那么63对应的ASCii码为“?”,则在第四行中该碱基对应的质量代表值即为“?”

print(ord("?"))
  • 1
63
  • 1

一般地,碱基质量从0-40,既ASCii码为从 “!”(0+33)到“I”(40+33)。以上是sanger中心采用记录read测序质量的方法

对于每个碱基的质量编码标示,不同的软件采用不同的方案,目前有5种方案:

  • Sanger,Phred quality score,值的范围从0到92,对应的ASCII码从33到126,但是对于测序数据(raw read data)质量得分通常小于60,序列拼接或者mapping可能用到更大的分数。

  • Solexa/Illumina 1.0, Solexa/Illumina quality score,值的范围从-5到63,对应的ASCII码从59到126,对于测序数据,得分一般在-5到40之间;

  • Illumina 1.3+,Phred quality score,值的范围从0到62对应的ASCII码从64到126,低于测序数据,得分在0到40之间;

  • Illumina 1.5+,Phred quality score,但是0到2作为另外的标示,详见http://solexaqa.sourceforge.net/questions.htm#illumina

  • Illumina 1.8+

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-6lozhwm1-1583370774136)()]

ASCII值小于等于58(相应的质量得分小于等于25)对应的字符只有在Phred+33的编码中被使用,所有Phred+64所使用的字符的ASCII值都大于等于59。在通常情况下,ASCII值大于等于74的字符只出现在Phred+64中。利用这些信息即可在程序中进行判断

实例演练

判断fastq序列编码是Phred33(Illumina1.8+) or Phred64(Illumina1.3+)

def fq_phred_check(inputfile): 
    """判断fastq序列编码"""
    fastq_file = open(inputfile,'r',encoding='utf-8') 
    count = 1 
    for line in fastq_file: 
        line_strip = line.strip() 
        if count % 4 == 0: 
            for each in line_strip: 
                ASCII_each = ord(each) 
                if ASCII_each > 75: 
                    phred_value = 64
                    break
                elif ASCII_each < 58:
                    phred_value = 33
                    break      
        count += 1 
    fastq_file.close()  
    return phred_value

inputfile = './data/test1.fq'
phred_value = fq_phred_check(inputfile)
print(phred_value)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
33
  • 1

fastq转换fasta格式

def fastq_fasta(inputfile): 
    """将fastq转换成fasta序列格式""" 
    fastq_file = open(inputfile,'r',encoding='utf-8') 
    out_file_name = inputfile.strip().split('/')[-1].split('.')[0] + '.fasta' 
    output_fasta = open(out_file_name,'w',encoding='utf-8') 
    i = 0 
    for line in fastq_file: 
        if i % 4 == 0: 
            line_new = line.strip().split('\n')[0].replace('@','>') 
            output_fasta.write(line_new + '\n') 
        elif (i - 1) % 4 == 0: 
            output_fasta.write(line)
        i = i + 1 
    print("fastq transform fasta success", out_file_name)
    fastq_file.close() 
    output_fasta.close() 
    
def main(): 
    inputfile = './data/test1.fq' 
    fastq_fasta(inputfile) 
    
if __name__ == '__main__': 
    main()

  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
fastq transform fasta success test1.fasta
  • 1

Linux 操作fastq

获取数据

mkdir -p ~/biosoft
cd ~/biosoft
wget https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.3.4.3/bowtie2-2.3.4.3-linux-x86_64.zip

unzip bowtie2-2.3.4.3-linux-x86_64.zip
cd ~/biosoft/bowtie2-2.3.4.3-linux-x86_64/example/reads
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6

统计reads_1.fq文件中共有多少条序列信息

[sunchengquan 08:07:20 ~/local/app/bowtie2-2.2.4/example/reads]
$ ll
总用量 8.4M
-rwxrwx--- 1 sunchengquan sunchengquan 4.0M 10月 23 2014 longreads.fq
-rwxrwx--- 1 sunchengquan sunchengquan 2.2M 10月 23 2014 reads_1.fq
-rwxrwx--- 1 sunchengquan sunchengquan 2.2M 10月 23 2014 reads_2.fq
-rwxrwx--- 1 sunchengquan sunchengquan 9.1K 10月 23 2014 simulate.pl
[sunchengquan 08:07:27 ~/local/app/bowtie2-2.2.4/example/reads]
$ wc reads_1.fq
  40000   40000 2285692 reads_1.fq
[sunchengquan 08:07:42 ~/local/app/bowtie2-2.2.4/example/reads]
$ wc -l reads_1.fq
40000 reads_1.fq

[sunchengquan 09:06:02 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo $[`wc -l reads_1.fq |cut -d' ' -f1`/4]
10000
[sunchengquan 09:06:52 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo $((`wc -l reads_1.fq |cut -d' ' -f1`/4))
10000
[sunchengquan 09:07:10 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo $(expr `wc -l reads_1.fq |cut -d' ' -f1`/4)
40000/4
[sunchengquan 09:07:35 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo $(expr `wc -l reads_1.fq |cut -d' ' -f1` / 4)
10000
[sunchengquan 09:12:17 ~/local/app/bowtie2-2.2.4/example/reads]
$ let c=`wc -l reads_1.fq |cut -d' ' -f1`/4
[sunchengquan 09:12:36 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo $c
10000
[sunchengquan 09:16:11 ~/local/app/bowtie2-2.2.4/example/reads]
$ a=`wc -l reads_1.fq |cut -d' ' -f1`
[sunchengquan 09:16:23 ~/local/app/bowtie2-2.2.4/example/reads]
$ b=4
[sunchengquan 09:16:27 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo `expr $a / $b`
10000
[sunchengquan 09:18:03 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo `expr $a / 4`
10000

  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42

输出reads_1.fq文件中的标识符(即以@开头的那一行)

[sunchengquan 09:18:22 ~/local/app/bowtie2-2.2.4/example/reads]
$ grep '^@' reads_1.fq |head -3
@r1
@r2
@r3

[sunchengquan 09:21:59 ~/local/app/bowtie2-2.2.4/example/reads]
$ grep '^@' reads_1.fq |cut -f1 |cut -c1|head -3
@
@
@


[sunchengquan 09:22:38 ~/local/app/bowtie2-2.2.4/example/reads]
$ cat reads_1.fq |paste - - - - |wc
  10000   40000 2285692
[sunchengquan 09:21:51 ~/local/app/bowtie2-2.2.4/example/reads]
$ cat reads_1.fq |paste - - - - |cut -f1|cut -c1|head -3
@
@
@

[sunchengquan 09:27:19 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==1) {print $0}}' reads_1.fq|head -3
@r1
@r2
@r3

[sunchengquan 09:28:47 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==1) print $0}' reads_1.fq|head -3
@r1
@r2
@r3
[sunchengquan 09:31:38 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==1){print}}' reads_1.fq|head -3
@r1
@r2
@r3
[sunchengquan 09:32:28 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{print NR}' reads_1.fq|head -3
1
2
3
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43

输出reads_1.fq文件中所有序列的信息(即每个序列的第二行)

[sunchengquan 09:34:51 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print $0 }}' reads_1.fq|head -1
TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG
  • 1
  • 2
  • 3

输出reads_1.fq文件中‘+’及其后面的描述信息(即每个序列的第三行)

[sunchengquan 09:35:06 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==3){print $0 }}' reads_1.fq|head -3
+
+
+
  • 1
  • 2
  • 3
  • 4
  • 5

输出reads_1.fq文件中质量值信息(即每个序列的第四行)

[sunchengquan 09:36:24 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==0){print $0 }}' reads_1.fq|head -1
+"@6<:27(F&5)9)"B:%B+A-%5A?2$HCB0B+0=D<7E/<.03#!.F77@6B==?C"7>;))%;,3-$.A06+<-1/@@?,26">=?*@'0;$:;??G+:#+(A?9+10!8!?()?7C>
  • 1
  • 2
  • 3

计算reads_1.fq文件含有N碱基的reads个数

[sunchengquan 09:39:13 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print $0 }}' reads_1.fq|grep N|wc
   6429    6429  782897

[sunchengquan 09:41:03 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print $0 }}' reads_1.fq|grep -c N
6429
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7

计算reads_1.fq文件里面的序列碱基总数

[sunchengquan 12:46:57 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print}}' reads_1.fq |grep -o '[ATGCN]' |wc
1088399 1088399 2176798

[sunchengquan 12:47:12 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print length}}' reads_1.fq |head -1
122
[sunchengquan 12:49:36 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print length($0)}}' reads_1.fq |head -1
122
[sunchengquan 12:58:27 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print length($0)}}' reads_1.fq |paste -s -d +|bc
1088399
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13

计算reads_1.fq文件中所有reads的N碱基总数

[sunchengquan 12:58:50 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print}}' reads_1.fq |grep -o '[N]' |wc
  26001   26001   52002
  • 1
  • 2
  • 3

计算reads_1.fq中测序碱基质量值恰好为Q20的个数

[sunchengquan 17:14:00 ~/local/app/bowtie2-2.2.4/example/reads]
$ python -c 'print (chr(33+20))'
5
[sunchengquan 17:15:32 ~/local/app/bowtie2-2.2.4/example/reads]
$ python -c 'print (chr(33+30))'
?

[sunchengquan 17:15:43 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==0){print}}' reads_1.fq |grep -o '[5]'|wc -l 
21369
[sunchengquan 17:17:18 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==0){print}}' reads_1.fq |grep -o '5'|wc -l 
21369
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13

计算reads_1.fq中测序碱基质量值恰好为Q30的个数

[sunchengquan 17:17:27 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==0){print}}' reads_1.fq |grep -o '?'|wc -l 
21574
  • 1
  • 2
  • 3

计算reads_1.fq中所以序列的第一位碱基的ATCGNactg分布情况

[sunchengquan 17:26:36 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print}}' reads_1.fq |cut -c1|sort |uniq -c
   2184 A
   2203 C
   2219 G
   1141 N
   2253 T
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7

将reads_1.fq转为reads_1.fa文件(即将fastq转化为fasta)

[sunchengquan 17:33:13 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==1){print ">"$0} else if(NR%4==2){print}}' reads_1.fq |head -2
>@r1
TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG
[sunchengquan 17:45:57 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==1){print ">"substr($0,2)} else if(NR%4==2){print}}' reads_1.fq |head -2
>r1
TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG


[sunchengquan 17:46:36 ~/local/app/bowtie2-2.2.4/example/reads]
$ cat reads_1.fq |paste - - - - |cut -f1,2|tr '\t' '\n'|head -2
@r1
TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG
[sunchengquan 17:48:14 ~/local/app/bowtie2-2.2.4/example/reads]
$ cat reads_1.fq |paste - - - - |cut -f1,2|tr '\t' '\n'|tr '@' '>'|head -2
>r1
TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18

统计上述reads_1.fa文件中共有多少条序列

[sunchengquan 17:50:29 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo $[`wc -l reads_1.fa |cut -d' ' -f1`/2]
10000
  • 1
  • 2
  • 3

计算reads_1.fa文件中总的碱基序列的GC数量

[sunchengquan 18:02:31 ~/local/app/bowtie2-2.2.4/example/reads]
$ cat reads_1.fa |paste - - |cut -f2|grep -o '[GCgc]' |wc -l
529983
  • 1
  • 2
  • 3

删除reads_1.fa文件中每条序列的N碱基

[sunchengquan 18:04:46 ~/local/app/bowtie2-2.2.4/example/reads]
$ cat reads_1.fa |tr -d "N" |grep N
  • 1
  • 2

删除reads_1.fa文件中含有N碱基的序列

[sunchengquan 18:09:32 ~/local/app/bowtie2-2.2.4/example/reads]
$ cat reads_1.fa |paste - - |grep -v N|tr '\t' '\n'|less
  • 1
  • 2

删除reads_1.fa文件中短于65bp的碱基序列

[sunchengquan 18:18:44 ~/local/app/bowtie2-2.2.4/example/reads]
$ cat reads_1.fa |paste - - |awk '{if(length($2)>=65){print $1"\n"$2}}'|tail -2
>r9999
TATCGCGCTGTGACGATGCTAATCCCAAACCTTACCCAACCCACCTGGTCACGGACTGTTAAGCCGCTGTATGACGCTCTGGTGGTGCAATGCCACAAAGAANAGTCAATC
  • 1
  • 2
  • 3
  • 4

删除reads_1.fa文件每条序列的前后五个碱基

echo "AGJKLOIUYTKIOFYTFFLHHJWERDFCVBNM" |awk '{print substr($0,6,length($0)-10)}'
OIUYTKIOFYTFFLHHJWERDF

[sunchengquan 18:28:20 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%2==0){print substr($0,6,length($0)-10)}}' reads_1.fa|head -1
GCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTT
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6

删除reads_1.fa文件中的长于125bp的序列

[sunchengquan 18:28:57 ~/local/app/bowtie2-2.2.4/example/reads]
$ cat reads_1.fa |paste - - |awk '{if(length($2)<=125){print $1"\n"$2}}'|tail -2
>r10000
GGTGATGCGCGGCTCCGTGCCGCCAAAGCCGTCCGGCACTGACTNGTCGCAG
  • 1
  • 2
  • 3
  • 4

查看reads_1.fq中每条序列的第一位碱基的质量值的平均值

[sunchengquan 19:09:46 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==0) print}' reads_1.fq |cut -c1 >tmp

[sunchengquan 19:36:19 ~/local/app/bowtie2-2.2.4/example/reads]
$ cat tmp |awk 'BEGIN{for(i=0;i<256;++i) ord[sprintf("%c",i)]=i}BEGIN{count=0}{ count+=ord[$1]-30}END{print count,count/NR}'
193621 19.3621
[sunchengquan 19:36:46 ~/local/app/bowtie2-2.2.4/example/reads]
$ cat tmp |awk 'BEGIN{for(i=0;i<256;++i) ord[sprintf("%c",i)]=i}BEGIN{count=0}{ count+=(ord[$1]-30)}END{print count,count/NR}'
193621 19.3621
[sunchengquan 19:38:57 ~/local/app/bowtie2-2.2.4/example/reads]
$ cat tmp |awk 'BEGIN{for(i=0;i<256;++i) ord[sprintf("%c",i)]=i}BEGIN{count=0}{ count+=(ord[$1])}END{print count,count/NR}'
493621 49.3621
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/weixin_40725706/article/detail/360513
推荐阅读
相关标签
  

闽ICP备14008679号