How to convert raw SNP genotypes into a 0/1/2 matrix

Author: AllinToyou | 2024-04-21 14:36:18

Load the example data

library(SNPassoc)
data(SNPs)
SNPs[1:8, 1:8]

id	casco	sex	blood.pre	protein	snp10001	snp10002	snp10003
1	1	Female	13.7	75640.52	TT	CC	GG
2	1	Female	12.7	28688.22	TT	AC	GG
3	1	Female	12.9	17279.59	TT	CC	GG
4	1	Male	14.6	27253.99	CT	CC	GG
5	1	Female	13.4	38066.57	TT	AC	GG
6	1	Female	11.3	9872.46	TT	CC	GG
7	1	Female	11.9	11132.90	TT	AC	GG
8	1	Male	12.4	29973.43	TT	AC	GG

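Extract the SNP data and convert the format

The key point is that the sample IDs go into row.names, so that every remaining column contains only SNP genotypes.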

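# Drop the phenotype columns (casco, sex, blood.pre, protein)
myDat <- SNPs[, -(2:5)]
# Use the sample ID as row names, then drop the id column
row.names(myDat) <- myDat$id
myDat <- myDat[, -1]
myDat[1:5, 1:5]
# str(myDat)
myDat <- as.matrix(myDat)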

snp10001	snp10002	snp10003	snp10004	snp10005
TT	CC	GG	GG	GG
TT	AC	GG	GG	AG
TT	CC	GG	GG	GG
CT	CC	GG	GG	GG
TT	AC	GG	GG	GG
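
Before recoding, it can help to confirm that the matrix really has the layout described above: sample IDs in the row names and nothing but two-letter genotypes (or NA) in the cells. The following is a minimal base-R sketch on a small made-up matrix. `toy` and its values are hypothetical stand-ins for `myDat`, not the SNPassoc data.

```r
# Toy stand-in for myDat (hypothetical values, not the SNPassoc data)
toy <- matrix(c("TT", "CT", NA,
                "CC", "AC", "CC"),
              nrow = 3,
              dimnames = list(c("1", "2", "3"), c("snpA", "snpB")))

# TRUE where an entry is missing or is exactly two letters from A/C/G/T
is_valid <- is.na(toy) | grepl("^[ACGT]{2}$", toy)

stopifnot(is.character(toy))        # genotypes are stored as text
stopifnot(!is.null(rownames(toy)))  # sample IDs live in the row names
all_valid <- all(is_valid)
print(all_valid)
```

Running the same checks on `myDat` should flag any leftover phenotype column or unexpected genotype string before it reaches codeGeno.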

Convert with the synbreed package, which can impute missing values and recode the genotypes

Recoding alleles from character/factor/numeric into the number of copies of the minor alleles, i.e. 0, 1 and 2. In codeGeno, in the first step heterozygous genotypes are coded as 1. From the other genotypes, the less frequent genotype is coded as 2 and the remaining genotype as 0.
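In other words, genotypes are recoded by frequency: the more frequent homozygote becomes 0, the heterozygote becomes 1, and the less frequent homozygote becomes 2.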
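
To make the coding rule concrete, here is a minimal base-R sketch that applies it to a single marker. `recode_marker` is a hypothetical helper written for illustration; it is not part of synbreed and does not reproduce codeGeno's filtering or imputation. It assumes two-letter genotype strings, and ties between the two homozygotes are resolved in favor of whichever comes first alphabetically.

```r
# Hypothetical helper, not part of synbreed: recode one marker to 0/1/2.
recode_marker <- function(g) {
  a1 <- substr(g, 1, 1)
  a2 <- substr(g, 2, 2)
  het <- a1 != a2                        # heterozygous calls (NA stays NA)
  homo <- table(g[!is.na(g) & !het])     # counts of the homozygous genotypes
  out <- rep(NA_integer_, length(g))
  out[which(het)] <- 1L                  # heterozygote -> 1
  if (length(homo) > 0) {
    major <- names(homo)[which.max(homo)]
    out[which(!het & g == major)] <- 0L  # more frequent homozygote -> 0
    out[which(!het & g != major)] <- 2L  # less frequent homozygote -> 2
  }
  out
}

g <- c("TT", "TT", "CT", "CC", "TT", NA)
recode_marker(g)
# 0 0 1 2 0 NA
```

Missing genotypes stay NA here; in the workflow below, imputation is handled by codeGeno.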

library(synbreed)
cp <- create.gpData(geno = myDat)
cp.dat <- codeGeno(gpData = cp, label.heter = "alleleCoding", maf = 0.01, nmiss = 0.1,
                   impute = TRUE, impute.type = "random", verbose = TRUE)
6

   step 1  : 1 marker(s) removed with > 10 % missing values 
   step 2  : Recoding alleles 
   step 4  : 12 marker(s) removed with maf < 0.01 
   step 7  : Imputing of missing values 
   step 7d : Random imputing of missing values 
   step 8  : No recoding of alleles necessary after imputation 
   step 9  : 0 marker(s) removed with maf < 0.01 
   step 10 : No duplicated markers removed 
   End     : 22 marker(s) remain after the check

     Summary of imputation 
    total number of missing values                : 37 
    number of random imputations                  : 37 
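
Steps 1 and 4 of the log drop markers by missing rate (`nmiss = 0.1`) and by minor allele frequency (`maf = 0.01`). The sketch below computes both quantities by hand on a small made-up 0/1/2 matrix to show what these thresholds mean. It is an illustration under the usual definitions (MAF as the smaller of p and 1 - p, with p the mean genotype divided by 2), not synbreed's internal code, and `geno` is hypothetical data.

```r
# Toy 0/1/2 matrix with missing values (hypothetical data)
geno <- cbind(
  m1 = c(0, 0, 1, 2, 0, 0, 1, 0, 0, 0),
  m2 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),    # monomorphic marker
  m3 = c(NA, NA, 1, 0, 0, 2, 0, 0, 0, 0)   # 20 % missing
)

miss_rate <- colMeans(is.na(geno))        # fraction of missing calls per marker
p <- colMeans(geno, na.rm = TRUE) / 2     # frequency of the counted allele
maf <- pmin(p, 1 - p)                     # minor allele frequency

# Same thresholds as in the codeGeno call: nmiss = 0.1, maf = 0.01
keep <- miss_rate <= 0.1 & maf >= 0.01
print(round(cbind(miss_rate, maf), 3))
print(colnames(geno)[keep])
```

In this toy example only `m1` passes: `m2` has a MAF of 0 and `m3` exceeds the 10 % missing-value limit.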

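If codeGeno fails with an error about more than two genotypes, the missing values have not been recognized as NA. Write the matrix to a CSV file and read it back in with na.strings = "NA" so that they are parsed as missing.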

write.csv(myDat, "snps.csv")
# stringsAsFactors = TRUE makes summary() tabulate the genotypes
# (since R 4.0 read.csv returns character columns by default)
ge <- read.csv("snps.csv", header = TRUE, row.names = 1, na.strings = "NA",
               stringsAsFactors = TRUE)
summary(ge)
ge <- as.matrix(ge)
gp <- create.gpData(geno = ge)
cp.dat <- codeGeno(gpData = gp, label.heter = "alleleCoding", maf = 0.01, nmiss = 0.1,
                   impute = TRUE, impute.type = "random", verbose = TRUE)

 snp10001 snp10002 snp10003   snp10004   snp10005 snp10006 snp10007 snp10008
 CC:12    AA: 5    GG  :144   GG  :156   AA: 3    AA:157   CC:157   CC:104  
 CT:53    AC:78    NA's: 13   NA's:  1   AG:70                      CG: 44  
 TT:92    CC:74                          GG:84                      GG:  9  

 snp10009  snp100010  snp100011 snp100012 snp100013  snp100014 snp100015
 AA  :72   TT  :147   CC:  1    CC  : 3   AA  :101   AA  :27   AG: 13   
 AG  :79   NA's: 10   CG:  2    CG  :68   AG  : 35   AC  :74   GG:144   
 GG  : 5              GG:154    GG  :84   GG  :  9   CC  :52            
 NA's: 1                        NA's: 2   NA's: 12   NA's: 4            
 snp100016  snp100017 snp100018 snp100019 snp100020 snp100021 snp100022 
 GG  :152   CC  : 5   CC  : 5   CC:32     AA:  9    GG:157    AA  :156  
 NA's:  5   CT  :83   CT  :84   CG:75     AG: 43              NA's:  1  
            TT  :67   TT  :67   GG:50     GG:105                        
            NA's: 2   NA's: 1                                           
 snp100023 snp100024 snp100025 snp100026  snp100027 snp100028 snp100029
 AA  : 5   CC  :14   CC:157    GG  :156   CC  :68   CC  :34   AA  :14  
 AT  :78   CT  :51             NA's:  1   CG  :82   CT  :72   AG  :48  
 TT  :71   TT  :91                        GG  : 5   TT  :50   GG  :94  
 NA's: 3   NA's: 1                        NA's: 2   NA's: 1   NA's: 1  
 snp100030 snp100031  snp100032 snp100033 snp100034 snp100035 
 AA:157    TT  :102   AA  :34   AA  :34   CC  :14   TT  :146  
           NA's: 55   AG  :70   AG  :69   CT  :48   NA's: 11  
                      GG  :52   GG  :49   TT  :94             
                      NA's: 1   NA's: 5   NA's: 1             


   step 1  : 1 marker(s) removed with > 10 % missing values 
   step 2  : Recoding alleles 
   step 4  : 12 marker(s) removed with maf < 0.01 
   step 7  : Imputing of missing values 
   step 7d : Random imputing of missing values 
   step 8  : No recoding of alleles necessary after imputation 
   step 9  : 0 marker(s) removed with maf < 0.01 
   step 10 : No duplicated markers removed 
   End     : 22 marker(s) remain after the check

     Summary of imputation 
    total number of missing values                : 37 
    number of random imputations                  : 37 
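
When the genotype-count error mentioned above appears, a quick way to locate the offending markers is to count the distinct non-missing values in each column. The sketch below uses a made-up matrix in which missing calls are stored as the literal string "NA" or as an empty string. Which placeholders occur in real data is an assumption to verify, and `toy` is hypothetical.

```r
# Toy genotype matrix (hypothetical): the strings "NA" and "" are not real NAs
toy <- cbind(
  snpA = c("TT", "CT", "CC", "NA"),
  snpB = c("GG", "GG", "", "GG"),
  snpC = c("AA", "AG", "GG", NA)
)

# Number of distinct non-missing values per marker
n_geno <- apply(toy, 2, function(x) length(unique(x[!is.na(x)])))
print(n_geno)

# Turn the placeholder strings into real missing values and recount
fixed <- toy
fixed[fixed %in% c("NA", "")] <- NA
n_fixed <- apply(fixed, 2, function(x) length(unique(x[!is.na(x)])))
print(n_fixed)
```

A biallelic SNP has at most three genotypes, so any marker that reports more than three distinct values before the fix is worth inspecting.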

Inspect the converted result

gee <- cp.dat$geno
gee[1:5, 1:5]

	snp10001	snp10002	snp10005	snp10009
1	0	0	0	0
2	0	1	1	1
3	0	0	0	0
4	1	0	0	0
5	0	1	0	1
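
As a final check, the recoded matrix should contain only 0, 1 and 2, with no missing values left after imputation. The sketch below runs that check and tabulates genotype counts per marker. `gee_toy` is a hypothetical stand-in that copies the first two columns of the output above, so the same lines can be applied to `gee` itself.

```r
# Toy stand-in for cp.dat$geno (first two columns of the output above)
gee_toy <- matrix(c(0, 0, 0, 1, 0,
                    0, 1, 0, 0, 1),
                  nrow = 5,
                  dimnames = list(as.character(1:5), c("snp10001", "snp10002")))

stopifnot(all(gee_toy %in% 0:2))   # only 0, 1, 2 are present
stopifnot(!anyNA(gee_toy))         # imputation left no gaps

# Genotype counts per marker: rows are markers, columns are the codes 0/1/2
counts <- sapply(0:2, function(k) colSums(gee_toy == k))
colnames(counts) <- c("0", "1", "2")
print(counts)
```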
