
How to convert raw SNP genotypes into a 0/1/2 matrix


Load the example data

library(SNPassoc)
data(SNPs)
SNPs[1:8,1:8]
id casco    sex blood.pre  protein snp10001 snp10002 snp10003
 1     1 Female      13.7 75640.52       TT       CC       GG
 2     1 Female      12.7 28688.22       TT       AC       GG
 3     1 Female      12.9 17279.59       TT       CC       GG
 4     1   Male      14.6 27253.99       CT       CC       GG
 5     1 Female      13.4 38066.57       TT       AC       GG
 6     1 Female      11.3  9872.46       TT       CC       GG
 7     1 Female      11.9 11132.90       TT       AC       GG
 8     1   Male      12.4 29973.43       TT       AC       GG

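Extract the SNP data and convert the format

The key point here is that the row names must hold the sample IDs, and every remaining column must contain SNP genotypes only.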

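# Drop the phenotype columns (2 to 5), keeping id and the SNP columns
myDat <- SNPs[, -(2:5)]
# Use the sample IDs as row names, then drop the id column itself
row.names(myDat) <- myDat$id
myDat <- myDat[, -1]
myDat[1:5, 1:5]
# str(myDat)
# Convert the data frame to a matrix for create.gpData
myDat <- as.matrix(myDat)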
snp10001 snp10002 snp10003 snp10004 snp10005
      TT       CC       GG       GG       GG
      TT       AC       GG       GG       AG
      TT       CC       GG       GG       GG
      CT       CC       GG       GG       GG
      TT       AC       GG       GG       GG
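
Before handing the matrix to a recoding routine, it can help to confirm that it really has the shape described above. The following is a minimal sketch on a toy data frame; the column names (`snpA`, `snpB`) and IDs are made up for illustration, and the check assumes genotypes are two-letter strings over A, C, G and T.

```r
# Toy data frame standing in for SNPs (hypothetical columns and IDs)
toy <- data.frame(
  id   = c(101, 102, 103),
  sex  = c("Female", "Male", "Female"),
  snpA = c("TT", "CT", NA),
  snpB = c("CC", "AC", "CC"),
  stringsAsFactors = FALSE
)

# Keep only the SNP columns and move the sample IDs into the row names
geno <- toy[, c("snpA", "snpB")]
row.names(geno) <- as.character(toy$id)
geno <- as.matrix(geno)

# Every cell should be either missing or a two-letter genotype
is_genotype <- is.na(geno) | grepl("^[ACGT]{2}$", geno)
all_ok <- all(is_genotype)
all_ok
```

If `all_ok` is `FALSE`, `which(!is_genotype, arr.ind = TRUE)` lists the offending cells, for example a phenotype column that was not dropped.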

Convert with the synbreed package, which can both impute missing values and recode the genotypes

Recoding alleles from character/factor/numeric into the number of copies of the minor alleles, i.e. 0, 1 and 2. In codeGeno, in the first step heterozygous genotypes are coded as 1. From the other genotypes, the less frequent genotype is coded as 2 and the remaining genotype as 0.
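In other words, genotypes are recoded according to their frequency: the more frequent homozygote becomes 0, the heterozygote becomes 1, and the less frequent homozygote becomes 2.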
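
The coding rule quoted above can be sketched in base R for a single marker. The helper `recode_marker` is hypothetical and only illustrates the rule; it is not how synbreed implements `codeGeno`, and it assumes two-letter character genotypes with `NA` for missing calls.

```r
# Illustrative helper (not part of synbreed): heterozygote = 1,
# less frequent homozygote = 2, remaining homozygote = 0, missing stays NA.
recode_marker <- function(g) {
  allele1 <- substr(g, 1, 1)
  allele2 <- substr(g, 2, 2)
  het <- !is.na(g) & allele1 != allele2

  # Count the homozygous genotypes, most frequent first
  hom_counts <- sort(table(g[!is.na(g) & !het]), decreasing = TRUE)

  out <- rep(NA_integer_, length(g))
  out[het] <- 1L
  if (length(hom_counts) >= 1) out[which(g == names(hom_counts)[1])] <- 0L
  if (length(hom_counts) >= 2) out[which(g == names(hom_counts)[2])] <- 2L
  out
}

g <- c("TT", "TT", "CT", "CC", NA, "TT", "CT")
recode_marker(g)
# 0 0 1 2 NA 0 1
```

Applying such a helper column by column with `apply(myDat, 2, recode_marker)` would give a 0/1/2 matrix without any filtering or imputation, which is what `codeGeno` adds on top.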

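library(synbreed)
# Wrap the genotype matrix in a gpData object
cp <- create.gpData(geno = myDat)
# Drop markers with > 10 % missing values or maf < 0.01, recode to 0/1/2,
# and fill the remaining missing values by random imputation
cp.dat <- codeGeno(gpData = cp, label.heter = "alleleCoding", maf = 0.01, nmiss = 0.1,
                   impute = TRUE, impute.type = "random", verbose = TRUE)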

   step 1  : 1 marker(s) removed with > 10 % missing values 
   step 2  : Recoding alleles 
   step 4  : 12 marker(s) removed with maf < 0.01 
   step 7  : Imputing of missing values 
   step 7d : Random imputing of missing values 
   step 8  : No recoding of alleles necessary after imputation 
   step 9  : 0 marker(s) removed with maf < 0.01 
   step 10 : No duplicated markers removed 
   End     : 22 marker(s) remain after the check

     Summary of imputation 
    total number of missing values                : 37 
    number of random imputations                  : 37 
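
To see why markers are dropped in steps 1 and 4, the missing rate and the minor allele frequency can be computed by hand. The sketch below uses a made-up toy matrix (marker names `m1` to `m3` are hypothetical) and is only an approximate reading of the `nmiss` and `maf` filters, not synbreed's own code; it assumes two-letter genotypes.

```r
# Toy genotype matrix: rows are samples, columns are markers
geno <- cbind(
  m1 = c("AA", "AG", "GG", "AA", NA),    # polymorphic, 20 % missing
  m2 = c("CC", "CC", "CC", "CC", "CC"),  # monomorphic, so MAF is 0
  m3 = c("TT", "CT", "TT", "TT", "TT")   # a single heterozygote
)

# Share of missing genotypes per marker (compare with nmiss = 0.1)
miss_rate <- colMeans(is.na(geno))

# Minor allele frequency per marker, from the alleles of non-missing genotypes
maf <- apply(geno, 2, function(g) {
  alleles <- unlist(strsplit(g[!is.na(g)], ""))
  freq <- table(alleles) / length(alleles)
  if (length(freq) < 2) 0 else min(freq)
})

# Markers that would pass both thresholds used in the call above
keep <- miss_rate <= 0.1 & maf >= 0.01
keep
```

Here `m1` fails on missing values, `m2` fails on MAF because only one allele is present, and `m3` passes. Monomorphic markers are the likely reason a marker is removed in step 4.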

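If you get an error saying there are more than two genotypes, it is because missing values have not been accounted for. Write the data to a CSV file and read it back in, so that the missing values are parsed as NA.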

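# Write the genotype matrix to disk and read it back so missing values are parsed as NA
write.csv(myDat, "snps.csv")
ge <- read.csv("snps.csv", header = TRUE, row.names = 1, na.strings = "NA")
# On R >= 4.0, add stringsAsFactors = TRUE above to get per-genotype counts from summary()
summary(ge)
ge <- as.matrix(ge)
gp <- create.gpData(geno = ge)
cp.dat <- codeGeno(gpData = gp, label.heter = "alleleCoding", maf = 0.01, nmiss = 0.1,
                   impute = TRUE, impute.type = "random", verbose = TRUE)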

 snp10001 snp10002 snp10003   snp10004   snp10005 snp10006 snp10007 snp10008
 CC:12    AA: 5    GG  :144   GG  :156   AA: 3    AA:157   CC:157   CC:104  
 CT:53    AC:78    NA's: 13   NA's:  1   AG:70                      CG: 44  
 TT:92    CC:74                          GG:84                      GG:  9  

 snp10009  snp100010  snp100011 snp100012 snp100013  snp100014 snp100015
 AA  :72   TT  :147   CC:  1    CC  : 3   AA  :101   AA  :27   AG: 13   
 AG  :79   NA's: 10   CG:  2    CG  :68   AG  : 35   AC  :74   GG:144   
 GG  : 5              GG:154    GG  :84   GG  :  9   CC  :52            
 NA's: 1                        NA's: 2   NA's: 12   NA's: 4            
 snp100016  snp100017 snp100018 snp100019 snp100020 snp100021 snp100022 
 GG  :152   CC  : 5   CC  : 5   CC:32     AA:  9    GG:157    AA  :156  
 NA's:  5   CT  :83   CT  :84   CG:75     AG: 43              NA's:  1  
            TT  :67   TT  :67   GG:50     GG:105                        
            NA's: 2   NA's: 1                                           
 snp100023 snp100024 snp100025 snp100026  snp100027 snp100028 snp100029
 AA  : 5   CC  :14   CC:157    GG  :156   CC  :68   CC  :34   AA  :14  
 AT  :78   CT  :51             NA's:  1   CG  :82   CT  :72   AG  :48  
 TT  :71   TT  :91                        GG  : 5   TT  :50   GG  :94  
 NA's: 3   NA's: 1                        NA's: 2   NA's: 1   NA's: 1  
 snp100030 snp100031  snp100032 snp100033 snp100034 snp100035 
 AA:157    TT  :102   AA  :34   AA  :34   CC  :14   TT  :146  
           NA's: 55   AG  :70   AG  :69   CT  :48   NA's: 11  
                      GG  :52   GG  :49   TT  :94             
                      NA's: 1   NA's: 5   NA's: 1             


   step 1  : 1 marker(s) removed with > 10 % missing values 
   step 2  : Recoding alleles 
   step 4  : 12 marker(s) removed with maf < 0.01 
   step 7  : Imputing of missing values 
   step 7d : Random imputing of missing values 
   step 8  : No recoding of alleles necessary after imputation 
   step 9  : 0 marker(s) removed with maf < 0.01 
   step 10 : No duplicated markers removed 
   End     : 22 marker(s) remain after the check

     Summary of imputation 
    total number of missing values                : 37 
    number of random imputations                  : 37 
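
The effect of the CSV round trip can be reproduced on a small example. This sketch assumes that the missing genotypes were stored as the literal text "NA" rather than as real missing values, which is one possible cause of the error and not a confirmed diagnosis; the sample and marker names are made up.

```r
# Toy matrix in which a missing genotype is stored as the text "NA"
geno <- matrix(c("AA", "AG", "NA", "GG"), ncol = 1,
               dimnames = list(paste0("s", 1:4), "m1"))
n_before <- length(unique(geno[, "m1"]))  # the text "NA" counts as a genotype

# Round trip through a temporary CSV file, parsing "NA" as missing
tmp <- tempfile(fileext = ".csv")
write.csv(geno, tmp)
ge <- as.matrix(read.csv(tmp, header = TRUE, row.names = 1, na.strings = "NA"))
unlink(tmp)

n_missing <- sum(is.na(ge))
n_after <- length(unique(na.omit(ge[, "m1"])))
c(before = n_before, after = n_after, missing = n_missing)
```

After the round trip the marker has three genotypes plus one real `NA`, which is the form the recoding step expects.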

Inspect the converted result

gee <- cp.dat$geno
gee[1:5,1:5]
  snp10001 snp10002 snp10005 snp10008 snp10009
1        0        0        0        0        0
2        0        1        1        0        1
3        0        0        0        0        0
4        1        0        0        0        0
5        0        1        0        0        1