python - Remove SNPs with wrong alleles
I have 2 files like this:
The reference panel (referencepanel.csv):
"id","position","allele0","allele1","allele1_frequency"
"seq-rs1010355",55102179,"t","c",0.098
"seq-rs272408",55103603,"c","t",0.787
"seq-rs11669899",55104559,"a","t",0.029
"imm_19_59798585",55106773,"a","g",0.499
A bim file (myfile.bim):
19 19:55102179 0 55102179 c t
19 19:55103603 0 55103603 c t
19 19:55104559 0 55104559 g c
19 19:55106773 0 55106773 t
I want to delete the rows in the bim file whose 2 alleles differ from those in the reference panel. In other words, I want to keep the rows that have the same alleles as the reference panel; the order does not matter.
Example:
Reference alleles:
"seq-rs1010355",55102179,"t","c",0.098
"seq-rs272408",55103603,"c","t",0.787
"seq-rs11669899",55104559,"a","t",0.029
"imm_19_59798585",55106773,"a","g",0.499
Bim file (myfile.bim):
19 19:55102179 0 55102179 c t
19 19:55103603 0 55103603 c t
19 19:55104559 0 55104559 g c
19 19:55106773 0 55106773 t
I want to keep the following rows:
19 19:55102179 0 55102179 c t
19 19:55103603 0 55103603 c t
I managed to extract the positions from the reference panel using these lines:
# create an empty list
positions = []
# populate the list
for line in open("referencepanel.csv"):
    columns = line.split(",")
    positions.append(columns[1])
# remove the first element, which corresponds to the header
positions.pop(0)
But I am stuck here. I hope someone can help me. Thanks in advance!
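One possible way to get from the positions to the actual filtering, as a minimal sketch rather than a definitive solution. The helper names load_reference and filter_bim are made up for illustration. The sketch assumes that the bim columns are whitespace-separated, with the position in column 4 and the alleles in columns 5 and 6, and that each bim row should be compared with the reference row at the same position. Storing the two alleles in a frozenset makes the comparison independent of order. The sample data is the first three rows of each file from the question.

```python
import csv
import io


def load_reference(handle):
    """Map each position (kept as a string) to the unordered pair of reference alleles."""
    reader = csv.DictReader(handle)
    return {
        row["position"]: frozenset((row["allele0"].lower(), row["allele1"].lower()))
        for row in reader
    }


def filter_bim(lines, reference):
    """Yield bim lines whose two alleles equal the reference alleles at the same position."""
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or incomplete rows
        position = fields[3]
        alleles = frozenset((fields[4].lower(), fields[5].lower()))
        if reference.get(position) == alleles:
            yield line.rstrip("\n")


# In-memory samples standing in for referencepanel.csv and myfile.bim
reference_csv = io.StringIO(
    '"id","position","allele0","allele1","allele1_frequency"\n'
    '"seq-rs1010355",55102179,"t","c",0.098\n'
    '"seq-rs272408",55103603,"c","t",0.787\n'
    '"seq-rs11669899",55104559,"a","t",0.029\n'
)
bim_lines = [
    "19 19:55102179 0 55102179 c t\n",
    "19 19:55103603 0 55103603 c t\n",
    "19 19:55104559 0 55104559 g c\n",
]

kept = list(filter_bim(bim_lines, load_reference(reference_csv)))
for line in kept:
    print(line)
```

With the real files, the same two functions could be given open("referencepanel.csv", newline="") and open("myfile.bim") instead of the in-memory samples.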
If you're not against using awk (GNU awk 4.0 or later, since the command relies on arrays of arrays), you can use the following command:
awk -F'[",]*' 'NR==FNR && $4 && $5 {ref[$4][$5]=1} NR>FNR {FS=" *"} NR>FNR && ref[$6][$7]' reference.csv myfile.bim
which results in:
19 19:55102179 0 55102179 c t
19 19:55103603 0 55103603 c t
19 19:55106773 0 55106773 t
Note that the last line matches the 4th line of the reference file (the one with a, t), because the command compares only the alleles and not the positions.
Explanation:
-F'[",]*'
sets a field separator that matches the CSV delimiters (quotes and commas), for parsing the reference file
NR==FNR && $4 && $5 {ref[$4][$5]=1}
stores the allele pairs (c, t, g, a) from the reference file in the array ref
NR>FNR {FS=" *"}
changes awk's field separator to spaces, to parse the second file
NR>FNR && ref[$6][$7]
prints a line of the second file if its 6th and 7th columns match a pair stored in the array
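For readers more at home in Python, the steps above correspond roughly to the sketch below. It is an approximation, not a line-for-line port: keep_by_allele_pair is a hypothetical name, the csv module stands in for the regex field separator, and the allele columns of the bim file are assumed to be the 5th and 6th whitespace-separated fields. The a t alleles on the last sample row are also an assumption, because that row is incomplete in the question. Like the awk command, it compares ordered allele pairs only and ignores positions.

```python
import csv
import io


def keep_by_allele_pair(reference_handle, bim_lines):
    """Keep bim lines whose (allele, allele) pair occurs anywhere in the reference."""
    # First pass: remember every ordered (allele0, allele1) pair from the reference.
    # The header row is stored too, which is harmless because no bim row matches it.
    pairs = set()
    for row in csv.reader(reference_handle):
        if len(row) >= 4 and row[2] and row[3]:
            pairs.add((row[2], row[3]))

    # Second pass: keep bim lines whose two allele columns form a stored pair.
    kept = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) >= 6 and (fields[4], fields[5]) in pairs:
            kept.append(line)
    return kept


reference_csv = io.StringIO(
    '"id","position","allele0","allele1","allele1_frequency"\n'
    '"seq-rs1010355",55102179,"t","c",0.098\n'
    '"seq-rs272408",55103603,"c","t",0.787\n'
    '"seq-rs11669899",55104559,"a","t",0.029\n'
    '"imm_19_59798585",55106773,"a","g",0.499\n'
)
bim_lines = [
    "19 19:55102179 0 55102179 c t",
    "19 19:55103603 0 55103603 c t",
    "19 19:55104559 0 55104559 g c",
    "19 19:55106773 0 55106773 a t",  # alleles assumed for illustration
]

kept = keep_by_allele_pair(reference_csv, bim_lines)
for line in kept:
    print(line)
```

The last row is kept even though the reference has a, g at position 55106773, because the pair a, t appears on another reference row. Matching on position as well would require keying the lookup on the position column.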