python - Remove SNPs with wrong alleles -


I have two files like this:

  1. the reference panel (referencepanel.csv)

    "id","position","allele0","allele1","allele1_frequency"
    "seq-rs1010355",55102179,"t","c",0.098
    "seq-rs272408",55103603,"c","t",0.787
    "seq-rs11669899",55104559,"a","t",0.029
    "imm_19_59798585",55106773,"a","g",0.499

  2. a bim file (myfile.bim)

    19    19:55102179    0    55102179    c    t
    19    19:55103603    0    55103603    c    t
    19    19:55104559    0    55104559    g    c
    19    19:55106773    0    55106773    a    t

I want to delete the rows in the bim file whose two alleles differ from those in the reference panel. In other words, keep only the rows that have the same alleles as the reference panel; the order of the alleles does not matter.

Example:

Reference alleles:

    "seq-rs1010355",55102179,"t","c",0.098
    "seq-rs272408",55103603,"c","t",0.787
    "seq-rs11669899",55104559,"a","t",0.029
    "imm_19_59798585",55106773,"a","g",0.499

bim file (myfile.bim):

    19    19:55102179    0    55102179    c    t
    19    19:55103603    0    55103603    c    t
    19    19:55104559    0    55104559    g    c
    19    19:55106773    0    55106773    a    t

I want to keep the following rows:

    19    19:55102179    0    55102179    c    t
    19    19:55103603    0    55103603    c    t

I managed to extract the positions from the reference panel using these lines:

    # create an empty list
    positions = []
    # populate the list with the positions (second CSV column)
    with open("referencepanel.csv") as panel:
        for line in panel:
            columns = line.split(",")
            positions.append(columns[1])
    # remove the first element, which corresponds to the header
    positions.pop(0)

But I am stuck here. I hope you can help me. Thank you in advance!

If you're not against using awk, you can use the following command:

    awk -F'[",]*' 'NR==FNR && $4 && $5 {ref[$4][$5]=1} NR>FNR {FS=" *"} NR>FNR && ref[$6][$7]' reference.csv myfile.bim

which results in:

    19    19:55102179    0    55102179    c    t
    19    19:55103603    0    55103603    c    t
    19    19:55106773    0    55106773    a    t

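Note that the last line is kept because it matches the 4th line of the reference file (the one with a, t): the command compares allele pairs only and does not check the position.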

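Explanation:

-F'[",]*' sets a field separator that matches the CSV delimiters (quotes and commas) when parsing the reference file.

NR==FNR && $4 && $5 {ref[$4][$5]=1} stores the allele pairs (c, t, g, a) from the reference file in the array ref. The ref[$4][$5] syntax is an array of arrays, which needs GNU awk 4.0 or later.

NR>FNR {FS=" *"} changes the awk field separator to spaces in order to parse the second file.

NR>FNR && ref[$6][$7] prints a line of the second file if its 6th and 7th columns match a pair stored in the array.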
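
If you would rather stay in Python, the idea can also be sketched so that it checks the position as well as the alleles, which is what the question seems to ask for. This is only a minimal sketch under a few assumptions: the function names load_reference and filter_bim and the output file name are made up for illustration, the bim file is assumed to be whitespace-separated with the position in the 4th column and the alleles in the 5th and 6th, and rows that do not have exactly six fields are dropped. Alleles are compared as sets, so their order does not matter.

```python
import csv


def load_reference(path):
    """Map each position to the set of its two alleles in the reference panel."""
    alleles_by_position = {}
    with open(path, newline="") as handle:
        # DictReader uses the header row, so columns are looked up by name.
        for row in csv.DictReader(handle):
            alleles_by_position[row["position"]] = frozenset(
                (row["allele0"].lower(), row["allele1"].lower())
            )
    return alleles_by_position


def filter_bim(bim_path, reference, out_path):
    """Copy to out_path only the bim rows whose allele pair equals the
    reference pair at the same position. Returns the number of rows kept."""
    kept = 0
    with open(bim_path) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.split()
            if len(fields) != 6:
                continue  # assumption: blank or incomplete rows are dropped
            position = fields[3]
            alleles = frozenset((fields[4].lower(), fields[5].lower()))
            # Positions missing from the reference panel are dropped too.
            if reference.get(position) == alleles:
                dst.write(line)
                kept += 1
    return kept


# Example usage (input file names from the question, output name hypothetical):
# reference = load_reference("referencepanel.csv")
# filter_bim("myfile.bim", reference, "myfile.filtered.bim")
```

Keying the dictionary on the position means a row is only compared with the reference entry at the same position, so a bim row cannot be kept just because its allele pair happens to appear elsewhere in the panel.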

