How to split a large file and write it into individual records using an identical pattern in Perl?
I have a multi-GB file consisting of thousands of individual files, each identified by an ID.
Each component file consists of 4 comment lines followed by its contents. The second comment line of each component has a unique ID. I want to split the file into individual files named after that ID.
There is also a second file, a size list, containing the IDs and their sizes. I want to have the matching line from it written as the first line in each output file.
Examples
Size list
    a_1 100
    bxx_xx 25
    p_b 342
    1a_z0 343
    z867 200
    bws 111
Input file
    # ver xx
    # query: a_1
    # database: xx
    # usage: xx
    a_1 .*
    a_1 .*
    a_1 .*
    a_1 .*
    a_1 .*
    # ver
    # query: bxx_xx
    # database: xxxxxx
    # usage: xxxxx
    bxx_xx .*
    bxx_xx .*
    bxx_xx .*
    bxx_xx .*
    bxx_xx .*
    bxx_xx .*
    bxx_xx .*
    bxx_xx .*
    bxx_xx .*
    bxx_xx .*
    # ver
    # query: p_b
    # database: xxxxxx
    # usage: xxxxx
    p_b.*
    p_b.*
    p_b.*
    p_b.*
    p_b.*
    p_b.*
    # ver
    # query: 1a_z0
    # database: xxxxxx
    # usage: xxxxx
    1a_z0.*
    1a_z0.*
    1a_z0.*
    1a_z0.*
    # ver
    # query: z867
    # database: xxxxxx
    # usage: xxxxx
    # ver
    # query: bws
    # database: xxxxxx
    # usage: xxxxx
    bws.*
    bws.*
    bws.*
The output should look like this, one file per ID (id.txt):
a_1.txt
    a_1 100
    # ver xx
    # query: a_1
    # database: xx
    # usage: xx
    a_1 .*
    a_1 .*
    a_1 .*
    a_1 .*
    a_1 .*
bxx_xx.txt
    bxx_xx 25
    # ver
    # query: bxx_xx
    # database: xxxxxx
    # usage: xxxxx
    bxx_xx .*
    bxx_xx .*
    bxx_xx .*
    bxx_xx .*
    bxx_xx .*
    bxx_xx .*
    bxx_xx .*
    bxx_xx .*
    bxx_xx .*
    bxx_xx .*
p_b.txt
    p_b 342
    # ver
    # query: p_b
    # database: xxxxxx
    # usage: xxxxx
    p_b.*
    p_b.*
    p_b.*
    p_b.*
    p_b.*
    p_b.*
1a_z0.txt
    1a_z0 343
    # ver
    # query: 1a_z0
    # database: xxxxxx
    # usage: xxxxx
    1a_z0.*
    1a_z0.*
    1a_z0.*
    1a_z0.*
z867.txt
    z867 200
    # ver
    # query: z867
    # database: xxxxxx
    # usage: xxxxx
bws.txt
    bws 111
    # ver
    # query: bws
    # database: xxxxxx
    # usage: xxxxx
    bws.*
    bws.*
    bws.*
In some cases, there may be no contents after the 4 comment lines. For example,
    # ver
    # query: z867
    # database: xxxxxx
    # usage: xxxxx
I still want these written to a new file, z867.txt.
My code is as follows:
    while ( $line = <BOF> ) {
        chomp $line;
        $cpline = $line;
        next if ( $cpline =~ /^query/ );
        if ( $cpline =~ /^#\squery\:\s(\w.*)/ ) {
            $query = $1;
            foreach $sizeline (@sizelist) {
                $sizeline =~ /^(\w.*)\t(\d+)$/;
                $seqid  = $1;
                $seqlen = $2;
                if ( $seqid eq $query ) {
                    print "query\t$seqlen\n";
                }
            }
        }
        $cpline = "";
        if ( $line =~ /^#/ ) {
            print "$line\n";
        }
        if ( $line !~ /^#/ ) {
            if ( $line =~ /^((.+)\_.+)\t((.+)\_.+)\t(.+)\t(.+)\t.+\t.+\t.+\t.+\t.+\t.+\t.+\t\s?.+$/ ) {
                $queryid = $1;
                if ( $seqid eq $queryid ) {
                    print "$line\n";
                }
            }
        }
    }
I'm confused about what you're asking, as your Perl code seems quite different from what the question describes. However, here's a simple solution that opens a new file at every # query: line in the comments, and generates the output you want.
This program expects the path to the input file as a parameter on the command line.
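    use strict;
    use warnings 'all';
    use autodie;

    my $out_fh;    # handle of the output file currently being written
    my @header;    # comment lines buffered until all four have been read

    while ( <> ) {

        if ( /^#/ ) {

            push @header, $_;

            # The query line carries the ID, so start a new output file here
            if ( /query:\s*(\S+)/ ) {
                my $file = "$1.txt";
                print qq{creating "$file"\n};
                open $out_fh, '>', $file;
            }

            # Once all four comment lines are in, write them out together
            if ( @header == 4 ) {
                print $out_fh @header;
                @header = ();
            }
        }
        elsif ( $out_fh ) {
            print $out_fh $_;
        }
    }

    close $out_fh if $out_fh;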
Output
    creating "a_1.txt"
    creating "bxx_xx.txt"
    creating "p_b.txt"
    creating "1a_z0.txt"
    creating "z867.txt"
    creating "bws.txt"
Update
Here's a new version of the code that complies with your revised specification. (Please don't do that.)
    use strict;
    use warnings 'all';
    use autodie;

    # Hard-coded file names for testing; remove this line to use the command line
    @ARGV = qw/ 4l.txt size_list.txt /;
    my ( $input, $size_list ) = @ARGV;

    # Load the size list into a hash keyed by ID
    my %sizes;
    {
        open my $fh, '<', $size_list;
        while ( <$fh> ) {
            my ($file, $size) = split;
            $sizes{$file} = $size if defined $size;
        }
    }

    my $out_fh;
    my @header;

    # Read only the input file; a bare <> would also read the size list
    open my $in_fh, '<', $input;

    while ( <$in_fh> ) {

        if ( /^#/ ) {

            push @header, $_;

            if ( /query:\s*(\S+)/ ) {
                my $id   = $1;
                my $size = $sizes{$id};
                die qq{no size found for id "$id"} unless defined $size;

                my $file = "$id.txt";
                print qq{creating "$file"\n};
                open $out_fh, '>', $file;

                # The ID and size go first, ahead of the comment lines
                print $out_fh "$id\t$size\n";
            }

            if ( @header == 4 ) {
                print $out_fh @header;
                @header = ();
            }
        }
        elsif ( $out_fh ) {
            print $out_fh $_;
        }
    }

    close $out_fh if $out_fh;
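Before running this on a multi-GB file, it may help to check the splitting logic on a small sample without creating any files. The following is only a sketch under that assumption: split_records is a hypothetical helper (not part of the program above) that applies the same header-buffering idea to a string and returns a hash of file name to content.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: split the text in $input into
# records and return a hash reference mapping each output file name to the
# content it would receive. $sizes is a hash reference mapping ID => size.
sub split_records {
    my ( $input, $sizes ) = @_;

    my ( %files, @header, $current );

    # split /^/ breaks the string into lines and keeps the newlines
    for my $line ( split /^/, $input ) {

        if ( $line =~ /^#/ ) {
            push @header, $line;

            # The query line names the record, so start a new entry here
            if ( $line =~ /query:\s*(\S+)/ ) {
                my $id = $1;
                die qq{no size found for id "$id"\n}
                    unless defined $sizes->{$id};
                $current = "$id.txt";
                $files{$current} = "$id\t$sizes->{$id}\n";
            }

            # After four comment lines, append the buffered header
            if ( @header == 4 ) {
                $files{$current} .= join '', @header if defined $current;
                @header = ();
            }
        }
        elsif ( defined $current ) {
            $files{$current} .= $line;
        }
    }

    return \%files;
}

# Sample data taken from the question: one empty record, one with contents
my $sample = <<'END';
# ver
# query: z867
# database: xxxxxx
# usage: xxxxx
# ver
# query: bws
# database: xxxxxx
# usage: xxxxx
bws.*
bws.*
END

my $files = split_records( $sample, { z867 => 200, bws => 111 } );
print "$_:\n$files->{$_}\n" for sort keys %$files;
```

Holding everything in memory is only suitable for small samples; for the real input, the streaming version that writes each record straight to its file is the one to use.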