How to split a large file into individual record files using a repeating pattern in Perl?

- March 15, 2014

I have a multi-GB file that consists of thousands of individual files concatenated together, each identified by an ID.

Each component file consists of 4 comment lines followed by its contents. The second comment line of each block contains a unique ID. I want to split the file into individual files named after that ID.

There is also a second file, a size list, that pairs each ID with a size. I want the matching line from the size list written as the first line of each output file.

Examples

Size list

    a_1 100
    bxx_xx  25
    p_b 342
    1a_z0   343
    z867    200
    bws 111

Input file

    # ver xx
    # query: a_1
    # database: xx
    # usage: xx
    a_1 .*
    a_1 .*
    a_1 .*
    a_1 .*
    a_1 .*
    # ver
    # query: bxx_xx
    # database: xxxxxx
    # usage: xxxxx
    bxx_xx  .*
    bxx_xx  .*
    bxx_xx  .*
    bxx_xx  .*
    bxx_xx  .*
    bxx_xx  .*
    bxx_xx  .*
    bxx_xx  .*
    bxx_xx  .*
    bxx_xx  .*
    # ver
    # query: p_b
    # database: xxxxxx
    # usage: xxxxx
    p_b.*
    p_b.*
    p_b.*
    p_b.*
    p_b.*
    p_b.*
    # ver
    # query: 1a_z0
    # database: xxxxxx
    # usage: xxxxx
    1a_z0.*
    1a_z0.*
    1a_z0.*
    1a_z0.*
    # ver
    # query: z867
    # database: xxxxxx
    # usage: xxxxx
    # ver
    # query: bws
    # database: xxxxxx
    # usage: xxxxx
    bws.*
    bws.*
    bws.*

The output should look like this, one file per ID (ID.txt):

a_1.txt

    a_1 100
    # ver xx
    # query: a_1
    # database: xx
    # usage: xx
    a_1 .*
    a_1 .*
    a_1 .*
    a_1 .*
    a_1 .*

bxx_xx.txt

    bxx_xx  25
    # ver
    # query: bxx_xx
    # database: xxxxxx
    # usage: xxxxx
    bxx_xx  .*
    bxx_xx  .*
    bxx_xx  .*
    bxx_xx  .*
    bxx_xx  .*
    bxx_xx  .*
    bxx_xx  .*
    bxx_xx  .*
    bxx_xx  .*
    bxx_xx  .*

p_b.txt

    p_b 342
    # ver
    # query: p_b
    # database: xxxxxx
    # usage: xxxxx
    p_b.*
    p_b.*
    p_b.*
    p_b.*
    p_b.*
    p_b.*

1a_z0.txt

    1a_z0   343
    # ver
    # query: 1a_z0
    # database: xxxxxx
    # usage: xxxxx
    1a_z0.*
    1a_z0.*
    1a_z0.*
    1a_z0.*

z867.txt

    z867    200
    # ver
    # query: z867
    # database: xxxxxx
    # usage: xxxxx

bws.txt

    bws 111
    # ver
    # query: bws
    # database: xxxxxx
    # usage: xxxxx
    bws.*
    bws.*
    bws.*

In some cases, there may be no contents after the 4 comment lines. For example,

    # ver
    # query: z867
    # database: xxxxxx
    # usage: xxxxx

I still want these written to a new file, z867.txt.

My code follows:

    while ( $line = <BOF> ) {

        chomp $line;
        $cpline = $line;

        next if ( $cpline =~ /^query/ );

        if ( $cpline =~ /^#\squery\:\s(\w.*)/ ) {

            $query = $1;

            foreach $sizeline (@sizelist) {

                $sizeline =~ /^(\w.*)\t(\d+)$/;
                $seqid  = $1;
                $seqlen = $2;

                if ( $seqid eq $query ) {
                    print "query\t$seqlen\n";
                }
            }
        }

        $cpline = "";

        if ( $line =~ /^#/ ) {
            print "$line\n";
        }

        if ( $line !~ /^#/ ) {

            if ( $line =~ /^((.+)\_.+)\t((.+)\_.+)\t(.+)\t(.+)\t.+\t.+\t.+\t.+\t.+\t.+\t.+\t\s?.+$/ ) {

                $queryid = $1;

                if ( $seqid eq $queryid ) {
                    print "$line\n";
                }
            }
        }
    }

I'm confused about what you are asking, because your Perl code seems to do something different from what the question describes. However, here's a simple solution that opens a new file at every # query: line in the comments and generates the output you want.

This program expects the path to the input file as a parameter on the command line.

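    use strict;
    use warnings 'all';
    use autodie;

    my $out_fh;    # handle of the output file currently being written
    my @header;    # comment lines collected for the current record

    while ( <> ) {

        if ( /^#/ ) {

            push @header, $_;

            # The "# query:" line carries the ID, which names the output file
            if ( /query:\s*(\S+)/ ) {
                my $file = "$1.txt";
                print qq{creating "$file"\n};
                open $out_fh, '>', $file;
            }

            # Once all four comment lines are in, write them to the new file
            if ( @header == 4 ) {
                print $out_fh @header;
                @header = ();
            }
        }
        elsif ( $out_fh ) {
            print $out_fh $_;
        }
    }

    close $out_fh if $out_fh;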

Output

    creating "a_1.txt"
    creating "bxx_xx.txt"
    creating "p_b.txt"
    creating "1a_z0.txt"
    creating "z867.txt"
    creating "bws.txt"
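The ID extraction step can be checked in isolation. The following is only a minimal sketch: extract_id is a hypothetical helper name that does not appear in the program above, and it assumes the ID is the first run of non-whitespace characters after "# query:".

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original program): return the ID
# from a "# query:" comment line, or undef for any other line.
sub extract_id {
    my ($line) = @_;
    return $line =~ /^#\s*query:\s*(\S+)/ ? $1 : undef;
}

# Try it on a few lines taken from the sample input
for my $line ( "# ver xx\n", "# query: a_1\n", "# query: 1a_z0\n", "a_1 .*\n" ) {
    my $id = extract_id($line);
    chomp( my $shown = $line );
    printf "%-16s => %s\n", $shown, defined $id ? "$id.txt" : '(no ID)';
}
```

Because \S+ stops at the first whitespace character, an ID that itself contained spaces would be truncated; none of the sample IDs do.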

Update

Here's a new version of the code that complies with your revised specification. (Please don't do that.)

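    use strict;
    use warnings 'all';
    use autodie;

    # Hard-coded paths for testing; remove this line to pass both paths on the command line
    @ARGV = qw/ 4l.txt size_list.txt /;

    my ( $input, $size_list ) = @ARGV;

    # Load the size list into a hash keyed by ID
    my %sizes;
    {
        open my $fh, '<', $size_list;
        while ( <$fh> ) {
            my ($file, $size) = split;
            $sizes{$file} = $size if defined $size;
        }
    }

    my $out_fh;
    my @header;

    # Read only the input file; a bare <> would also read the size list from @ARGV
    open my $in_fh, '<', $input;

    while ( <$in_fh> ) {

        if ( /^#/ ) {

            push @header, $_;

            if ( /query:\s*(\S+)/ ) {

                my $id = $1;
                my $size = $sizes{$id};
                die qq{No size found for ID "$id"\n} unless defined $size;
                my $file = "$id.txt";

                print qq{creating "$file"\n};

                # The size-list line goes first, ahead of the four comment lines
                open $out_fh, '>', $file;
                print $out_fh "$id\t$size\n";
            }

            if ( @header == 4 ) {
                print $out_fh @header;
                @header = ();
            }
        }
        elsif ( $out_fh ) {
            print $out_fh $_;
        }
    }

    close $out_fh if $out_fh;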
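To check the edge case from the question (a record with only the four comment lines, such as z867), the same logic can be run on in-memory data without touching the disk. This is only a sketch under assumptions: split_records is a hypothetical helper, it collects each record into a hash instead of writing files, and it assumes every block has exactly four comment lines with the "# query:" line among them.

```perl
use strict;
use warnings;

# Hypothetical helper: split the text in $input into records keyed by ID,
# prefixing each record with its "ID<TAB>size" line taken from %$sizes.
sub split_records {
    my ( $input, $sizes ) = @_;
    my ( %records, @header, $id );

    # Open a read handle on the string itself rather than on a file
    open my $in_fh, '<', \$input or die "Cannot read in-memory input: $!";

    while ( my $line = <$in_fh> ) {
        if ( $line =~ /^#/ ) {
            push @header, $line;

            # A "# query:" line starts a new record, even if no data follows
            if ( $line =~ /query:\s*(\S+)/ ) {
                $id = $1;
                die qq{No size found for ID "$id"\n} unless defined $sizes->{$id};
                $records{$id} = "$id\t$sizes->{$id}\n";
            }

            # After the fourth comment line, append the whole header
            if ( @header == 4 ) {
                $records{$id} .= join '', @header;
                @header = ();
            }
        }
        elsif ( defined $id ) {
            $records{$id} .= $line;
        }
    }
    close $in_fh;
    return \%records;
}

# Two records from the sample input: z867 has no data lines, bws has two here
my $sample = join '',
    "# ver\n", "# query: z867\n", "# database: xxxxxx\n", "# usage: xxxxx\n",
    "# ver\n", "# query: bws\n",  "# database: xxxxxx\n", "# usage: xxxxx\n",
    "bws.*\n", "bws.*\n";

my $records = split_records( $sample, { z867 => 200, bws => 111 } );
print "--- $_.txt ---\n$records->{$_}" for sort keys %$records;
```

The z867 record still appears in the result because the record is created as soon as its "# query:" line is seen, which mirrors how the program above opens the output file at that point.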
