python - Unexpected Performance Decrease
I have to parse a huge (250 MB) text file which, for some reason, is a single line, causing every text editor I tried (Notepad++, Visual Studio, Matlab) to fail loading it. Therefore I read it piece by piece, and parse it whenever a logical line (starting with a #) is completely read:
f = open(filename, "rt")

line = ""
buffer = "blub"

while buffer != "":
    buffer = f.read(10000)
    i = buffer.find('#')
    if i != -1:  # end of line found
        line += buffer[:i]
        processline(line)
        line = buffer[i+1:]  # skip '#'
    else:  # still reading current line
        line += buffer
This works reasonably well. However, it might happen that a line is shorter than the buffer, which would cause me to skip a line. So I replaced the loop by:
while buffer != "": buffer = f.read(10000) = buffer.find('#') while != -1: pixels += 1 line += buffer[:i] buffer = buffer[i+1:] processline(line) = buffer.find('#') line += buffer
which does the trick. However, it is at least a hundred times slower, rendering it useless for reading large files. I don't really see how this can happen. I do have an inner loop, but most of the time it is only iterated once. I also copy the buffer (buffer = buffer[i+1:]), so I could somehow understand it if performance dropped to half, but I don't see how this would make it 100 times slower.
As a side note: my (logical) lines are about 27,000 bytes. Therefore, if my buffer is 10,000 bytes, I never skip lines in the first implementation, whereas with 30,000 bytes I do. This does not seem to impact performance though: even when the inner loop in the second implementation is evaluated at most once, performance is still horrible.
What is going on under the hood that I am missing?
If I understood correctly what you want to do, then both versions of your code are wrong. As @Leon said, in the second version you are missing line = "" after processline(line). And in the first version only the first line is correct: if a line is shorter than the buffer, you use the first part of the buffer in line += buffer[:i], but the problem is in the line line = buffer[i+1:]. If your line is 1,000 characters long and your buffer is 10,000 characters long, then line = buffer[i+1:] leaves about 9,000 characters in line, and those can contain more than one logical line, which then gets processed as a single line. As you wrote yourself:
"this works reasonably well, however, might happen, line shorter buffer, cause me skip line"
I think you already realised that; the reason I am writing it out in detail is that it is also the reason why the first version works faster.
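To make the fix concrete, here is a minimal sketch of the corrected second version (assuming processline is defined as in your code; the pixels counter is kept from the original). The key point is resetting line after each processed line: without the reset, line keeps growing and every concatenation copies an ever-larger string, which alone can account for a huge slowdown.

pixels = 0
line = ""
buffer = "blub"

f = open(filename, "rt")
while buffer != "":
    buffer = f.read(10000)
    i = buffer.find('#')
    while i != -1:
        pixels += 1
        line += buffer[:i]
        processline(line)
        line = ""                 # reset, so line does not grow without bound
        buffer = buffer[i+1:]
        i = buffer.find('#')
    line += buffer
f.close()
if line:
    processline(line)             # flush a final line without a trailing '#'

Using buffer.find('#', start) with a moving start index instead of slicing buffer would additionally avoid copying the buffer on every hit, but the reset is the important part.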
Having explained that, I think the best approach is to read the whole file and split the text into lines; the code would be like this:
f = open('textfile.txt', "rt")
buffer = f.read()
f.close()
l = buffer.split('#')
and then you can use it like:
for line in l:
    processline(line)
Creating the list l took me less than 2 seconds.
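If a file ever becomes too large to read at once, a chunked generator would be a possible middle ground. This is only a sketch under that assumption; iter_lines and the chunk size are illustrative names, not from the original code:

def iter_lines(filename, sep='#', chunk_size=1 << 20):
    """Yield logical lines separated by sep, reading the file chunk by chunk."""
    tail = ""
    with open(filename, "rt") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            parts = (tail + chunk).split(sep)
            tail = parts.pop()        # last piece may be cut off mid-line
            for part in parts:
                yield part
    if tail:
        yield tail                    # final line without a trailing separator

for line in iter_lines('textfile.txt'):
    processline(line)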
PS: You shouldn't have problems opening large files (like 250 MB) with Notepad; I have opened 500 MB files.