regex - In Oracle (PL/SQL), what is the most efficient way of identifying words in a regular expression? -

edit: should have been specific, need within pl/sql, more within functions/procedures within package.

i have routine process many (10s of millions) clobs looking words , phrases. routine optimized removing special characters , using space delimiter of choice. has become obvious need include @ least subset of special characters make text more readable, post processing. subset is;

()*+./

i starting rewrite routines. first bit of code able count words, , came with;

select regexp_count(regexp_replace('this (a test) of/special 20.5 some+odd characters.','(\(|\)|\*|\+|\. |\/)',' '),'[^ ]+')  dual;

i have not run benchmarks on suspect inefficient. there better way?

i need use regexp_replace, , regexp_substring in code. existing code this;

regexp_replace(regexp_substr(temp_text,'[^ ]+',1,ordinal_position),'(\.$|^\(|\)$)','') upper(regexp_substr(term,'[^ ]+',1,current_count))

notice previous code handled "." , partially handled "/".

have considered treating lexer instead of collection of string functions? might better handle things one-character-at-a-time.

building finite automata more work give more control , faster regular expressions. think more problem. example, make realize identifying "what space" not trivial problem, , many tools differently.

the regular expression approach may not work real clobs. pl/sql functions implicitly convert clobs varchar2 when possible. may work fine until test string on 32,767 bytes. conversion may throw error, truncate text, or return wrong characters.

unfortunately not trivial convert clob table of characters.

if database character set utf8, clobs use ucs. difference between utf8 , ucs irrelevant because implicit conversion between clobs , varchar2 handles perfectly. when you're forced use dbms_lob.substr may discover difficult problem - consider "one character" may "two characters" in clob.

for example, consider code:

declare   --the "cut" character in cantonese.  looks person sword.   v_4_byte_utf8 varchar2(1 char) := unistr('\d841\df79');   v_string varchar2(10 char) := v_4_byte_utf8||v_4_byte_utf8||v_4_byte_utf8;   v_clob clob := v_string; begin   dbms_output.put_line('whole string: '||v_clob);   dbms_output.put_line('first character: '||substrc(v_clob, 1, 1));   dbms_output.put_line('first character: '||dbms_lob.substr(v_clob, amount=>1, offset=>1)); end; /  whole string: 𠝹𠝹𠝹 first character: 𠝹 ora-22923: amount of data specified in streaming lob write 0

the third output generates error because first "character" has 2 ucs code units, grabbing amount of 1 doesn't make sense. likewise wouldn't want start wrong code unit.

oracle can detect problems when starting 1, or starting bad spot in string. unfortunately not throw error if end in bad spot. code has look-ahead, starting on end spot, , if there's error adjust one.

this seems madness - oracle expects know in clobs before retrieve data them? hope i'm wrong, until shows me better way, function use convert clobs characters.

create or replace type varchar2_table table of varchar2(1 char);  -------------------------------------------------------------------------------- --create nested table of characters. --this step takes care of non-trivial unicode processing front. --this cannot simplified substrc, not work large clobs. --todo: there easier way this? create or replace function get_varchar2_table_from_clob(p_clob clob) return varchar2_table     v_varchar2 varchar2(32767 byte);      v_chars varchar2_table := varchar2_table();      v_offset_not_on_char_boundary exception;     pragma exception_init(v_offset_not_on_char_boundary, -22831);      v_next_char_boundary number := 1;     v_amount_to_read constant number := 8000; begin     --return empty collection there's nothing.     if p_clob null         return v_chars;     --convert clob varchar2 easy way if it's small enough.     elsif dbms_lob.getlength(p_clob) <= 8191         v_varchar2 := p_clob;         in 1 .. lengthc(v_varchar2) loop             v_chars.extend();             v_chars(v_chars.count) := substrc(v_varchar2, i, 1);         end loop;     --convert clob varchar2 hard way if it's large.     else         --convert multiple characters clob varchar2 @ once.         --this tricky because clobs use ucs , varchars use utf8.         --some single-characters in varchar2 use 2 ucs code points.         --they can treated 2 separate characters must selected together.         --oracle not throw error if substr reads half character @ end.         --but error if starts @ bad character.         --the code below finds valid character boundary first, , reads it.         in 1 .. ceil(dbms_lob.getlength(p_clob)/v_amount_to_read) loop             begin                 --check if next boundary ok trying read small amount.                 --todo: checking 2 bytes expensive retrieving data.  pre-fetch , use later if valid?                 v_varchar2 := dbms_lob.substr(lob_loc => p_clob, offset => v_next_char_boundary + v_amount_to_read, amount => 2);                  --if it's ok, grab data , increment character boundary.                 v_varchar2 := dbms_lob.substr(lob_loc => p_clob, offset => v_next_char_boundary, amount => v_amount_to_read);                 v_next_char_boundary := v_next_char_boundary + v_amount_to_read;              --if wasn't successful, grab 1 less character , set character boundary 1 less.             exception when v_offset_not_on_char_boundary                 v_varchar2 := dbms_lob.substr(lob_loc => p_clob, offset => v_next_char_boundary, amount => v_amount_to_read - 1);                 v_next_char_boundary := v_next_char_boundary + v_amount_to_read - 1;             end;              --loop through varchar2 , convert array.             in 1 .. lengthc(v_varchar2) loop                 v_chars.extend();                 v_chars(v_chars.count) := substrc(v_varchar2, i, 1);             end loop;         end loop;     end if;      return v_chars;  end get_varchar2_table_from_clob; /

things become more sane when can deal things one-character-at-a-time.

that function open source project plsql_lexer, might helpful if want create lexer/scanner instead of regular expressions.

Search This Blog

Ant COmde

regex - In Oracle (PL/SQL), what is the most efficient way of identifying words in a regular expression? -

Comments

Post a Comment

Popular posts from this blog

sql - invalid in the select list because it is not contained in either an aggregate function -

Angularjs unit testing - ng-disabled not working when adding text to textarea -

How to start daemon on android by adb -