Instrumentation … not just for debugging

Yeah I know.  You’re probably thinking “Here’s another blog post from someone telling us how important it is to instrument our code, so we can get better debugging, better performance, blah blah blah”.

If that is your sentiment, then I’m thrilled because it means the instrumentation message has got through to the extent that it is now considered just a statement of the obvious.  As many will know, my predecessor was a huge advocate of extensive instrumentation in your application code.  And there is no shortage of examples out there of other Oracle professionals who similarly wax lyrical on the incredible importance of well instrumented code.  For the record, I’m in that camp as well, and we now even have contributions to the Oracle community for everyone to have a robust instrumentation framework in their code.  So there is plenty of information out there, and it is pretty safe to say that there are no more excuses left for un-instrumented or poorly instrumented code.

So why the blog post ?  Well…I’ve got a slightly different assertion to make when it comes to instrumentation, namely, instrumentation can help your user interface.  You will be able to provide a better experience to the users of your application by having your code instrumented.  Now you might be thinking that I’m just banging the same drum here, that is, by instrumenting I’ll be fixing bugs faster, or picking up performance issues more easily, or collecting metrics to help improve application quality.  These are of course true benefits of instrumentation, but we all know that right ?

No…my motivation for this blog post stems from an activity on one of my laptop’s last week.

I went to upgrade it to Windows 10.  I fired off the process and after a short while, the following window appeared

image

and the familiar “something is happening” feedback swirling dots…

image

I have several laptops at home, so there was no need to sit and watch this one, so I left it and worked on another tackling some AskTom questions.  The laptop I upgrading just sits on the shelf, and occasionally plays Netflix for my children.  Hence, naturally enough… I totally forgot about it!

That was Saturday.  Now on Monday… 48 hours later, I’ve just been over to and waved the mouse to restore the display.  And this is what is on the screen:

image

My oh my …. there must be a lot of things to get ready Smile

The obvious question now is – has anything actually happened ?  Has the program hung ?  Or is it downloading a huge Windows 10 OS bundle ?  As a consumer of this “application” (the upgrade process) I have no idea.  There is nothing that tells me what state the application is in.  Now let me give the Microsoft developers the benefit of the doubt here, and make the assumption that the upgrade is extensively instrumented, and that if I looked in a logging file somewhere on the laptop hard drive, there would be a capture of the current state which would tell me whether any true progress is being made.  And that is typically how most us of implement instrumentation – a means to capture information for diagnosis.

But to do so is to miss an opportunity to directly assist the user.  Even if the content of the instrumentation would appear to be gibberish to a typical Windows consumer, just the presence of that detail on the user interface  becomes an improvement to the user experience.  If my upgrade dialog box had looked something like this:

image

then even if I have no idea what “getting platform details” means, the fact that I was on Step 17 at 9:00:12 helps me as a consumer, because I have immediate feedback as to whether progress is being made or not.  If I see that message at 9:05am, then I’m likely to let it sit a little longer to see if a “Step 18” ever eventuates.  But if I’m seeing that message at 5pm, then I can be reasonably sure that all is not well.

And no, I don’t think a swirling icon is sufficient.  We all used to criticize the “Progress Bars” on older versions of Windows and the like, when it would either exceed 100% or never get there, or jump back and forth.  So I can understand the motivation for the swirly icons we see in all modern devices and operating systems, because they avoid the task of giving any true prediction of completion.  And don’t get me wrong, I’m not saying that a return to progress bars is the solution, because that would be understating the difficultly of being able to know how long a task will take.  But simply taking advantage of the instrumentation code that I know you already have Smile, and presenting that in some way to the consumers of your application can be a better user experience for them.

Keep that in mind.

SUM is better than DISTINCT

There is a good chance that (based on this blog post title) that you’re expecting a post on SQL, and that’s understandable. But I’ll come clean nice and early – that was just to lure you in Smile

The post is about SUM and DISTINCT, but not in the technical sense.

A few days ago, fellow OakTable member Jonathan Lewis put a post on his blog: https://jonathanlewis.wordpress.com/2017/04/10/ask-jonathan/ where he is launching a mechanism where you can pose questions to him, and he will select topics of interest and write about them in the UKOUG Oracle Scene magazine.  What I found hilarious was that people started emailing me saying “Are you concerned about AskTOM?”.  Now, last time I checked, we don’t have Google Ads or Facebook Ads on AskTOM, so it is not as if we have some fiscal need to keep the website hits high, increase our “social engagement”, or whatever the latest buzzword is for that sort of thing.  If it was – you see lots of cat videos and references to Justin Bieber in our answers Smile

AskTOM does one thing…and one thing only – we try to help out the Oracle community, to make them more successful.  It’s what Tom did in the past, and it’s what the team do now.  That’s the same reason why our answers will often refer people to links / blog posts / tutorials outside of the oracle.com domain, for example, to the excellent work of some of the Oracle Aces and Ace Directors.  It’s about getting good information and getting the job done.  Similarly, when I give talks on AskTom to user groups, one of the things I encourage people to do is form their own “AskMe” concept within their own IT shops – get people asking questions, and discussing solutions at the department level, the organizational level, and ultimately the global community level.  Then we all benefit – we all win.

So rather than being “concerned”, I’m thrilled by anyone that wants to put their hand up and say “Hey, I’m here to help out the community”.  I encourage any of us who are experienced practitioners in the Oracle community to do the same.

The SUM of contributors to the Oracle community will be much better for that community than just a few DISTINCT people !  … and hence the title of the post.

SQL statements using literals

16 years ago, someone “Ask-ed Tom” how to find those SQL statements that were not using bind variables.   You can see the question here (because we don’t delete stuff ever Smile) but I’ll paraphrase the answer below:

Tom took the following approach

  • take a copy of SQL statements in the library cache
  • create a routine that would hunt for constants in the SQL text (that is, numbers and anything within quotes) and replace them with place holders
  • then count the resultant SQL’s for duplicates

create table t1 as select sql_text from v$sqlarea;

alter table t1 add sql_text_wo_constants varchar2(1000);

create or replace function 
remove_constants( p_query in varchar2 ) return varchar2
as
    l_query long;
    l_char  varchar2(1);
    l_in_quotes boolean default FALSE;
begin
    for i in 1 .. length( p_query )
    loop
        l_char := substr(p_query,i,1);
        if ( l_char = '''' and l_in_quotes )
        then
            l_in_quotes := FALSE;
        elsif ( l_char = '''' and NOT l_in_quotes )
        then
            l_in_quotes := TRUE;
            l_query := l_query || '''#';
        end if;
        if ( NOT l_in_quotes ) then
            l_query := l_query || l_char;
        end if;
    end loop;
    l_query := translate( l_query, '0123456789', '@@@@@@@@@@' );
    for i in 0 .. 8 loop
        l_query := replace( l_query, lpad('@',10-i,'@'), '@' );
        l_query := replace( l_query, lpad(' ',10-i,' '), ' ' );
    end loop;
    return upper(l_query);
end;
/
update t1 set sql_text_wo_constants = remove_constants(sql_text);

select sql_text_wo_constants, count(*)
  from t1
 group by sql_text_wo_constants
having count(*) > 100
 order by 2
/

It’s a utility I used many times over the years.  But times change and it’s always good to reflect on the tools and techniques you use, and whether anything has come along in later versions to assist you.  Consider the second bullet point above:

create a routine that would hunt for constants and replace them with place holders

That sounds a lot like the process that must be followed when “cursor_sharing” is enabled (and set to “force”), the only difference being the place holders would be come bind variables.  With that in mind, even if we are not using cursor_sharing, then the information that got added to the library cache when cursor_sharing was introduced all those years ago can now be used to assist us.

In V$SQLSTATS, there is a column FORCE_MATCHING_SIGNATURE which is a “signature” (or hash value) representing a SQL statement that has been converted to allow for cursor_sharing = force.  So that becomes the perfect grouping mechanism to identify repeated statements using literals.  For example:



SQL> select force_matching_signature, count(*)
  2  from  v$sqlstats
  3  where force_matching_signature > 0
  4  group by force_matching_signature
  5  having count(*) > 10
  6  order by 2 desc;

 FORCE_MATCHING_SIGNATURE   COUNT(*)
------------------------- ----------
      7756258419218828704         73
     15993438058742417605         16
     15893216616221909352         15
     14052954841849993177         12
     10493170372233636846         11

5 rows selected.

SQL> select sql_text from v$sqlstats
  2  where FORCE_MATCHING_SIGNATURE = 7756258419218828704;

SQL_TEXT
----------------------------------------------------------------
select count(FA#) from SYS_FBA_TRACKEDTABLES where OBJ# = 148525
 and bitand(FLAGS, 128)=0

select count(FA#) from SYS_FBA_TRACKEDTABLES where OBJ# = 8448 
and bitand(FLAGS, 128)=0

select count(FA#) from SYS_FBA_TRACKEDTABLES where OBJ# = 6309 
and bitand(FLAGS, 128)=0

select count(FA#) from SYS_FBA_TRACKEDTABLES where OBJ# = 148524
 and bitand(FLAGS, 128)=0

select count(FA#) from SYS_FBA_TRACKEDTABLES where OBJ# = 8442 
and bitand(FLAGS, 128)=0

select count(FA#) from SYS_FBA_TRACKEDTABLES where OBJ# = 8594 
and bitand(FLAGS, 128)=0

select count(FA#) from SYS_FBA_TRACKEDTABLES where OBJ# = 148693
 and bitand(FLAGS, 128)=0

select count(FA#) from SYS_FBA_TRACKEDTABLES where OBJ# = 8602 
and bitand(FLAGS, 128)=0

...

Identification of literal SQL has become trivial. Fixing it ? Well…that’s another story Smile

MERGE vs UPDATE/INSERT revisited

I wrote a few years back that for single row operations, MERGE might in fact have a large overhead than the do-it-yourself approach (ie, attempt an update, if it fails, then do an insert).

Just to show that it’s always good to revisit things as versions change, here’s the same demo (scaled up now because my laptop is faster Smile)

As you can see, there is still a little difference between between the two operations.  But even so, unless you need that absolute last little percentage of performance and you know the data distribution (and hence probability of updates versus inserts extremely well, in particular, you’re expecting near always updates) perhaps MERGE is the way to go, in that it is more accurately reflects the operation being performed, and is hence closer to the concept of “self-documenting” code.




SQL> drop table t1 purge;

Table dropped.

SQL>
SQL> create table t1
  2   ( x int primary key,
  3     y int );

Table created.

SQL> -- all inserts
SQL>
SQL> set timing on
SQL> begin
  2  for i in 1 .. 500000 loop
  3    merge into t1
  4    using ( select i x, i y
  5            from dual ) m
  6    on ( t1.x = m.x )
  7    when matched then update
  8    set t1.y = m.y
  9    when not matched then
 10    insert values (m.x, m.y );
 11  end loop;
 12  end;
 13  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:19.99

SQL> -- 50/50 updates and inserts
SQL
SQL> set timing on
SQL> begin
  2  for i in 250000 .. 750000 loop
  3    merge into t1
  4    using ( select i x, i y
  5            from dual ) m
  6    on ( t1.x = m.x )
  7    when matched then update
  8    set t1.y = m.y
  9    when not matched then
 10    insert values (m.x, m.y );
 11  end loop;
 12  end;
 13  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:19.19
SQL> -- all updates
SQL>
SQL> set timing on
SQL> begin
  2  for i in 250000 .. 750000 loop
  3    merge into t1
  4    using ( select i x, i+1 y
  5            from dual ) m
  6    on ( t1.x = m.x )
  7    when matched then update
  8    set t1.y = m.y
  9    when not matched then
 10    insert values (m.x, m.y );
 11  end loop;
 12  end;
 13  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:17.48
SQL>
SQL> drop table t1 purge;

Table dropped.

Elapsed: 00:00:04.94
SQL>
SQL> create table t1
  2   ( x int primary key,
  3     y int );

Table created.

Elapsed: 00:00:00.01
SQL>
SQL> set timing on
SQL> begin
  2  for i in 1 .. 500000 loop
  3    update t1 set y = i where x = i;
  4    if sql%notfound then insert into t1 values (i,i); end if;
  5  end loop;
  6  end;
  7  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:22.79
SQL>
SQL> set timing on
SQL> begin
  2  for i in 250000 .. 750000 loop
  3    update t1 set y = i where x = i;
  4    if sql%notfound then insert into t1 values (i,i); end if;
  5  end loop;
  6  end;
  7  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:18.08
SQL>
SQL> set timing on
SQL> begin
  2  for i in 250000 .. 750000 loop
  3    update t1 set y = i+1 where x = i;
  4    if sql%notfound then insert into t1 values (i,i+1); end if;
  5  end loop;
  6  end;
  7  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:13.88
SQL>
SQL>
SQL>

Active and Inactive Sessions

Most people are aware of the STATUS column in V$SESSION. If it’s ‘ACTIVE’ then that connection is in the process of consuming database resources (running a SQL statement etc).

However, a lesser known column which is probably even more useful is the LAST_CALL_ET column. The name suggests the time since the last call, but the documentation provides a more useful description:

If the session STATUS is currently ACTIVE, then the value represents the elapsed time (in seconds) since the session has become active.

If the session STATUS is currently INACTIVE, then the value represents the elapsed time (in seconds) since the session has become inactive.

 

Thus this is a great column to get a view on what sessions are doing on your system, for example


select s.sid||','||s.serial# sess, 
       s.USERNAME,
       s.last_call_et, 
       s.status, 
       s.sql_address, 
       s.program
from v$session s
where ( s.status = 'ACTIVE' and s.last_call_et > 10 ) or      -- has been active for 10 seconds or more
      ( s.status != 'ACTIVE' and s.last_call_et > 1200 );     -- has been inactive for 20 mins or more

The first matching row

I was recently asked when presented with a query along the lines:


SQL> select * from
  2  ( select x
  3    from   t1
  4    where  x = :b1
  5    union all
  6    select x
  7    from   t2
  8    where  x = :b1
  9  )
 10  where rownum = 1
 11  /

and asked – will the “ROWNUM” lead Oracle to be smart enough to stop after the “top” half of the UNION ALL if it gets row back ?

Let’s find out. First we create tablespaces for tables to be stored, and plonk a table in each.


SQL> create tablespace TS1 datafile 'C:\ORACLE\ORADATA\NP12\T1.DBF' size 10m;

Tablespace created.

SQL> create tablespace TS2 datafile 'C:\ORACLE\ORADATA\NP12\T2.DBF' size 10m;

Tablespace created.

SQL> create table T1 ( x number) tablespace ts1;

Table created.

SQL> insert /*+ APPEND */ into T1
  2  select rownum from dual connect by level < 10;

9 rows created.

SQL> create table T2 ( x number) tablespace ts2;

Table created.

SQL> insert /*+ APPEND */ into T2
  2  select rownum from dual connect by level < 10;

9 rows created.

Now the big question is, if I am doing a “find the first row and exit” style of query, will Oracle be smart enough to stop before it hits table T2 ?


SQL> alter tablespace ts2 offline;

Tablespace altered.

SQL> select * from t2;
select * from t2
              *
ERROR at line 1:
ORA-00376: file 9 cannot be read at this time
ORA-01110: data file 9: 'C:\ORACLE\ORADATA\NP12\T2.DBF'


SQL> variable b1 number
SQL> exec :b1 := 1

PL/SQL procedure successfully completed.

SQL> select * from
  2  ( select x
  3    from   t1
  4    where  x = :b1
  5    union all
  6    select x
  7    from   t2
  8    where  x = :b1
  9  )
 10  where rownum = 1;

         X
----------
         1

Yes it does! And just to reinforce it, I’ll reset “b1” to zero, so it will not find anything in T1 and hence will need to move on to T2


SQL> variable b1 number
SQL> exec :b1 := 0

PL/SQL procedure successfully completed.

SQL> select * from
  2  ( select x
  3    from   t1
  4    where  x = :b1
  5    union all
  6    select x
  7    from   t2
  8    where  x = :b1
  9  )
 10  where rownum = 1;
  from   t2
         *
ERROR at line 7:
ORA-00376: file 9 cannot be read at this time
ORA-01110: data file 9: 'C:\ORACLE\ORADATA\NP12\T2.DBF'


SQL>

Nice to know also that even if we need to do dynamic sampling, we’re still not going to get in trouble. I’ll repeat the tests without the APPEND hint, so the tables do not pick up auto statistics.


SQL> create tablespace TS1 datafile 'C:\ORACLE\ORADATA\NP12\T1.DBF' size 10m;

Tablespace created.

SQL> create tablespace TS2 datafile 'C:\ORACLE\ORADATA\NP12\T2.DBF' size 10m;

Tablespace created.

SQL> create table T1 ( x number) tablespace ts1;

Table created.

SQL> insert into T1
  2  select rownum from dual connect by level < 10;

9 rows created.

SQL> create table T2 ( x number) tablespace ts2;

Table created.

SQL> insert into T2
  2  select rownum from dual connect by level < 10;

9 rows created.

SQL> alter tablespace ts2 offline;

Tablespace altered.

SQL> select * from t2;
select * from t2
              *
ERROR at line 1:
ORA-00376: file 9 cannot be read at this time
ORA-01110: data file 9: 'C:\ORACLE\ORADATA\NP12\T2.DBF'


SQL> variable b1 number
SQL> exec :b1 := 1

PL/SQL procedure successfully completed.

SQL> select * from
  2  ( select x
  3    from   t1
  4    where  x = :b1
  5    union all
  6    select x
  7    from   t2
  8    where  x = :b1
  9  )
 10  where rownum = 1;

         X
----------
         1

SQL> variable b1 number
SQL> exec :b1 := 0

PL/SQL procedure successfully completed.

SQL> select * from
  2  ( select x
  3    from   t1
  4    where  x = :b1
  5    union all
  6    select x
  7    from   t2
  8    where  x = :b1
  9  )
 10  where rownum = 1;
  from   t2
         *
ERROR at line 7:
ORA-00376: file 9 cannot be read at this time
ORA-01110: data file 9: 'C:\ORACLE\ORADATA\NP12\T2.DBF'


SQL>
SQL> alter tablespace ts2 online;

Tablespace altered.

SQL> set autotrace on
SQL> variable b1 number
SQL> exec :b1 := 0

PL/SQL procedure successfully completed.

SQL> select * from
  2  ( select x
  3    from   t1
  4    where  x = :b1
  5    union all
  6    select x
  7    from   t2
  8    where  x = :b1
  9  )
 10  where rownum = 1;

no rows selected


Execution Plan
----------------------------------------------------------
Plan hash value: 957534384

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |     1 |    13 |     2   (0)| 00:00:01 |
|*  1 |  COUNT STOPKEY       |      |       |       |            |          |
|   2 |   VIEW               |      |     1 |    13 |     2   (0)| 00:00:01 |
|   3 |    UNION-ALL         |      |       |       |            |          |
|*  4 |     TABLE ACCESS FULL| T1   |     1 |    13 |     3   (0)| 00:00:01 |
|*  5 |     TABLE ACCESS FULL| T2   |     1 |    13 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM=1)
   4 - filter("X"=TO_NUMBER(:B1))
   5 - filter("X"=TO_NUMBER(:B1))

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)

Forever is a long time

It’s interesting that we still see lots of code out “in the wild” that leans itself to taking forever. Now I’m not talking here about poorly performing SQL etc…because (at least theoretically) no matter how bad the SQL is, presumably it will finish one day, even if that day is years into the future. What I’m referring to here is the default mechanism for locking in Oracle.

If you try access a row that is locked by someone else, you will wait. Don’t get me wrong … that’s a very good thing. The alternative – of letting two people change the same data at the same time is equivalent to saying “I dont care about my data at all”.

And how long will we wait ? Well… we’ll wait forever. We will never stop waiting, until that lock has been released.

I don’t know about you … but for me… forever is long time. A really long time. I cannot recall ever seeing any kind of Service Level Agreement for an application’s response times as stating “Yup…forever is cool. Take as long as you want” Smile

So perhaps consider that when you’re building your applications in Oracle. Yes, the default is to wait forever, but a little code (which has been available since version 9) can really go a long way.

Rather than just coding:

select * from T where …for update

or

delete from T where …

You can opt to wait for a nominated amount of time to attempt to get that lock (and then report something sensible back to the user).


SQL> select * from T for update wait 60;

(60 seconds pass)

select * from T for update wait 60
*
ERROR at line 1:
ORA-30006: resource busy; acquire with WAIT timeout expired

Note that the error code is not the same as the error you could get with a NOWAIT clause, which is:

ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired

But you don’t even need to stop there. Why not catch that error with a SERVERERROR trigger – and then do some session level probing to get more information. You could even empower the users to take their own action 🙂


SQL> create or replace trigger lock_expiry
  2  after servererror on database
  3  declare
  4    l_err varchar2(4000);
  5  begin
  6    if ( is_servererror(30006) ) then
  7      for i in (
  8      select /*+ leading(lk sess) */ sess.sid, sess.username, sess.module
  9      from   v$lock lk,
 10             v$session sess,
 11             v$session s
 12      where  lk.id1 = s.row_wait_obj#
 13      and    lk.type = 'TM'
 14      and    lk.sid = sess.sid
 15      and    s.sid = sys_context('USERENV','SID')
 16      )
 17      loop
 18        l_err := l_err || 'Session '||i.sid||' who is '||i.username||' using '||i.module || chr(10);
 19      end loop;
 20      raise_application_error(-20000,l_err);
 21    end if;
 22  end;
 23  /

Trigger created.

SQL> select * from t for update wait 60;
*
ERROR at line 1:
ORA-00604: error occurred at recursive SQL level 1
ORA-20000: Session 717 who is MCDONAC using SQL*Plus
ORA-06512: at line 18
ORA-30006: resource busy; acquire with WAIT timeout expired

Just for completeness, note that if you having locking issue, V$SESSION gives you direct access to find out who is blocking you.


SQL> select sid, last_call_et,
  2         nvl2(lockwait,'BLOCKED',status) status,
  3         blocking_session
  4  from   v$session s;

       SID LAST_CALL_ET STATUS     BLOCKING_SESSION 
---------- ------------ ---------- ---------------- 
        39          376 INACTIVE                    
        40          412 INACTIVE
        41          412 INACTIVE
        44          421 INACTIVE
        46          340 BLOCKED                  39 
        49            4 ACTIVE
        50          453 INACTIVE
        51          453 INACTIVE