Business was complaining that runtime of a periodic SAP background job was suddenly jumped in production environment. There is no code change and no volume change and other jobs and transactions are running well. They were asking why this happened. I was consulted and looked into the situation. Finding the issue was related to Oracle execution plan change – Oracle chose a suboptimal execution plan. The plan changes was triggered by a regular table statistics update. Job run time was back to normal after plan was switched to better one following restoring previous table statistics. This blog would focus on how I trouble this particular SAP job performance issue. It would talk:
- The performance issue – Job is running up to 10+ times longer than it used to be,
- The performance trouble-shooting – why this job is long running?
- Going deeper – Why is the execution plan not efficient?
- Fix performance issue – action and result,
- What we learn from this case? and
- Further clarification.
Please continue read should you be interested in details.
1 The performance issue – Job is running up to 10+ times longer than it used to be
A job which runs every 15 minutes used to take less than 300 seconds finish was taking at least 600 seconds and up to 2,300 seconds to complete. Please refer to following screen shot to get more understanding on our issue.
Figure 1 SM37 job log – runtime history
Figure 1 shows that job run time had a jump since 00:15:02 on Apr 24 2015. There is no code change, no volume change and system has no resource contention… Why would the job run time have such a huge jump?
2 The performance trouble-shooting – why is this job long running now?
Here, performance trace was done on the job to find out where the time is spent by the program. The performance trace shows that job spent most of time on executing one select-SQL statement, and SQL summary for the trace indicates that number of records returned by the SQL is not high, this is unlikely a volume issue, so I examined the SQL execution plan – identify the hash join used by Oracle might be a bad choice, and checked SQL execution plan history and found that the timing of execution plan change is collated to timing when job started to have performance issue. Further I found that table statistics of one of the underlying table was updated immediately before the moment when the execution plan was changed. With that, I concluded that job runtime was due to table statistics update which led to reparse the SQL and Oracle changed SQL execution plan and used a suboptimal execution plan.
2.1 Where is time spent by the long running job?
Performance trace done showed that job was spending 99.9% of run time on database operations as below.
Figure 2 ST12 ABAP trace – why would job runtime jump
99.6% of time is on single SQL statement “Select VTTP” based on above trace. So this SQL is responsible for job long run time. Why was execution of this SQL taking so long ? ( Tried 3 times, can not upload my figure 2…)
2.2 Why is this SQL execution running long?
Is this due to inefficient index or due to a lot of database entries are retrieved? We need to check that with SQL trace. Following is SQL summary from SQL trace.
Figure 3 SQL summary – why would job runtime jump
From above trace, we know number of records fetched from the database by the top expensive SQL is “0”. So it is not due to a lot of records are fetched and the SQL is executed once. Is it due to wrong index used etc.?
Now let’s review the execution plan…
Figure 4 sql execution plan – why would job runtime jump
The index used looks ok in Figure 4…but when I compared number of records (0) returned by the SQL in Figure 3 and number of records(7,971) projected by Oracle during plan generation, I smell something wrong… This looks like an Oracle execution plan issue. Since this job was working well, this must be due to recent Oracle execution plan changes of this particular SQL. Is that true?
2.3 Was the SQL long running due to oracle execution plan changes?
Using Program /sdf/RSORADLD_NEW( there are SQL Scripts available for this purpose), I got SQL plan history as Figure 5 (right side).
Figure 5 TIming of job runtime change and timing of execution plan changes
Figure 5 shows that timing when execution plan was refreshed and timing when the job started to run long are closely related. It could be that Oracle reparsed the SQL statement and refreshed the sql execution prior to the sql was executed. But Figure 5 does not tell what the execution plan was used to execute the job prior to Apr 24 2015.
Following load history data shows that the execution of this SQL were captured on Apr 24
Figure 6 SQL load history – why would job run time jump
The job is executed every 15 minutes and every day. Yet the load history of this particular SQL was only captured on Apr 24. In our SAP system, we follow normal approach – only top expensive SQLs are captured into the snap-shot. So this means that this particular SQL was NOT one of top expensive SQLs prior to Apr 24.
2.4 Why would Oracle plan change execution plan?
Following screen shows that table VTTK statistics was updated at 00:08 on Apr 24, 2015. That timing is exactly linked to the timing when the execution plan was changed.
Figure 7 table statistics update – why would job runtime jump
There are other reasons which can lead to refreshing of Oracle execution plan. But this case, the execution plan refreshing is a result of Oracle table statistics update. And refreshing of Oracle execution plan led to a suboptimal plan in this case – leading to job runtime jump.
2.5 Conclusion on why job is long running
Job was running long after 00:15 on Apr 24 2015 was due to the fact that job is spending significant more time on executing one Select-SQL statement. Execution plan of this particular SQL was changed/refreshed at 00:16:53 in the same day. VTTK table statistics was updated a few minutes earlier at 00:08 in the same day, Table statistics update led to reparse related SQL statement – that is a typical Oracle response. New execution plan as a result of statistics update is not efficient as the original one. That made job run longer.
So why is the new plan not good and what is the better plan? I would try to take a closer look on the execution plan.
3 Going deeper – why is the execution plan not efficient?
In this section, I use sample values for the binding variables, number of table entries and st04 SQL cache data to validate the SQL execution plan which is in question.
In previous section, it is noticed that hash plan is used in several places. Hash operation is normally for data operation on two dataset when a lot of records are returned. Let’s take a peek on Oracle binding variable used in this SQL
Figure 8 Value for Oracle binding variable – why would job runtime jump
A4 is field vttk-shtyp and A5 is vttk-fbgst, following are total entries of VTTK and entry which is meeting selection criteria (Figure 7). Review table definition, FBGST is a field to store overall status of shipment and shipment stage. Value “A” for that field stands for “not processed” status. So the job looks like only interested in “new” shipment. Based on load history and SQL cache, we know that execution of this SQL returned “0” records most of times.
Figure 9 SQL cache – why would job runtime jump
Following screen shots indicates number of records from VTTK table which is meeting the selection criteria..
Figure 10 table size and number of records – why would job runtime jump
In the current execution plan, since hash operation is used, Oracle would prepare two dataset first prior to apply hash operation. When one data set has no or small number of records and the other data set can be read via efficient index, nested-loop operation is more efficient than hash join. So hash join operation at Step 7 in Figure 7 is not efficient as nested-loop. I did check the execution plan of this SQL in our testing environment. The plan was different from our production box. I put both plan side by side as Figure 11.
The plan in the testing box is better, when no records is returned from hash operation using VTTK and VTTP, the reading on lips would not executed via Nested loops operation.. so reading LIPS is spared.
Figure 11 SQL execution plan comparison – why would job runtime jump
Execution plan major difference is highlighted between test box and production box. It is mainly at step 4 where testing box is using nested loop while production box is using hash join. Due to hash join, production has to prepare datasets from LIPS and VTTP tables via another hash operation which is expensive based on selection criteria used and table size. While in the nested loop, step 9 and would be spared when there is no records. You might agree with me, If oracle replaces all Hash operation with nested-loop here, the performance could be even better.
4 Fix the performance issue – action and result
The table statistics was restored to the version prior to the problematic updating… after that, the execution plan in our production box is changed to the plan we are seeing in our testing box. The job performance is back to normalJ
Figure 12 Job runtime back to normal with restored execution plan
Figure 13 plan history after restoring statistics
Plan history shows that restoring statistics changed production plan to the same as what we are seeing in test box.
5 What do we learn from this case?
An unexpected SQL execution plan change could be very likely the reason that a job runtime jumps especially when following conditions meet
- Other jobs or programs are running as normal.
- Job spends most of time on database side or database time has a significant increase – STAD or ST03N,
- Most of database time spent by the job is due to database read operation – STAD
- There is no code change or job variant or volume change.
You can validate whether there is a plan change by checking plan history and
- Check whether timing of plan change matches the moment when job runtime starts to jump
Oracle regenerate SQL execution plan when underling table statistics is updated. Table statistics updating is the typical reason which leads to an execution plan refreshing. You can validate the timing of statistics update and timing of execution plan change. SAP application performance issue related to table statistics change can be fixed by playing with table statistics to influence SQL execution plan.
6 Further clarification
Oracle SQL plan management can prevent unexpected plan changes to avoid SAP application performance issue due to accidental execution plan change. Many factors can influence a job run time like inappropriate code change, lock contention with other jobs etc.
In this particular case, a performance trace was done to understand the performance issue. But actually, there is no need to do performance trace to fix the performance issue. SAP tool STAD, ST04 and tools based on Oracle AWR should be enough to trouble-shoot this type of performance issue. I am writing another post on how to use SAP STAD. In that post, I would give some insight on how to get this doneJ
It would be nice for us to know the execution plan of this particular SQL used to execute the job before Apr 24 2015. Apparently this was not possible in our system. ST04 and Oracle AWR tools in our SAP system does not show us complete plan history. Oracle plan history table does not contain data to show us what the plan was before the issue. What could be the reasons? At this moment, I am guessing this might be related to our setting on how many SQL should be captured into AWR. Not quite sure of that.