Category Archives: Case study

Posts in this category share how to use one or several performance tools to troubleshoot a real performance issue.

Many SAP work processes were left in “on hold” status with RFC info

I was contacted about a production issue: almost all SAP background work processes were shown in “on hold” status with RFC information in the SM50 work process monitor, and they seemed not to be moving. I was asked why. Please continue reading to understand the reason, which I identified through analysis via transactions SM50, STAD and RZ12. This post covers:

  • The issue – SAP work processes in “on hold” status with RFC info,
  • Troubleshooting of SAP work processes in “on hold” status,
  • Conclusion,
  • Impact of SAP work processes on hold and
  • Further clarification.

1. The issue

The SAP work process monitor showed that the majority of SAP background work processes were in “on hold” status with the status reason “RFC response”, as shown in Figure 1 below.

Figure 1 SAP SM50 – WORK PROCESS IN “ON HOLD” STATUS WITH “RFC”

If you need help on SAP SM50, you can refer to my post on how to run SAP SM50/SM66.

2. Troubleshooting

SM50 in Figure 1 shows that the BTC work processes on hold were waiting for an “RFC response”. Figure 1 also tells us that all background work processes were busy, which should be the result of long-running background jobs held up by RFC call performance. Were those RFC calls going to an external system, or staying within this system? Was the RFC resource shortage local or remote?

2.1 The system had no free DIA work processes for handling RFC calls

So I checked whether the system had free DIA work processes for handling RFC calls via SAP transaction SPBT. You can check this via SAP SMQS as well.

Figure 2 SAP SPBT – RFC resource availabilities check

Figure 2 shows that the system allows up to 25 DIA work processes for RFC processing, but at that moment there was no free DIA work process for handling any new RFC call, because “ALL PBT resources are currently busy”. So you might wonder who was exhausting the system's RFC resources and what could be done to fix or mitigate the situation.

SAP transaction SPBT is mentioned in the section on how to validate RFC configuration in my post on how to configure an RFC server group via RZ12.
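To make the resource mechanics concrete, below is a minimal sketch of the kind of parallel RFC call that draws on these RZ12 resources. The function module, task name and variables are assumptions, not taken from this case, and “parallel_generators” is just a commonly used RFC server group name. Each such call occupies one DIA work process in the group until it returns; when none is free, the caller hits exactly the “all PBT resources busy” condition shown above.

* Hedged sketch - Z_GET_ORDER_DATA, the task name and the group name
* are illustrative assumptions.
DATA: lv_task  TYPE c LENGTH 8 VALUE 'TASK0001',
      lv_vbeln TYPE c LENGTH 10,
      lv_msg   TYPE c LENGTH 80.

CALL FUNCTION 'Z_GET_ORDER_DATA'
  STARTING NEW TASK lv_task
  DESTINATION IN GROUP 'parallel_generators'  " RFC server group from RZ12
  PERFORMING on_task_done ON END OF TASK
  EXPORTING
    iv_vbeln              = lv_vbeln
  EXCEPTIONS
    communication_failure = 1 MESSAGE lv_msg
    system_failure        = 2 MESSAGE lv_msg
    resource_failure      = 3.  " raised when no DIA WP is free in the group

FORM on_task_done USING p_task TYPE clike.
  " Collect the result here, e.g. RECEIVE RESULTS FROM FUNCTION
  " 'Z_GET_ORDER_DATA'. Until then the remote DIA WP is occupied.
ENDFORM.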

2.2 Who was exhausting the system's RFC resources?

Who was exhausting the system's RFC resources? I checked this via the SAP work process monitor SM50. Following is the result:

Figure 3 SAP SM50 – DIA WP consumption

Figure 3 tells us that most DIA work processes were occupied by one single user – not a normal online user. Then you might wonder about the origin of those RFC calls.

2.3 Where did the RFC calls originate?

I used STAD to find out which SAP job/program was firing those RFC calls – in other words, where those RFC calls were coming from.

Figure 4 STAD – RFC calls

You can normally find the origin/parent of an RFC via SAP transaction STAD: display the detailed info and then check the “Client” information. In this case, we found that the RFCs shown in Figure 4 were coming from another SAP system, where many work processes were running under a single online SAP user which needed information from the system I was looking into.

3. Conclusion

The reason the SAP work processes were in “stopped” or “on hold” status with the reason “RFC response” was contention for RFC resources in the system. The available RFC resources (DIA work processes) were occupied by a storm of RFC calls from a remote SAP system.

4. Impact of SAP background work processes in “on hold” status

The impact is in two areas: SAP background jobs could not start on time due to the shortage of background processes created by the on-hold status, and jobs that did start ran longer if they needed to make RFC calls.

4.1 Long-running background processes

The following screen shows that processing an IDoc took over one and a half hours, while it normally finishes in seconds.

Figure 5 Background process ran longer than normal

Figure 6 SAP STAD – Work process with big RFC time

Based on Figure 6, we know over 99.9% of the runtime was spent on the RFC call. On further review, you can see that RBDAPP01 was making an RFC call which was supposed to be executed locally, and that it spent most of the RFC time waiting. Please refer to my post explaining STAD statistical data if you would like to know the details.

4.2 Long waits for background jobs

The “Delay” column in the SAP SM37 screen shows how long a background job had to wait before it was dispatched to an idle BTC work process. The following screenshot tries to give you an impression of the impact of the issue.

Figure 7 SAP SM37 – Job delay due to shortage of BTC work processes created by RFC issue

You can see that the job delay varies – it depends on the job's scheduled time and the status of the system.

5 Further clarification

Technically, you can fix the issue immediately by terminating the corresponding program/process in the remote system. In our case, I just added 8 more DIA work processes to the RFC resources via RZ12, since almost no dialog users were online at that early-morning hour; the issue went away by itself about half an hour after the addition. In such a situation, you can sometimes just wait and let time cure the issue, if there is no business impact. Note that too many RFC work processes can sometimes cause SAP work process deadlocks.

An SAP work process can be in “RFC” status for many reasons – network problems, remote server issues, expensive RFC calls, etc. In this case, it was due to a local resource shortage.

Normally, more BTC work processes than DIA work processes are seen left in “on hold” or “stopped” status. A DIA work process is rolled out when it is in “waiting” status after the connection is established, so it becomes available for other online/RFC requests. This is typical SAP design: interactive activities executed by online SAP users are conducted through DIA work processes, and you normally have many more online users than available DIA work processes.

Further follow-up is needed to understand the interface configuration, volume control and end-user activities in the remote system, so that a system configuration tuning or program design change can be made to avoid a recurrence of this issue.

Mass vs. individual deletion – does it matter to SAP program performance?

My previous post, “SAP ST12 Trace – SQL performance analysis”, mentions that we can use mass/array database operations to replace many individual executions of the same SQL statement to improve database access performance. This post goes through a true business case where code was changed to use an array/mass SQL deletion in place of a repeatedly executed single-row deletion. After the change, the business job runtime improved by over 90%. This post covers:

  1. The background of SAP ABAP program performance tuning,
  2. The solution of SAP program performance tuning and
  3. The result of SAP program performance tuning.

1. The background

The volume of a business process was going to increase. To prepare for the higher volume, we were asked to check whether the program's performance could be further tuned. For that purpose, the business program was tested and traced via the SAP ST12 transaction. Figure 1 below is part of the ST12 ABAP trace screen showing the top hit list based on “gross %”, and Figure 2 is part of the SQL summary screen for the same execution.

Figure 1 SAP ST12 ABAP trace – many executions of SQL deletions

Figure 2 SAP ST12 SQL Summary

From the above trace, we can clearly see that improvement can only come from changes that reduce database time, which accounts for 99.4% of the runtime. The database time spent by the program comes from 3 SQL DELETE statements in function module /Ortec/TLO_HLP_SSCR_VAR_DELETE, which is called twice by ABAP form f_completed (see Figure 3).

Figure 3 ABAP source code

What is the problem here? What is the solution?

2. The solution for ABAP performance improvement

Based on tips from “SAP ST12 Trace – SQL performance analysis”, an expensive SQL statement can be reviewed in the following areas to see whether it can be improved:

  • Check database table access strategy.
  • Review identical access.
  • Review table buffer strategy.
  • Review number of SQL execution.
  • Review data returned.
  • Review table data volume management strategy.

In this particular case, based on Figure 1 and Figure 2, we can see why those 3 DELETE statements are so expensive – over 3,600 executions. Each individual execution is efficient: on average it took less than 1.2 ms per deletion (based on the ABAP trace and SQL summary). There are no identical selections. The database uses the primary index to execute the 3 DELETE statements, whose WHERE clauses match the primary index. So this is neither a table access strategy issue nor an identical selection issue. Based on the business process, the corresponding table changes very often, so table buffering is not applicable. Table size is not relevant either, since records are deleted via the primary key and the table is not big. So we need to review the number of SQL executions to see whether they can be consolidated.

When we reviewed the source code (see Figure 3 above), we found that the function module has the simple task of deleting table entries using a single value, and that the f_completed form which calls the FM is itself called from a loop. The FM is 3rd-party code. So the proposal was given to the developer to change the program logic and replace the individual deletions with a mass/array deletion.

Based on the above input, the program logic was changed. The ABAP form f_completed is no longer called inside the original loop; instead, the records for deletion are collected into an internal table in the loop. The form was rewritten to use mass database deletion based on that internal table instead of calling the 3rd-party code (Figure 4). A simplified sketch of the pattern follows the figure.

Figure 4 ABAP mass deletion based on internal table
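The actual change lives in the Figure 4 screenshot; here is a minimal before/after sketch of the pattern it implements. ZVAR_TAB, VARNAME and the internal table names are invented stand-ins for the real objects:

TYPES: BEGIN OF ty_comp,
         varname TYPE c LENGTH 14,
       END OF ty_comp.
DATA: lt_completed TYPE STANDARD TABLE OF ty_comp,
      ls_completed TYPE ty_comp.

* Before: one DELETE round trip per loop pass (over 3,600 in this case).
LOOP AT lt_completed INTO ls_completed.
  DELETE FROM zvar_tab WHERE varname = ls_completed-varname.
ENDLOOP.

* After: collect the keys inside the loop, then issue one array
* deletion by primary key outside it.
DATA: lt_del_keys TYPE STANDARD TABLE OF zvar_tab,
      ls_del_key  TYPE zvar_tab.

LOOP AT lt_completed INTO ls_completed.
  ls_del_key-varname = ls_completed-varname.
  APPEND ls_del_key TO lt_del_keys.
ENDLOOP.

DELETE zvar_tab FROM TABLE lt_del_keys.  " one mass DELETE, one round trip

The array form hands the whole key list to the database in a single statement, so the per-execution overhead that dominated the trace is paid once instead of thousands of times.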

So what is the result of those changes?

3. The result of ABAP performance tuning

Figures 5, 6 and 7 below show when the above change was implemented in our production system, together with a comparison of job runtime and table access performance. In this particular case, this simple change delivered up to a 98% performance improvement after it was moved into production on May 25.

Figure 5 ABAP version management – the date when the mass-deletion change went in

Figure 6 Job runtime comparison – before and after performance tuning

Figure 7 STAD – table access comparison before and after performance tuning

Figure 8 Monthly resource utilization under individual deletion

Figure 9 Monthly resource utilization under mass deletion

Figure 8 and Figure 9 show the monthly resource utilization of the same program before and after the change – over 150 hours of database time saved per month with the mass deletion.

Further clarification

Purely from a runtime point of view, the job could finish in 4 minutes prior to tuning. We tend to think there is not much tuning opportunity in such a fast program. This case tells us that a program finishing quickly does not mean its performance is perfect, unless it has gone through performance testing or was designed/coded by a performance expert. A program may run fast simply because of low volume or simple processing, not because of sound performance design/code.

The reward from tuning a quick-running program depends on how frequently it is executed. The priority of tuning a program depends on the business performance requirement and the program's resource footprint. Performance tuning normally focuses on programs which cannot meet business runtime requirements or which use a lot of system resources.

Performance tuning can be an iterative process. If you would like to know more, please see my post on the SAP ABAP program performance tuning process.

Why would SAP job runtime suddenly jump well above normal range?

The business was complaining that the runtime of a periodic SAP background job had suddenly jumped in the production environment. There was no code change and no volume change, and other jobs and transactions were running well. They were asking why this happened. I was consulted and looked into the situation, finding that the issue was related to an Oracle execution plan change – Oracle had chosen a suboptimal execution plan. The plan change was triggered by a regular table statistics update. Job runtime returned to normal after the plan was switched back to the better one by restoring the previous table statistics. This post focuses on how I troubleshot this particular SAP job performance issue. It covers:

  • The performance issue – the job was running up to 10+ times longer than it used to,
  • The performance troubleshooting – why was this job running long?
  • Going deeper – why is the execution plan not efficient?
  • Fixing the performance issue – action and result,
  • What we learn from this case and
  • Further clarification.

Please continue reading should you be interested in the details.

1 The performance issue – the job was running up to 10+ times longer than it used to

A job which runs every 15 minutes and used to finish in less than 300 seconds was now taking at least 600 seconds and up to 2,300 seconds to complete. Please refer to the following screenshot for a better understanding of the issue.

Figure 1 SM37 job log – runtime history

Figure 1 shows that the job runtime jumped starting at 00:15:02 on Apr 24, 2015. There was no code change, no volume change, and the system had no resource contention… Why would the job runtime jump so much?

2 The performance troubleshooting – why was this job running long?

Here, a performance trace was run on the job to find out where the program spends its time. The trace showed that the job spent most of its time executing one SELECT statement, and the SQL summary for the trace indicated that the number of records returned by the SQL was not high – so this was unlikely to be a volume issue. I then examined the SQL execution plan and identified that the hash join chosen by Oracle might be a bad choice. I checked the execution plan history and found that the timing of the plan change correlated with the time the job started to have the performance issue. Further, I found that the statistics of one of the underlying tables had been updated immediately before the moment the execution plan changed. With that, I concluded that the longer job runtime was due to a table statistics update, which led Oracle to reparse the SQL, change the execution plan, and settle on a suboptimal one.

2.1 Where is time spent by the long running job?

The performance trace showed that the job was spending 99.9% of its runtime on database operations, as below.

Figure 2 ST12 ABAP trace – why would job runtime jump

Based on the above trace, 99.6% of the time went to a single SQL statement, “Select VTTP”. So this SQL is responsible for the job's long runtime. Why was execution of this SQL taking so long?

2.2 Why is this SQL execution running long?

Was this due to an inefficient index, or because a lot of database entries were retrieved? We need to check that with an SQL trace. Following is the SQL summary from the SQL trace.

Figure 3 SQL summary – why would job runtime jump

From the above trace, we know the number of records fetched from the database by the top expensive SQL is “0”, and that the SQL was executed only once. So it is not a case of too many records being fetched. Was the wrong index used?

Now let’s review the execution plan…

Figure 4 sql execution plan – why would job runtime jump

The index used looks OK in Figure 4… but when I compared the number of records (0) actually returned by the SQL in Figure 3 with the number of records (7,971) projected by Oracle during plan generation, I smelled something wrong… This looks like an Oracle execution plan issue. Since this job had been working well, it must be due to a recent execution plan change for this particular SQL. Is that true?

2.3 Was the SQL running long due to an Oracle execution plan change?

Using program /SDF/RSORADLD_NEW (there are also SQL scripts available for this purpose), I got the SQL plan history shown in Figure 5 (right side).

Figure 5 Timing of job runtime change and timing of execution plan changes

Figure 5 shows that the time when the execution plan was refreshed and the time when the job started to run long are closely related. It could be that Oracle reparsed the SQL statement and refreshed the execution plan before the SQL was executed. But Figure 5 does not tell us which execution plan was used to execute the job prior to Apr 24, 2015.

The following load history data shows that executions of this SQL were first captured on Apr 24.

Figure 6 SQL load history – why would job run time jump

The job is executed every 15 minutes, every day. Yet the load history of this particular SQL was only captured from Apr 24. In our SAP system, we follow the normal approach – only the top expensive SQL statements are captured into each snapshot. So this means that this particular SQL was NOT one of the top expensive SQL statements prior to Apr 24.
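If /SDF/RSORADLD_NEW is not available, the same plan and load history can be read straight from the Oracle AWR views. A minimal ADBC sketch follows; it assumes Oracle with the Diagnostics Pack licensed, a database user allowed to read the DBA_HIST* views, and the SQL_ID literal is only a placeholder you would take from ST04 or the trace:

TYPES: BEGIN OF ty_hist,
         snap_end  TYPE c LENGTH 20,
         plan_hash TYPE c LENGTH 20,
         execs     TYPE c LENGTH 20,
       END OF ty_hist.
DATA: lt_hist   TYPE STANDARD TABLE OF ty_hist,
      lr_hist   TYPE REF TO data,
      lo_result TYPE REF TO cl_sql_result_set.

lo_result = cl_sql_connection=>get_connection( )->create_statement( )->execute_query(
    `SELECT TO_CHAR(sn.end_interval_time,'YYYY-MM-DD HH24:MI'), ` &&
    `TO_CHAR(st.plan_hash_value), TO_CHAR(st.executions_delta) ` &&
    `FROM dba_hist_sqlstat st ` &&
    `JOIN dba_hist_snapshot sn ON sn.snap_id = st.snap_id ` &&
    `AND sn.dbid = st.dbid ` &&
    `WHERE st.sql_id = '0abcd1234wxyz' ` &&  " placeholder SQL_ID
    `ORDER BY sn.end_interval_time` ).

GET REFERENCE OF lt_hist INTO lr_hist.
lo_result->set_param_table( lr_hist ).  " bind the target internal table
lo_result->next_package( ).             " fetch the rows
lo_result->close( ).

* A change of PLAN_HASH_VALUE between snapshots marks the plan switch;
* the first snapshot with rows marks when the SQL became expensive.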

2.4 Why would Oracle change the execution plan?

The following screen shows that the VTTK table statistics were updated at 00:08 on Apr 24, 2015. That timing lines up exactly with when the execution plan was changed.

Figure 7 Table statistics update – why would job runtime jump

There are other reasons which can lead to a refresh of an Oracle execution plan. But in this case, the execution plan refresh was the result of the table statistics update, and the refreshed plan was suboptimal – leading to the jump in job runtime.

2.5 Conclusion on why the job was running long

The job ran long after 00:15 on Apr 24, 2015 because it spent significantly more time executing one SELECT statement. The execution plan of this particular SQL was changed/refreshed at 00:16:53 that day. The VTTK table statistics had been updated a few minutes earlier, at 00:08. A table statistics update leads Oracle to reparse the related SQL statements – that is typical Oracle behavior. The new execution plan resulting from the statistics update was not as efficient as the original one, and that made the job run longer.

So why is the new plan not good, and what is the better plan? Let's take a closer look at the execution plan.

3 Going deeper – why is the execution plan not efficient?

In this section, I use sample values for the bind variables, the number of table entries, and ST04 SQL cache data to evaluate the execution plan in question.

In the previous section, we noticed that hash joins are used in several places in the plan. A hash join normally suits operations on two datasets when a lot of records are returned. Let's take a peek at the Oracle bind variables used in this SQL.

Figure 8 Value for Oracle binding variable – why would job runtime jump

A4 is field VTTK-SHTYP and A5 is VTTK-FBGST; the total number of VTTK entries and the number of entries meeting the selection criteria follow below (Figure 10). Reviewing the table definition, FBGST is a field storing the overall status of the shipment and shipment stages; the value “A” stands for “not processed”. So the job appears to be interested only in “new” shipments. Based on the load history and the SQL cache, we know that executions of this SQL returned “0” records most of the time.

Figure 9 SQL cache – why would job runtime jump

The following screenshot indicates the number of records in the VTTK table meeting the selection criteria.

Figure 10 Table size and number of records – why would job runtime jump

In the current execution plan, since a hash join is used, Oracle prepares two datasets before applying the hash operation. When one dataset has no or few records and the other can be read via an efficient index, a nested loop is more efficient than a hash join. So the hash join at step 7 of the plan (Figure 4) is less efficient than a nested loop would be. I also checked the execution plan of this SQL in our testing environment; the plan there was different from our production box. I put both plans side by side in Figure 11.

The plan in the testing box is better: when no records are returned from the join of VTTK and VTTP, the nested-loop read of LIPS is never executed, so the LIPS access is spared.

Figure 11 SQL execution plan comparison – why would job runtime jump

The major difference between the test box and production box plans is highlighted. It is mainly at step 4, where the testing box uses a nested loop while the production box uses a hash join. Because of the hash join, production has to prepare datasets from the LIPS and VTTP tables via another hash operation, which is expensive given the selection criteria and table sizes. With the nested loop, step 9 is spared whenever there are no records. You might agree with me that if Oracle replaced all hash operations with nested loops here, the performance could be even better.

4 Fix the performance issue – action and result

The table statistics were restored to the version prior to the problematic update… after that, the execution plan in our production box changed to the plan we were seeing in our testing box. The job performance was back to normal.

Figure 12 Job runtime back to normal with restored execution plan

Figure 13 plan history after restoring statistics

The plan history shows that restoring the statistics changed the production plan to the same one we were seeing in the test box.

5 What do we learn from this case?

An unexpected SQL execution plan change is very likely the reason a job's runtime jumps, especially when the following conditions are met:

  • Other jobs or programs are running as normal,
  • The job spends most of its time on the database side, or its database time has increased significantly (check STAD or ST03N),
  • Most of the database time spent by the job is due to database read operations (check STAD) and
  • There is no code change, job variant change or volume change.

You can validate whether there was a plan change by checking the plan history, and then check whether the timing of the plan change matches the moment when the job runtime started to jump.

Oracle regenerates an SQL execution plan when the underlying table statistics are updated. A table statistics update is the typical reason for an execution plan refresh. You can validate the timing of the statistics update against the timing of the execution plan change. SAP application performance issues related to table statistics changes can be fixed by playing with table statistics to influence the SQL execution plan.
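Had restoring the statistics not been an option, a database hint embedded in Open SQL via SAP's %_HINTS extension would be another way to steer Oracle toward the nested-loop plan. The sketch below is not the program's actual SELECT – the field list and variables are simplified – it only shows the pattern:

DATA: lv_shtyp      TYPE vttk-shtyp,
      lt_deliveries TYPE STANDARD TABLE OF vttp-vbeln.

* FIRST_ROWS(1) asks Oracle to optimize for returning the first row
* quickly, which generally favors index access and nested loops over
* hash joins - fitting here, since the SQL returns 0 rows on most runs.
SELECT p~vbeln
  FROM vttk AS k INNER JOIN vttp AS p ON p~tknum = k~tknum
  INTO TABLE lt_deliveries
  WHERE k~shtyp = lv_shtyp
    AND k~fbgst = 'A'
  %_HINTS ORACLE 'FIRST_ROWS(1)'.

Whether a hint is advisable is a separate question – restoring the statistics was the cleaner fix in this case, since a hardcoded hint stays in the code forever.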

6 Further clarification

Oracle SQL plan management can prevent unexpected plan changes and thus avoid SAP application performance issues due to accidental execution plan changes. Many factors can influence a job's runtime, such as an inappropriate code change, lock contention with other jobs, etc.

In this particular case, a performance trace was done to understand the issue. But actually, there was no need for a performance trace to fix it: the SAP tools STAD and ST04, plus tools based on Oracle AWR, should be enough to troubleshoot this type of performance issue. I am writing another post on how to use SAP STAD; in that post, I will give some insight into how to get this done.

It would be nice to know which execution plan was used to execute the job before Apr 24, 2015. Apparently this was not possible in our system: the ST04 and Oracle AWR tools do not show us the complete plan history, and the Oracle plan history table does not contain data showing what the plan was before the issue. What could be the reason? At this moment, I am guessing it might be related to our setting for how many SQL statements are captured into AWR, but I am not sure.

SAP migration and performance – Would use of the SAP MDS migration tool impact application performance?

My customer needed to migrate their regional ECC systems from Unix-based servers to Linux-based servers. Some of the regional systems are very big, with database sizes over 10 terabytes. To reduce system downtime and mitigate the risks of the migration cutover, the customer decided to use SAP's Minimized Downtime Service (MDS) to migrate those ECC systems. I was the single point of performance contact for the migration project. If you are going to use SAP MDS to migrate your SAP system and wonder:

  • whether SAP MDS would impact SAP performance, and how big the impact is, and
  • what you can do to mitigate the negative performance impact of SAP MDS,

then you might be interested in this post.

1 SAP MDS Brief

When an SAP system is down, it cannot process business transactions, so reducing migration downtime is important to business operations. SAP MDS meets this need by migrating the majority of the data from the source system to the target system while the source system remains in normal business operation. At a high level, SAP MDS takes a clean database copy while the SAP system is up. It takes some time for the migration tool to move that copy to the target system; meanwhile, new changes (new records or changes to existing records) from ongoing business operations keep being generated. These changes are captured via database triggers, and MDS transfers them to the target system after completing migration of the initial copy (clone) – this is called “synchronization”. This way, when we need to switch from the old (source) system to the new (target) system, the number of database records still to be migrated is minimal; the targeted synchronization level is 99% before downtime. So the downtime needed to migrate an SAP system from one environment to another is greatly reduced, along with the associated risk. SAP MDS is an incremental migration solution.

MDS uses database triggers and log tables to capture incremental changes. By default, the MDS solution creates a log table and a trigger for each database table. Changes (deltas) on a table are captured via its trigger and recorded in its respective log table for later synchronization.

2 Would SAP MDS impact SAP system/application performance?

It is stated that MDS impacts overall Oracle performance only minimally. But in my experience, we saw a negative performance impact – both technical and on the end business – after SAP MDS was enabled in our system.

2.1 SAP MDS – negative impact on system performance

Based on technical analysis of database performance data, the MDS solution does have an impact on overall system performance. Database triggers were enabled in our production system on Sep 6 as part of the MDS solution. The following chart shows the weekly average response time per sequential read and per change (microseconds/operation). It clearly shows a database performance deterioration after MDS became active.

How would this impact business transaction run time?

2.2 SAP MDS – Negative impact on application performance

The following chart shows screen copies of the transaction profile from two weeks – the week of Aug 11-17 (left), when SAP MDS was not enabled, and the week of Sep 8-14 (right), when MDS was active.

From the above chart, you can clearly see that database time per transaction step in the MDS week is noticeably higher for the top 12 reports/transactions. This contributed to a longer response time per transaction step for all of them except transactions VA01 and ZVOMH. For VA01, the average database time is higher in the MDS week but is offset by much lower CPU time, which leads to a better response time. The better performance of ZVOMH in the MDS week might be due to much lower business volume – its CPU time per transaction step in the MDS week is 404 ms/step, much smaller than the 878 ms/step of the pre-MDS week.

On average, response time per transaction step was 10-50% higher in the MDS week. The impact on individual jobs/programs varies, since each program has a different function, design and code.

You can use SAP transaction ST03N to do application performance comparisons, such as before and after an upgrade. ST03N is also how you get the transaction profile shown above.

2.3 SAP MDS – Negative impact on individual background jobs

If a program/transaction runs 10-50% longer, does this impact business operations? That depends on the gap between the business performance requirement and the program/job's performance position. For example, if the requirement is that a job must finish in 1 hour and the job's maximum duration was 10 minutes prior to enabling MDS, then even if the job takes up to 59 minutes under MDS, it is only a technical impact on job runtime with no business impact. The business might not even notice the runtime increase.

The following table shows the average and maximum runtimes for a list of daily jobs (executed once per day) over two two-week periods in our production environment.

Business background job | Avg runtime, pre-MDS | Avg runtime, post-MDS | Max runtime, pre-MDS | Max runtime, post-MDS
R_XN_ON_IDOC_POST | 1,207 | 1,459 | 1,802 | 1,871
PP_2SKU1_SPLIT1 | 467 | 624 | 1,552 | 2,877
PP_MPSMRP_NA | 5,851 | 7,642 | 6,364 | 9,443
PP_ACTION_CNTL_FR_MRP | 1,247 | 1,417 | 1,300 | 1,604
R_NA_BWINV_DFC_CLEAR | 1,204 | 3,016 | 1,324 | 9,016
R_NA_BWINV_DFC_01 | 2,782 | 5,867 | 3,974 | 8,261

(All runtimes in seconds; each period covers two weeks.)

The above table shows that SAP MDS made average job runtimes 13%-150% longer and made the maximum job runtime up to about 6 times longer! The jobs in the table run in sequence – they are the critical path of a job chain. The business has a cut-off deadline for the job in the last row. Since each preceding job ran longer after SAP MDS was active, the accumulated effect on the last job was big – this resulted in a fatal miss of the cut-off time.

So now you have seen data showing that SAP MDS can impact performance. What can we do to mitigate the negative performance impact of SAP MDS?

3 How can we avoid SAP MDS's negative impact on SAP performance?

As stated above, SAP MDS has to create database triggers to catch changes and log tables to store them. Each change is stored in a log table in addition to the normal transaction table, and MDS needs to read the log tables and transfer the changes to the target server (synchronization) via parallel processes.

MDS activities use some amount of system resources. Make sure that your system has enough resources (CPU, memory, disk space, network bandwidth) and sufficient SAP work processes to handle MDS activities.

Prior to MDS, we should tune system settings to cater for the additional objects and load from MDS:

  1. SAP memory tuning, such as the SAP nametab buffer, to cater for the additional ABAP table objects,
  2. Database buffer and redo/log space for the additional table objects and the additional reading/updating of log tables,
  3. Identify the most frequently changed tables and their jobs,
  4. Remove the most frequently changed tables from the MDS solution – especially where the number of changes is high and the table is small. You do not want to remove a huge table from the MDS solution, as copying it over might need more downtime – against the objective of SAP MDS,
  5. Design the parallel process schedule used in SAP MDS data synchronization according to system load. Changes are transferred to the target system by parallel processes, and the number of parallel processes used can impact application performance. It is better to use fewer parallel processes in performance-critical periods and more in non-critical periods. For example, we allowed SAP MDS up to 10 parallel processes for data synchronization in the critical period, 35 during business hours and 45 at other times, and
  6. Build hourly system monitoring based on pre-established performance KPIs, such as log file synchronization, so SAP MDS activities can be adjusted in time to avoid a negative impact on performance.

From an application point of view:

  1. Review critical jobs and their performance position – identify potential victims of MDS activities,
  2. Take action to address performance concerns for potential victims before SAP MDS is enabled in your system:
    1. Schedule changes – start jobs earlier,
    2. Remove unnecessary dependencies,
    3. Speed up processing by breaking down the original volume and engaging parallel processing, such as multiple jobs,
  3. Suspend non-critical jobs which make massive database changes and
  4. Monitor critical application performance closely after SAP MDS is active, so potential performance issues can be detected early for action.

In the above case, the performance issue was reported during a mock migration. We addressed it by removing one step from the chain, since analysis indicated there was no absolute business need to build that job into the job chain. We also split the last job into 3 jobs, disabled the table trigger on the table most changed by the program, and redesigned the plan for the parallel processes used in the MDS data synchronization stage. With those changes, the performance issue was addressed, and the job finished before the cut-off time in the 2nd mock migration as well as in the migration itself.

SAP System Log Review – SAP Terminal in status Disc

Recently I was involved in reducing the SAP SM21 system log volume. One of the top messages in our SM21 log was “Terminal ##### in status DISC” together with “Delete session ### after error ###”. About 20 such messages were generated hourly under a particular user. I looked into this and fixed the issue. I also looked into other SM21 log messages – “Transaction canceled” and “Perform rollback”. In this post I share my understanding of those errors and how they were addressed:

  • SAP SM21 log – “Terminal ##### in status DISC” & “Delete session ### after error ###”,
  • SAP SM21 log – “Perform rollback” and
  • SAP SM21 log – “Transaction canceled”.

1 SAP SM21 log – “Terminal ##### in status DISC” & “Delete session ### after error ###”

1.1 Overview of investigating “Terminal in status DISC” & “Delete session after error ###”

“Terminal in status DISC” means a broken connection between the SAP application server and the front end, such as SAP GUI. For “Delete session after error ###”, the error in this case is 023, which means execution of a database operation was terminated before it could complete normally. In our case, HP Quality Center is used to monitor system status: VuGen scripts are executed from several servers to log on to SAP and simulate online user transactions. The solution was to make sure the scripts are executed sequentially instead of in parallel, and to let each script wait for completion of the SAP transaction before it exits from SAP. That solution fixed the issue. Details are covered in the following sections.

1.2 The SM21 log

The following is a combination of the SM21 system log screen and the STAD transaction statistics screen for the same period, 01:00 – 02:00. Both screens are truncated. STAD clearly shows that transaction VA03 was repeatedly executed. The first “Terminal 00397 in status DISC” was logged at 01:07:34; prior to that, there were several VA03 executions without any SM21 log message.

Figure 1 SAP SM21 Terminal in status DISC

Further review of other hours indicated that this SM21 log happened hourly, but within each hour it occurred at random times, and the logs related to 5 different terminals.

“Terminal in status DISC” and “Delete session 001 after error 04” form one pair of messages; “Delete session 001 after error 023” comes alone. Please refer to Figure 2 for a sample.

Figure 2 SAP SM21 Delete session after error 023

1.3 What do the SM21 logs “Terminal in status DISC” and “Delete session 001 after error 023” mean?

“Terminal in status DISC” means that the SAP system is trying to send data to the terminal (client) but the terminal is DISConnected from SAP. This can have many causes. Analysis of the statistics records (STAD) under user HPMONITOR indicated that the VA03 transaction was executed from several clients/terminals. The owner of these executions confirmed that the VA03 transactions were executed via VuGen (Virtual User Generator) scripts from LoadRunner. Analysis of the statistics records also indicated concurrent execution of those VuGen scripts, all logging on to SAP under the same dialog user account. The SAP user license agreement does not allow sharing a user account – that is, the same user ID logging on to the SAP system twice at the same time. I think the concurrent logons via VuGen scripts from different PCs contributed to our issue. Based on that, we rearranged the execution of the VuGen scripts to ensure that only one script is executed at a time from one location.

With the above changes, the “Terminal in status DISC” log no longer appeared hourly, but it did still happen occasionally on a daily basis. We had another error message, “Delete session 001 after error 023”, which was then the number 1 message logged under the user HPMONITOR. The SAP message for error 023 says “process restarted”, which really does not give any direction, so I reviewed the developer trace. Following is a screenshot of the error message log and the developer trace (accessed via SAP transaction SM50).

Figure 3 Delete session after error 023 and SQL error 1013

From the developer trace, we could see that Oracle error 1013 was logged together with “Delete session 001 after error 023”. ORA-01013 is “user requested cancel of current operation”; it forces the current operation to end. There is no cancellation command in the script, and the message was logged randomly across executions of the same script. Why would execution of the script behave this way?

I think this might be because the script was sending a logoff signal to the SAP system while the operation in SAP had not yet completed. Depending on system status, the same operation in SAP can be faster or slower, and the logoff signal from the script can arrive faster or slower depending on network traffic. So it looks to me as if this happened whenever the logoff signal from the script arrived at SAP before the previous step completed. Adding a wait in the VuGen script prior to logoff should help, so the VA03 VuGen script was modified to wait a few seconds before the logoff step.

1.4 Result

With the above changes, execution of those scripts no longer generates any SM21 logs in our system, as shown in the following screenshot combining the SM21 log (no entries under HPMONITOR) with the execution of those scripts (STAD).

Figure 4 SM21 – Terminal in DISC – issue fixed.

1.5 How to reproduce the SM21 message “Delete session after error 023”

Log on to SAP via SAP GUI from your Windows PC and run SE16 against a big table, querying data without using a key/index field. While it is still querying data, terminate your SAP GUI via the Windows Task Manager. You will then reproduce the same SM21 log and a developer trace showing SQL error 1013, as shown in Figure 5.

Figure 5 SM21 – simulate SQL 1013 error

Alternatively, I believe the message “Delete session ### after error 023” can be simulated by terminating a running SAP work process in the middle of an SQL operation.

There are similar SM21 messages, like “Connection to user, terminal lost”. The root cause is normally outside the SAP system, related to specific user/program behavior.

2 SM21 log – Perform Rollback

Here I show, via an example, how to trace back to the job/program producing the SM21 log entry “Perform rollback” – in other words, how to find the parent of the “Perform rollback” message.

Steps used to find the SM21 log's parent: SM21 -> /SDF/MON -> SM37 -> WE02/WE05

You can use the timing, user, work process number and server information to search for the job/program name in the performance snapshots generated via SAP transaction /SDF/MON, as shown in Figure 6.

Figure 6 SM21 log and /SDF/MON – job which generated the “Perform rollback” log

Then you can use SM37 to find the specific job, as shown in Figure 7.

Figure 7 SM37 Job details

Then you can check the job log or job spool from SM37, depending on the situation. Here is the job spool:

Figure 8 SAP job spool

Figure 9 Job spool details

From Figure 9, you can see that IDoc 0000001390670257 is in status “51” – not posted. Using WE02/WE05 or SE16, you can check the IDoc status message and see that the IDoc was processed again and again, each time issuing a “Perform rollback”. Based on the IDoc, you can find the corresponding function module used to post it.

Figure 10 Idoc status – technical info

Figure 11 Idoc was repeatedly processed


3 SM21 – transaction canceled

A transaction can be canceled for many reasons: program code (e.g. an “A”-type message instead of an “E”-type), an SQL error (e.g. DBIF_RSQL_SQL_ERROR), a memory issue (e.g. SYSTEM_NO_ROLL), manual cancellation and so on. Following are screens showing the “Transaction canceled” message.

Figure 12 SM21 – transaction canceled

If you would like to trace back which job and program relate to a particular “Transaction canceled” line when there is no short dump, here are the steps:

SM21 (timing, user, SAP work process, server) -> /SDF/MON (timing, server, job name, work process) -> SM37 (verify job status)

Figure 13 SM21 – Transaction canceled and its parent

When there is a short dump related to an SM21 log entry, you can use SAP transaction ST22 to find the parent of the SM21 entry. With the job name, timing, server name and work process number, it is possible to associate an SM21 log message with a specific job execution via SAP transaction SM37. Figure 14 shows the specific job instances related to the highlighted entry in Figure 13.

Figure 14 SAP SM37 JOB details

The message “No delivery items found” showed up in the job log (Figure 15) with message type “A”. That is why the message went to the SM21 log. So this is an application issue.

Figure 15 SM37 job cancelled under SAP message type “A”

Figure 16 shows that whenever the job was canceled, a related message was logged by the SAP system.

Figure 16 SM21 log and SM37 Job

From Figure 16, we knew that the job was executed every 20 minutes and failed almost every time. Apparently, this was an application issue. In further analysis, I found that it was due to the same problematic IDoc. Making the message disappear from the SM21 log is a simple fix, but that would not solve the data issue, nor the question of how the program design should handle exceptional cases. From a job setup point of view, the frequent cancellations can be avoided by excluding the problematic IDoc from being processed each time.

In this post, I shared several cases showing how to review the SM21 message log, trace messages back to their originators, and fix them. Sometimes SM21 messages are logged even though everything went normally from the end-business view (in our case, HPMONITOR); but in most cases, users should notice something abnormal, or it can be caught from an application point of view, like SM37 job cancellations. Monitoring and fixing job cancellations via SM37 can make the corresponding SM21 logs disappear, if there are any. So I am not suggesting that we should start with SM21 to fix an issue. The preferable starting point is for the application team to monitor the health of their operations closely and fix any issue properly. Even when there is an SM21 log entry related to an operation, it may be all right and nothing to worry about.


SAP Performance issue from SAP server upgrading

There were performance issues in a production system – longer runtimes for several business transactions and server overutilization – after the SAP application servers were upgraded, even though the upgraded system had higher capacity measured in SAPS and the new servers were faster. Testing done during the investigation showed that a flaw/limitation in the SAP kernel was the culprit. The flaw/limitation lies in accessing a generic-buffered table without the primary key: the CPU time needed grows exponentially as the number of concurrent executions of such inefficient SQL statements on a single server/instance grows. In the upgrade project, the number of SAP application instances was reduced because the new servers were more powerful. This reduction put more concurrent activity on each server than before the upgrade, including more concurrent executions of the inefficient SQL statements. So each program execution consumed significantly more CPU time after the server upgrade, which led to a negative end-user experience for such transactions and quickly consumed the additional capacity the upgrade had brought. The server overutilization in turn had its own negative impact on performance.

Here I share how we reached this conclusion:

  1. SAP upgrade and performance issue,
  2. Investigation summary,
  3. Relation between job runtime and number of concurrent jobs,
  4. Is the generic buffer access issue an OS or SAP kernel issue? and
  5. Solution.

1 SAP Upgrading and performance issue

Our customer upgraded their SAP application servers. After the upgrade, the number of SAP servers/instances was reduced from 5 to 3, but the total SAPS of the new application servers was higher. With the new servers we had more capacity, and each new server was faster, so we were all expecting better application performance.

However, SAP application performance was mixed after the upgrade. Many transactions saw better performance (reduced runtime) as expected, but several daily transactions/programs – executed online by many SAP users and scheduled in many background jobs – often (though not always) saw much longer execution times without any other change, such as code or volume. Further, the system was even more heavily loaded than it used to be, despite the extra capacity brought by the upgrade. This was a surprise to everyone – why?

2 Investigation summary

I investigated the upgrade performance issue. Workload analysis confirmed that some programs/transactions were indeed more expensive, in terms of average response time and CPU time, in the post-upgrade period than before (SAP transaction ST03N can help here), and that most of the “extra” time went to the CPU side. Job runtime statistics then showed that a program could spend over 90% of its time on the CPU side (SAP transaction STAD can show you this). An ABAP trace showed that over 80% of the runtime was spent in one SQL statement accessing a generic-buffered SAP table (ST12 gives you the runtime distribution among program components and ABAP statements; SE11 tells you the table's technical settings, buffered or not). Checking the SQL statement and the table definition revealed that the program was not coded properly to read a generic-buffered table with over half a million entries – the primary key was not specified in the WHERE clause. The SQL is coded in an ABAP function module called by the impacted transactions and programs. I found that hundreds of business users executed those transactions/programs daily (ST03 can help here) and that they were also scheduled in many background jobs (you can find the jobs related to a program via SM37, SE16 or ST03N). The number of concurrent executions of the impacted programs at instance/server level was higher since the number of servers had been reduced (/SDF/MON can be used to validate this). The analysis showed that the generic buffer access issue existed before the upgrade; the reduction of servers/instances simply resulted in a higher number of concurrent executions of the bad SQL statement on a single server. The higher that concurrency, the worse the program's performance, consuming much more CPU power to finish the same amount of business work. That led to longer application runtimes and an overloaded system.

I designed test cases to validate this, and they confirmed the observation: job runtime (mainly CPU time) grows exponentially as the number of concurrent executions of the inefficient SQL statement goes up. Further cases were designed to determine whether the issue sits at the OS level or the SAP level. My testing showed that the issue is related to the SAP kernel rather than the OS (operating system software), and that it occurs when the “opportunity” is right:

  • The program accesses a table without specifying the primary key,
  • The program accesses the table repeatedly, namely in a loop,
  • The table is generic buffered,
  • The table is not “small” – thousands of records or more and
  • The program is executed concurrently by many users or background jobs on a single server.

Based on the above, a solution was proposed and adopted for the remaining SAP system upgrades to mitigate the negative impact on user experience arising from the physical server reduction. More is covered in the following sections.

3 Generic buffer testing – relation between job runtime and number of concurrent jobs

I used a reporting tool in our environment, ZKXXPROFITCENTER, to validate how concurrent execution of inefficient buffer accesses can impact job runtime and CPU utilization.

ZKXXPROFITCENTER reports the profit center for a material/product. The profit center is stored in a custom table which is fully buffered and has about half a million records. The program has the following steps:

  1. Get a list of materials/products based on the selected plants (over 50K materials selected),
  2. For each material in the list, read the fully buffered table to get the new profit center. The program does not use the primary key to read the buffered table and
  3. Report the result.

The program spends most of its time in step 2, on the CPU side, so it is a perfect fit for the needed testing. Schematically, step 2 looks like the sketch below.
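In this sketch ZPRCTR_MAP and all names are invented stand-ins, since the real code appears only in Figure 4 of the last section. The point is that the WHERE clause does not supply the buffered table's primary key, so every pass of the loop scans the whole table buffer sequentially on the application server – pure CPU work:

TYPES: BEGIN OF ty_mat,
         matnr TYPE c LENGTH 18,
         prctr TYPE c LENGTH 10,
       END OF ty_mat.
DATA: lt_materials TYPE STANDARD TABLE OF ty_mat,  " filled in step 1
      lv_new_prctr TYPE c LENGTH 10.
FIELD-SYMBOLS <ls_mat> TYPE ty_mat.

LOOP AT lt_materials ASSIGNING <ls_mat>.           " > 50K materials
  " ZPRCTR_MAP: fully buffered, ~500,000 rows. OLD_PRCTR is not part
  " of the primary key, so the table buffer is scanned sequentially
  " on every single execution.
  SELECT new_prctr FROM zprctr_map
    INTO lv_new_prctr UP TO 1 ROWS
    WHERE old_prctr = <ls_mat>-prctr.
  ENDSELECT.
  <ls_mat>-prctr = lv_new_prctr.
ENDLOOP.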

My test checks how the program responds to increased concurrency. Since online execution is normally subject to a timeout, I scheduled the test cases as background jobs – one job at a time, then 3 jobs at the same time, followed by 6 jobs at the same time. Each job ran the same program with one common variant, to ensure there was no volume difference among the executions. I also specified the same target application server for all jobs, so the concurrent jobs would run on the same server. After testing, I recorded each job's response time, CPU time and database time, and calculated the averages. The following chart shows the result.

Figure 1 Relation between Number of Concurrent buffer access and their job runtime

The chart above clearly shows the exponential growth of response time and CPU time as the number of parallel jobs on the same server increases. The database time portion is not significant, and it is quite stable (between 3 jobs and 6 jobs). Following are the runtime statistics of the jobs on which the chart is based. The job volume shown in the “Transferred Kbytes” column is almost the same for all jobs – expected, given the same program and variant.

Figure 2 Concurrent job and their time statistics

With fewer servers and application instances, users and jobs are more condensed from a single-server point of view, since the upgrade does not change end-user behavior or job scheduling – the number of users and program executions at the system level stays the same. So each execution of the same program now runs longer, spending more time on the CPU side. This indicates our issue is related to a bug/limitation in the system software (SAP or OS) triggered by the SQL statement reading a buffered table without the primary key. That is why there was CPU overutilization after the migration even though we had more SAPS: the higher level of concurrent activity on each server ate up the extra capacity.

But is this limitation on the SAP side or the operating system side? SAP was contacted about the issue; SAP's position was that this was a customer code issue plus buffer swaps – but our system data showed that buffer invalidations/swaps were not a concern here. We got an OS expert to debug this, and the OS expert's data suggested the issue might be related to the operating system. How could we validate this?

4 Is generic buffer access an SAP kernel issue or an OS kernel issue?

Now we try to understand whether this is more an SAP kernel issue or an OS issue. We built two SAP instances on one physical server in our testing environment. Our reasoning: if this is an OS-level issue, we should see similar job runtimes when ZKXXPROFITCENTER is executed with the same variant on the two instances at the same time, since both instances share the same physical server (CPU) and connect to the same database.

Based on that idea, I scheduled two background jobs with the same program and variant used in the testing above – one job assigned to each instance. Figure 3 is the STAD report on the result:

Figure 3 Job runtime comparisons between two instances on the same physical server

The screenshot (Figure 3) shows that even though two identical jobs were running at the same time on the same physical server, talking to the same database server, the two jobs' performance was materially different, instead of showing similar runtime and CPU consumption. The longer the response time, the bigger the CPU-time portion. You can also notice that the runtimes of the two jobs run at 01:00:47 are much closer than those of the other pairs; this is because both instances had less activity at that time. When other activities involving generic buffer access increased, the job ran longer. Jobs ran much longer on the A_04 instance than on the A_10 instance, because more parallel work processes were configured on A_04 to process CIF order-update activities, and order updating accesses SAP pricing tables, which are normally buffered. In all of the above testing, the physical server was not overloaded; you can tell this from the difference between CPU time and “Time in WPs” in the STAD screenshot.

So this suggests to me that it is an SAP kernel bug/limitation rather than an operating system bug.

Someone might wonder whether the reliability of the above tests could be affected by buffer swaps and invalidations. In all tests, I checked that SAP buffer swaps/invalidations were not a concern. This can be validated via SAP transaction ST10 – table call statistics.

5 Conclusion and solution

Now I believe you agree with me on the points stated earlier about when the issue happens:

  • The program accesses a table without specifying the primary key,
  • The program accesses the table repeatedly, namely in a loop,
  • The table is generic buffered,
  • The table is not “small” – thousands of records or more and
  • The program is executed concurrently by many users or background jobs on a single server.

I think we can agree that the relation between total activity in a system and the activity on one instance can be expressed as:

    Activities on a single SAP instance = Total system activities / Number of SAP instances

This is simplified – the load on a single instance is affected by many factors – but the simplification does not affect our discussion. When total system activity stays the same (an upgrade should not change business activities or their pattern), more SAP instances means less activity per instance. In our upgrade, we reduced the number of SAP instances from 5 to 3, a 40% reduction, which increased the activity on each remaining server by roughly two thirds (from 1/5 to 1/3 of the total)… Given the existing SAP kernel flaw/limitation related to generic buffer access, this explains why there were application performance issues and server overutilization after the upgrade. So you would agree with me:

  • The issue is at the SAP instance level instead of the physical server level,
  • The issue is related to the SAP kernel instead of the OS kernel and
  • Reducing the number of SAP application instances during an SAP upgrade is a potential risk until the flaw/limitation in the SAP kernel is fixed.

So what is the solution? It could be any of the following:

  • SAP fixes this by investigating further why concurrent buffer access needs significantly more CPU power,
  • Application teams fix this by identifying such code and changing the code or table settings to avoid hitting the kernel flaw, or
  • The SAP upgrade team maintains a similar number of SAP servers/instances as the system had prior to the upgrade.

It would take time for both SAP and the application teams to fix this, and neither fix is under the control of the SAP upgrade team. Bottom line: an upgrade should at least not have a negative performance impact on business programs/transactions. From that point of view, maintaining a similar number of SAP servers/instances is the best approach to mitigate the generic buffer access performance issue triggered by the physical server reduction. In our upgrade project, we decided to build more than one SAP instance on each physical server to maintain a like number of SAP instances. Using this approach, we smoothly upgraded a big SAP order fulfillment system with a reduced number of physical servers.

6 Further information

Following are the piece of code and the table settings involved in the above test cases.

Figure 4 Problem code and info on the fully buffered table

Actually, the above code is one of two pieces of code related to several transactions and programs whose performance was badly impacted by the SAP upgrade. To fix the issue in this situation, we simply removed the table buffering, since there is already a good index on OLD_PRCTR. Of course, adding a BYPASSING BUFFER clause to the above SQL would fix the issue as well (see the sketch below). The other piece of code was fixed via a code change so that the primary key is used to read the buffered table. Based on the above understanding, two identical instances were built on each new server to stay on the safe side. With all those changes, we put the upgrade issue to rest.
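
For reference, the buffer-bypass variant would look roughly like the following. This is a hedged sketch using the same hypothetical ZPRCTR_MAP table from the earlier sketch, not the code we actually changed:

    " BYPASSING BUFFER makes this one statement read the database directly,
    " so the good database index on OLD_PRCTR is used instead of scanning
    " the table buffer on the application server.
    SELECT * FROM zprctr_map
      BYPASSING BUFFER
      WHERE old_prctr = 'X123'
      INTO TABLE @DATA(lt_map).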

We would have preferred to stay simple – one instance per server – as originally planned, and we might have kept that plan had we not run into the generic buffer access issue caused by inefficient SQL code during the SAP upgrade.

SAP ABAP Program – reading an internal table with binary search vs without binary search

In my SAP experience, I often find that poor SAP job/program performance is due to inefficient internal table handling. This is especially true when most of the program/job runtime is spent on the CPU/ABAP side. One common performance pitfall is inefficient internal table reading. Reading an internal table using a binary search performs much better than reading it without one once the number of entries reaches a certain size (hundreds). Here is a perfect SAP example showing the difference binary search makes – 32,441 microseconds versus 10 microseconds per read. This post goes through the case, covering the following topics:

  1. Huge performance difference between two “READ TABLE” statements on the same table
  2. Why do the two “READ TABLE” statements have such a huge performance difference?
  3. Why couldn’t the two “READ TABLE” statements have similar performance?

1 Huge performance difference between two READ TABLE statements

SAP program /SAPAPO/SAPLCIF_DELTA3 is a top expensive program in our system based on SAP ST03N workload analysis. We analyzed why the program is expensive via an ST12 trace. Following is part of the trace:

Figure 1 SAP ST12 trace on /SAPAPO/SAPLCIF_DELTA3

The Figure 1 trace shows that tracing the program lasted 4,198 seconds. All the time was spent on the ABAP side, with “zero” database and system time. When you review the breakdown of the ABAP time, you find two lines of “Read Table LT_R3_Order”. The first of them spent 4,185 seconds reading the table 129,007 times, while the second spent 1 second reading the table 129,014 times. Reading the same table almost the same number of times from the same program – why is the performance so different?

You can refer to my posts on how to execute an ST12 trace and how to analyze an ST12 trace for details if you are interested.

2 Why do the two “READ TABLE” statements have such a huge performance difference?

To answer that question, let’s review the corresponding code. Figure 2 shows the ABAP code associated with the two “Read Table LT_R3_Order” lines from Figure 1. Reviewing the code, we can see that the faster read uses the key field “ordid” with a binary search, while the slow read uses the key fields “Ordnr” and “Ordtype” without a binary search.

Figure 2 ABAP code – with or without Binary search
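
Since Figure 2 is a screenshot, here is a minimal sketch of the same two access patterns. Only the field names follow the trace; the row type, field lengths and key values are assumed for illustration:

    " Row type is assumed; only the field names come from the ST12 trace.
    TYPES: BEGIN OF ty_r3_order,
             ordid   TYPE c LENGTH 12,
             ordnr   TYPE c LENGTH 12,
             ordtype TYPE c LENGTH 4,
           END OF ty_r3_order.
    DATA: lt_r3_order TYPE STANDARD TABLE OF ty_r3_order,
          ls_order    TYPE ty_r3_order.

    SORT lt_r3_order BY ordid.

    " Fast read (~10 microseconds): binary search on the sorted key, O(log n).
    READ TABLE lt_r3_order INTO ls_order
         WITH KEY ordid = '000000000001' BINARY SEARCH.

    " Slow read (~32,441 microseconds): without BINARY SEARCH the kernel
    " scans the standard table line by line from the top, O(n).
    READ TABLE lt_r3_order INTO ls_order
         WITH KEY ordnr = 'A1' ordtype = 'ELEM'.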

3 Why couldn’t the two “READ TABLE” statements have similar performance?

If you know how binary search works, you know there is a precondition for using it: the table must be sorted according to the key used to read it. If you are going to read an internal table via different keys in different places, you would need to sort the table according to the respective key before each read. Sorting a table takes time, and sorting a table back and forth within a loop could make performance even worse. Following is SAP’s “embedded” explanation of this situation; note that the two READ TABLE statements shown above are inside a loop.

Since there are multiple READ TABLE statements on lt_r3_order with different keys, “we can use binary search only for the one key combination by which we sort.” Sorry for the other reads on the table – they have to suffer. That is what SAP is telling us in this case.

Do you agree with SAP’s statement? In this case, our main concern is to cut CPU/runtime, and I think there are options to improve the situation. If one works, what took 4,198 seconds to complete might finish in 1 minute, if not in a couple of seconds. One direction is sketched below. If you are interested in discussing this, please feel free to send me an email.
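
One such option, as a hedged sketch only (it assumes an ABAP release with secondary table keys, 7.02 or later, and the key by_nr_type is my hypothetical addition, not part of the original program): declare a sorted secondary key for the second key combination, so both reads are fast and no extra SORT is needed inside the loop.

    " Same assumed row type as in the sketch above.
    TYPES: BEGIN OF ty_r3_order,
             ordid   TYPE c LENGTH 12,
             ordnr   TYPE c LENGTH 12,
             ordtype TYPE c LENGTH 4,
           END OF ty_r3_order.
    " The secondary sorted key keeps (ordnr, ordtype) access efficient.
    DATA lt_r3_order TYPE STANDARD TABLE OF ty_r3_order
         WITH DEFAULT KEY
         WITH NON-UNIQUE SORTED KEY by_nr_type COMPONENTS ordnr ordtype.
    DATA ls_order TYPE ty_r3_order.

    SORT lt_r3_order BY ordid.   " keep the primary order for BINARY SEARCH

    " First read stays as before: binary search on ordid.
    READ TABLE lt_r3_order INTO ls_order
         WITH KEY ordid = '000000000001' BINARY SEARCH.

    " Second read now uses the sorted secondary key instead of a linear scan.
    READ TABLE lt_r3_order INTO ls_order
         WITH KEY by_nr_type COMPONENTS ordnr = 'A1' ordtype = 'ELEM'.

A secondary key costs extra memory and maintenance effort, so it pays off when the table is read far more often than it is changed – which is exactly the situation in this trace.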

 

SAP SBWP workflow item deletion performance issue

The performance issue was that our SAP users were taking minutes (sometimes even more than half an hour) instead of seconds to delete a workflow item via SAP transaction SBWP. What was making SBWP take that long to delete a workflow item, and what did we do to fix this performance issue? This post answers those questions, covering:

  • The symptoms of the SBWP deletion performance issue in SM50, STAD and ST12,
  • The root cause of the SBWP deletion enqueue contention, and
  • The solution to the SBWP deletion performance issue and the result.

In summary, we turned off logging of SAP office document deletion activities via SAP transaction SO16, and our performance issue was resolved immediately.

1 The symptoms of the SBWP deletion performance issue

Here I share the symptoms we saw during the performance issue.

1.1 SAP work process monitor – process stuck in SAP report “SAPLSENA”

Start SAP transaction SM50 and keep refreshing the screen; you can see that the impacted user is stuck in SAP report “SAPLSENA”, as shown in Figure 1. You can also see that many business processes were impacted.

Figure 1 SAP SBWP – SAPLSENA issue with many business processes

Figure 2 shows SM50 snapshots of one process (PID 11676) belonging to a single user *133 who was experiencing the performance issue. It shows how the enqueue contention – the process being stuck in report SAPLSENA – stretched the runtime of a single process. SAPLSENA is executed to request a lock on a business data object. You do not normally observe this program in the SAP work process monitor SM50, since a single lock request normally finishes in a few milliseconds, unless there is lock contention or the program is executed repeatedly in a loop to lock many objects.

Figure 2 SAP SBWP deletion performance issue – single work process stuck on SAPLSENA report

1.2 SAP ST12 trace – SBWP spent most of its time executing FM ENQUEUE_E_APPL_LOG

The SBWP deletion process of user *133 mentioned in Figure 2 was traced. Following is the ST12 trace. The trace clearly shows that the runtime was spent on enqueue operations in SAP program SAPLSENA against lock object E_APPL_LOG.

Figure 3 SAP SBWP deletion performance issue – ST12 trace

The enqueue trace in Figure 4 tells us that the contention is on SAP lock object E_APPL_LOG.

Figure 4 SAP SBWP deletion – E_APPL_LOG contention

1.3 SAP STAD statistical records – SBWP spent most of its runtime in enqueue operations

The STAD time distribution screen in Figure 5 relates to the SBWP deletion operation above. It clearly shows that most of the response time of the SBWP deletion was spent on enqueue operations. The operation deleted a single workflow item, yet it executed 73,370 enqueue operations. This indicates repeated attempts to lock an object after earlier attempts to secure the lock on the application object had failed. SAP STAD can tell you where the time is spent when you have a performance issue, which is critical to SAP application performance analysis.

Figure 5 SAP SBWP deletion performance issue – Enqueue time in Stad

Please note that SM12 showed a higher-than-normal lock rejection rate during the period when we were having the SBWP deletion performance issue.

2 Root cause of SAP enqueue contention

So the performance issue was due to SAP enqueue contention on SAP lock object E_APPL_LOG. When a user ran SBWP to delete a document, the program requested a lock on the E_APPL_LOG object; when that object was locked by someone else, the program kept requesting the lock until it succeeded, so the deletion took much longer. But what was leading to the enqueue contention on this SAP lock object?
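
To illustrate the pattern, here is a hedged sketch of what such a lock request loop looks like. ENQUEUE_E_APPL_LOG is the generated enqueue function module named in the trace; I call it here with its default parameters (exclusive lock, current client), the exceptions are the standard ones every generated enqueue FM raises, and the explicit retry loop is for illustration only – verify the exact signature in SE37:

    " Illustration only: request the deletion-log lock and retry on failure.
    DO.
      CALL FUNCTION 'ENQUEUE_E_APPL_LOG'
        EXCEPTIONS
          foreign_lock   = 1     " another user holds the deletion-log lock
          system_failure = 2
          OTHERS         = 3.
      IF sy-subrc = 0.
        EXIT.                    " lock granted – the deletion can be logged
      ENDIF.
      WAIT UP TO 1 SECONDS.      " each failed attempt shows up as one more
    ENDDO.                       " enqueue operation in STAD and SM12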

2.1 Searching SAP OSS notes for the SBWP deletion performance issue

Since this is a standard SAP transaction, the SAP OSS site was searched. SAP OSS note “766112 – SAPoffice: Folders or documents disappear” talks about enabling a deletion log to audit deletion activities. It mentions that SAP transaction SO16 can be used to turn on the deletion log after the related OSS notes are implemented.

The side effect stated in the OSS note is: “Since the system saves the deletion activities of all users in a log, this log must be locked during the deletion activity. This means that only one user at a time can delete documents. Other users must wait until they are assigned to lock the log. This may lead to a performance deterioration during the deletion activity. The lock is set automatically.”

Were deletion activities logged in our system?

2.2 Our system’s SBWP deletion log setting

SAP transaction SO16 can be used to review and maintain the log setting. We had the following setting in our SAP system when we were having the performance issue, so the deletion log was enabled in our system.

Figure 6 SAP SBWP deletion performance issue – Log Deletion of SAP office Documents

The SAP workflow function is heavily used in our customer’s system. Many users and many processes in the system were queueing to get the lock in order to proceed; no wonder many users were reporting performance issues even when they were not running the SAP SBWP transaction to delete a workflow item.

3 Solution to SAP SBWP deletion performance issue and result

So the solution was to uncheck “Log Deletion” so that deletion activities in SBWP are no longer logged. SAP transaction SO16 was used to disable the log. Upon unchecking it, the SBWP deletion performance issue was fixed immediately and all symptoms disappeared.

The root cause of the issue was that someone had turned on “Log deletion of SAP office documents” via transaction SO16. We should not log deletion activities in the business workplace via SAP transaction SO16 in a busy SAP system.

A performance issue around program SAPLSENA is an application-level performance issue caused by contention for an SAP application lock on a business object. Please note that this is not a database-level lock.