Wednesday, October 12, 2011

How to Analyze High CPU Utilization In Solaris

Problem: How to Analyze High CPU Utilization In Solaris

Solution:


Kernel Lock Contention:
• smtx (spins on a mutex lock, as reported by mpstat(1))
• srw (spins on a reader/writer lock, as reported by mpstat(1))
Locks protect critical kernel data structures from simultaneous access; a lock is always needed for a shared writable area. In Solaris, a thread sleeps on one of the following synchronization objects:
• Mutex: Serializes access to critical kernel data structures.
• Condition variables: Implement the kernel sleep/wakeup facility.
• Semaphores: Synchronize access to a shareable resource.
• RW lock: Allows concurrent access for readers, exclusive access for a writer.
Mutex locks are the most commonly used locks in the Solaris[TM] kernel for protecting critical data structures. The smtx event indicates that a kernel thread is spinning while waiting for a mutex: the thread holding the lock is running on some other CPU, and the thread trying to get the lock spins on its CPU, expecting that the mutex owner will release the lock soon. smtx represents an entry point into the adaptive mutex code.
In some cases, we can distinguish between reading a piece of data and modifying a piece of data. For such cases, the kernel uses reader/writer locks (rwlocks). Contention on an rwlock is reported by the "srw" counter. Any number of threads can hold an rwlock for reading, as long as no thread holds it for writing.
There are thousands of locks in the kernel that all get acquired through this same entry point. There is no way to know, just by looking at the smtx or srw count, whether the count refers to attempts to get the same lock, the same 10 locks, or 100 different locks. To find out which kernel resource(s) are seeing contention, use lockstat.
To record kernel lock contention events, such as spins on an adaptive mutex or blocks on an rwlock, type:
# lockstat -o lockstat.out -C -i 997 -s 10 -D 20 -n 1000000 sleep 30
It is recommended to run lockstat(1M) only for short intervals, because it logs the events of interest by modifying the running kernel, which adds overhead.
See Document 1005868.1, A Primer on Lockstat, and the lockstat(1M) man page for details.
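On Solaris 10 and later, the DTrace lockstat provider offers another way to see which kernel code paths are spinning. A minimal sketch, run as root (the five-frame stack depth is an arbitrary choice):

# dtrace -n 'lockstat:::adaptive-spin { @[stack(5)] = count(); }'

Press Ctrl-C after a few seconds; the aggregation prints the kernel stacks that spun on adaptive mutexes, with the hottest stacks last.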
High Paging Activity:
• minf/mjf (mpstat: minor fault/major fault), mf (vmstat: minor fault)
A page fault occurs when a process attempts to access data that it does not currently have mapped. It is considered a minor fault if the data is already in memory and the only thing required is the creation of a new mapping for it. A major fault occurs when the page is not in memory and has to be created, possibly populated with data from the backing store, and then mapped. For example, a process maps the "libc.so" library into its address space and references a page within it. A page fault occurs; however, the physical page may already be present in memory, brought in earlier by some other process, in which case the kernel only needs to establish a virtual-to-physical mapping without performing an expensive I/O operation.
Pages that are no longer mapped by any process but still have a valid vnode/offset to some data are kept in a list called the cache list. If a page is needed that is on the cache list, it can be removed from the list and remapped; this is reported by vmstat in the "re" column as reclaims. Major page faults, in conjunction with vmstat page-ins (described below), should be examined when high system time is reported.
• pi/po (vmstat: kilobytes paged in/paged out)
A page is brought into memory when an application accesses a virtual address that is not mapped to a physical page. Each page of physical memory has an associated file (vnode) and offset; the file and offset identify the backing store for the page, the location from which the physical page contents are migrated in (paged in) or to which they are migrated out (paged out). A process's memory pages that are part of the text or data segment use the filesystem (where the program's binary resides) as their backing store. Memory pages used for the process stack or heap (typically allocated by malloc) are not read from disk but are created zero-filled; should those need to be paged out, a swap device is used. They are also known as "anonymous" pages, as they have no identity in a filesystem.
One can find the source of page-in/page-out activity using "vmstat -p". Page-in and page-out activity at the file system level, reported as "fpi/fpo/fpf", is a sign that an application is actively doing read(), write(), or mmap() operations on files residing on the file system. Anonymous memory page-ins and page-outs are reported as "api/apo/apf"; anonymous page-in/page-out activity is a sign of memory pressure, since I/O to the swap device is an indication of memory shortage.
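The output layout looks like the sketch below; the numbers shown are illustrative only:

# vmstat -p 5
     memory           page          executable      anonymous      filesystem
   swap  free  re  mf  fr  de  sr  epi  epo  epf  api  apo  apf  fpi  fpo  fpf
 2342456 947232  1   5   0   0   0    0    0    0    0    0    0   12    3    3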
High paging activity to and from the file system and swap (anonymous memory pages) can result in high CPU sys time.
Signs of Memory Shortage:
• sr (vmstat: scan rate)
sr is the number of pages scanned by the page scanner (page daemon) per second. On Solaris 8 and above, a non-zero value in the sr column means a shortage of memory. On Solaris 7 and below, the sr column alone cannot identify a memory shortage; both the sr column and I/O to the swap device should be examined.
Multiple factors can cause memory pressure:
• The memory requirement of the workload is larger than physical memory.
• Large memory allocation in the kernel, due to a bug or excessive tuning.
• An application or kernel driver memory leak.
To find the source of large memory allocation in a user application, use the pmap(1), ps(1) or prstat(1M) utilities.
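For example (the process name "myapp" below is a hypothetical placeholder):

# pmap -x `pgrep myapp`        (per-segment resident and anonymous memory for one process)
# prstat -s rss                (all processes, sorted by resident set size)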


For kernel memory allocation, use kstat(1M):

# kstat -n system_pages | grep pp_kernel
pp_kernel 3345881

where pp_kernel represents the number of 8 KB (SPARC) or 4 KB (x86) pages allocated to the kernel.
You can also get a breakdown of memory usage by application, file system and kernel by running the command below:


# echo ::memstat | mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     158364              1237   15%
ZFS File Data              684405              5346   67%
Anon                        30341               237    3%
Exec and libs                5932                46    1%
Page cache                  26912               210    3%
Free (cachelist)            34968               273    3%
Free (freelist)             88081               688    9%

Total                     1029003              8039
Physical                  1013090              7914

Where:
Kernel: Memory allocated by the kernel and drivers
ZFS File Data: Memory allocated for caching ZFS file system blocks
Anon: Memory allocated by applications for heap, stack and COW pages
Exec and libs: Memory allocated for application binaries and libraries
Page cache: Memory allocated to the UFS file system cache
Free (cachelist): Available memory; pages that still have their vnode and offset intact
Free (freelist): Available memory

The page scanner is activated when free memory, represented by the kernel variable
"freemem", drops below the "lotsfree" threshold (lotsfree is set to 1/64 of total
memory). The page scanner steals memory (in 8 KB page-sized chunks) from application
and filesystem buffers that haven't been referenced recently, syncs the pages with
their backing store if they have been modified, and then frees them for other uses.
A high scan rate is a sign that the page scanner is running more frequently and for
longer periods, which in turn drives up CPU kernel time.
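A quick way to compare just these two values is kstat -p, which prints statistics in module:instance:name:statistic form (the values below are taken from the sample output that follows):

# kstat -p unix:0:system_pages:freemem unix:0:system_pages:lotsfree
unix:0:system_pages:freemem     7151671
unix:0:system_pages:lotsfree    257271

As long as freemem stays well above lotsfree, the page scanner remains idle.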

One can monitor memory thresholds by running:
# kstat -n system_pages

name: system_pages class: pages
availrmem 13420400
crtime 163.131762
desfree 128635
desscan 25
econtig 189792256
fastscan 677273
freemem 7151671
kernelbase 16777216
lotsfree 257271
minfree 64317
nalloc 50280640
nalloc_calls 18093
nfree 37487807
nfree_calls 12382
nscan 0
pagesfree 7151671
pageslocked 3029092
pagestotal 16465378
physmem 16487075
pp_kernel 3346379
slowscan 100
snaptime 4952283.6547535

Values are reported in units of 8 KB (SPARC) or 4 KB (x86) pages.

See the following documents for more detail:
1012111.1 Solaris: Discrepancy in freemem and available swap on systems configured without physical swap
1003383.1 Page Cache on Solaris 8 consumes large amounts of memory





Fsflush CPU Usage:

The primary task of the file system flush daemon, fsflush, is to find dirty file
system pages and flush them to the backing store of the file system in which the
file(s) reside. By default, fsflush is configured to wake up every 5 seconds (in
recent releases this was changed to 1 second) and search 1/6th of physical memory
for dirty pages, thus covering all of physical memory within 30 seconds.

On large memory systems, the fsflush daemon will consume a large amount of CPU
if there are many dirty pages in the file system page cache. In some circumstances,
the daemon will not be able to complete its scan of 1/6th of physical memory at
each scheduled wakeup.

The amount of time spent by fsflush can be observed by taking multiple snapshots
with ps(1) and watching the CPU time (TIME column) consumed by the daemon:

# ps -aef | grep fsflush
     UID   PID  PPID   C    STIME TTY        TIME CMD
    root     3     0   0 10:14:13 ?        132:13 fsflush

The time spent in fsflush can also be observed on an otherwise idle machine.
A 96 GB machine with many file system pages will use 100% of one CPU for
approximately 3 seconds each time the daemon wakes up:

# mpstat 5
CPU minf mjf xcal intr ithr  csw icsw migr smtx srw  syscl usr sys  wt idl
  0    4   0  109  201  101   15    0    0   12   0     64   0 99*   0   1
  1   39   0    4    1    1    6    0    0    3   0     92   0   0   0 100
  2   15   0 1473   10   10   14    0    0    1   0     21   2   0   2  95
  3   21   0 1726    1    1   56    0    0   47   0    286   1   0   3  96
  4   22   2  123    0    0   11    0    0    1   0     68   0   0   1  98


A larger value for the autoup parameter reduces fsflush CPU usage, at the cost
of keeping dirty buffers in memory for a longer period. If fsflush runs for a
long period, it can affect application performance, since it runs in kernel mode.
It is therefore recommended to increase the default (30 seconds) to (nGB x 4).
For example, a system configured with 96 GB should have an autoup value of 384.
Edit the /etc/system file, add the following line, and reboot:
set autoup=384
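After the reboot, the live value can be verified by reading the kernel variable with mdb(1); a quick sketch, assuming the standard autoup kernel symbol:

# echo autoup/D | mdb -k
autoup:
autoup:         384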

High Cross-calls:
A cross-call is an operation that must be performed on one, several, or all CPUs.
The kernel uses cross-calls for two main purposes:

• To implement interprocessor interrupts: Interprocessor interrupts are delivered by the kernel using the "poke_cpu()" function, and are used for tasks such as preempting the dispatcher, delivering a signal to a thread running on another CPU, or starting/stopping threads running on a different processor when the /proc interface is used.
• To synchronize the memory management units (MMUs) on all CPUs: When a page is unmapped, due to a user process exiting or issuing an unmap operation, a cross-call is needed to flush the TLB (Translation Lookaside Buffer) entries for the unmapped page from the MMU of every CPU on which the thread last ran.
• The TLB is part of the memory management unit (MMU) and caches translation entries (PTEs on SPARC, TTEs on UltraSPARC). The objective of the TLB is to cache as many recent page translations in hardware as possible, so that a process's or thread's memory access can be satisfied by performing the virtual-to-physical translation on the fly. If the translation is not found in the TLB, the system has to look up the translation in a larger table in main memory, which is several orders of magnitude slower.
• When a TLB entry is invalidated, all CPUs that may have a copy of that entry must be notified to invalidate their copy. This is done via a cross-call, which is implemented with a special cross-trap that records the presence of a cross-call; once the cross-trap is done, it immediately executes the cross-call. The majority of cross-call events are triggered by the HAT (Hardware Address Translation) layer flushing stale TLB entries on other CPUs, either because a process is exiting or because a piece of memory is being reclaimed for other use. The kernel tracks the CPUs on which a process runs, so that when the process exits it can send cross-traps only to those CPUs. For kernel unmapped pages, the cross-call is made to all CPUs. Some number of cross-traps also come from the activities of the cache scrubber, as shown in the DTrace sketch below.
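To attribute cross-calls (the xcal column in mpstat) to the processes generating them, the DTrace sysinfo provider can be used; a minimal sketch:

# dtrace -n 'sysinfo:::xcalls { @[execname] = count(); }'

Press Ctrl-C to print a per-process cross-call count for the sampling period.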

The trapstat(1M) utility (-t option), available in Solaris 9 and above, breaks out TLB misses by kernel and user ("k" vs. "u"), differentiates between TLB misses and TSB misses ("tlb-miss" vs. "tsb-miss"), and reports the amount of time spent in each type of handler (%tim), as shown below:

cpu m| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
-----+-------------------------------+-------------------------------+----
  0 u|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 k|         0  0.0         0  0.0 |        31  0.0         0  0.0 | 0.0
  1 u|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  1 k|         0  0.0         0  0.0 |        11  0.0         0  0.0 | 0.0
  2 u|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  2 k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  3 u|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  3 k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
It is not recommended to run "trapstat -t" during busy hours due to its high overhead (up to 7% on memory-intensive workloads). If you are interested only in a summary of TLB misses on each CPU, run trapstat without any options; this has a relatively lower overhead than the "-t" option.
MPSS (Multiple Page Size Support), introduced in Solaris[TM] 9, improves the TLB hit rate by increasing the page size cached in the hardware TLB from 8 KB up to 4 MB. Also, the LPOOB (Large Page Out Of Box) feature introduced in Solaris 10 uses large pages for process heap, stack and private anonymous memory. Large pages allow better performance for some workloads, because a larger working set can be cached, avoiding frequent TLB misses.
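To check which page sizes a process is actually getting, pmap(1) with the -s option reports the hardware page size of each mapping in its Pgsz column (the pid below is a placeholder):

# pmap -s <pid>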

For detail on how MPSS improves application performance and how to measure TLB hits using trapstat, see Document 1010458.1, Taking advantage of Multiple Page Size Support in Solaris[TM] 9 and later.

Frequency of Interrupts:
• intr/ithr (mpstat: interrupts/interrupts handled as threads)
An interrupt is a mechanism used by devices such as disks and network cards to signal the completion of an I/O or some other activity. An interrupt causes the CPU to stop executing the running thread, since the interrupt requires immediate processing. The kernel services the interrupt by vectoring the CPU's execution to the interrupt handler routine that matches the interrupting device. A high number of interrupts to a particular CPU can be caused by bad hardware, or by binding several devices that generate many interrupts to the same processor.
The "ithr" column records low-level interrupts that are handled as separate threads. Interrupts at level 10 or below are handled the same way as other Solaris threads, and their handler routines can block if necessary, using the regular synchronization primitives described above. In contrast, interrupts above level 10 use (borrow) the LWP of the executing thread and cannot block; the interrupted thread is pinned, which avoids the overhead of a context switch. Each processor maintains a pool of partially initialized interrupt threads, one for each of the lower 9 levels, plus a system-wide thread for clock interrupts. This approach allows simple, fast allocation of threads at interrupt dispatch time, and low-latency interrupt response, without the scaling impact of interrupts blocking on synchronous data access.
The intrstat(1M) utility, available in Solaris 10, reports interrupt statistics. kstat(1M) can also be used to capture per-CPU interrupt statistics, broken down by interrupt level:

level-1 288227202512057
level-10 32914470420237
level-11 0
level-12 108930
level-13 86205745317
level-14 2341214414862
level-15 0
level-2 0
level-3 0
level-4 349787125
level-5 52446672
level-6 1114480
level-7 10737622850102
level-8 0
level-9 697429014107
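A brief sketch of intrstat usage; it reports, for each device instance and each CPU, the interrupt rate and the percentage of CPU time spent in the handler over the given interval:

# intrstat 5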
The trapstat(1M) utility (-i option) on Solaris 9 breaks out the int-vec and level-1 through level-15 traps by interrupt number, programmable interrupt level (PIL), interrupting device, and device instance number, as shown below:
# trapstat -i

inum pil device  |  cpu0  cpu1  cpu4  cpu5
-----------------+------------------------
  db   5 fas#0   |    51     0     0     0
  e2   5 soc#0   |     0     0     0   123
  d3   5 soc#1   |     2     0     0     0
....
 343   5 soc#17  |     0     0     0     0
  dc   7 hme#0   |   125     0     0     0
 1dc   7 hme#1   |     0     0   170     0
 25c   7 hme#2   |    71     0     0     0
...
Number of System Calls:
A system call is the only way for a user application to access kernel resources. When a user program makes a system call, such as opening a file, the open routine (a wrapper for the open system call) pushes the system call number (used to identify the particular system call in the kernel) onto the user stack and then invokes a special SPARC trap instruction, Tcc, also known as a software trap. The primary function of this instruction is to change the execution mode of the processor to kernel mode; the kernel then invokes the system call handler, syscall().
Not all system calls have the same overhead. The type of system call issued by an application, and the time taken to service it, can be monitored with the truss(1) utility:

# truss -c ls
# truss -aefd -o truss.out -p pid_of_process
For details on how to use truss, see:
Document: 1005788.1 Using truss for program and error analysis
Document: 1011708.1 Basics of Using the truss Command
Document: 1007414.1 truss interpretation: an example
truss reports the system calls generated by a particular process. If you are interested in system calls generated system-wide, use prex(1) instead. prex enables and traces TNF probes in a user process or the kernel:

# prex -k
prex> list probes syscall
name=syscall_end enable=off trace=off file=../../sparc/os/syscall.c line=853 funcs=
name=syscall_start enable=off trace=off file=../../sparc/os/syscall.c line=541 funcs=
prex> buffer alloc 2m
Buffer of size 2097152 bytes allocated
prex> enable syscall
prex> trace syscall
prex> ktrace on

Use tnfxtract(1) to dump the buffer containing the syscall probes into a file,
and tnfdump(1) to convert the binary file into ASCII.
The file /etc/name_to_sysnum provides the syscall name-to-number mapping,
which can be used when interpreting the file containing the syscall TNF probe data.
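For example, to look up the number assigned to the open system call (the exact number can vary by release):

# grep -w open /etc/name_to_sysnum
open    5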

Kernel Profiling:
All the system activities discussed above invoke various kernel routines. In Solaris 8 and above, the lockstat(1M) utility has a kernel profiling option that offers excellent tracing inside the kernel. Kernel profiling provides insight into where in the kernel the CPUs are spending the majority of their time, and which kernel routines are monopolizing the CPUs in kernel mode; it lists all the kernel routines active during the sample period and the CPU time spent running them.

To collect kernel profile data for 30 seconds, type:

# lockstat -o lockstat_profile.out -I -i 971 -s 40 -n 720000 -W sleep 30

Instead of recording lock events, lockstat collects profiling data when the "-I"
option is specified. The "-i" option requests that profiling be performed at 971 Hz
(the maximum value is 1000 Hz). By default, lockstat uses 10,000 buckets for recording
profiling events such as PC, CPU and PIL. When lockstat runs out of buckets for
recording events, it prints the message:

lockstat: warning: ran out of data records.

To validate that no events were dropped by lockstat, compare the recorded event
count with the expected number of events, NCPU x frequency x sample period:

$ head -2 lockstat_profile.out
Profiling interrupt: 1165219 events in 600.018 seconds (1942 events/sec)

For example, on a 2-CPU system sampled at 971 Hz for 600.018 seconds:

$ bc -l
2*971*600.018
1165234.956

To make sure there are plenty of buckets available for profiling a busy workload,
specify a larger value with the "-n" option. A simple way to check the number of
buckets actually used:

$ wc -l lockstat_profile.out
6234 lockstat_profile.out


For more information, see the lockstat man page and
Document 1001812.1, How to determine what is consuming CPU system time
using the lockstat command.
Prstat:
When it comes to performance, what matters most is how the application is affected by the activities described above. The prstat(1M) utility has an option to collect microstate process accounting data, which provides information similar to what mpstat and vmstat report. However, instead of system-wide information, the prstat "-m" option reports process-specific metrics (see the usage sketch after this list) such as:
SYS: The percentage of time the process has spent in system mode.
DFL: The percentage of time the process has spent processing data page faults. This is a sign that memory shortage is affecting application performance.
LCK: The percentage of time the process has spent waiting for user locks. This is a sign of resource contention in user code.
SLP: The percentage of time the process has spent sleeping.
LAT: The percentage of time the process has spent waiting for CPU. This means there are too many runnable threads in the CPU dispatch queue competing for CPU resources.
VCX: The number of voluntary context switches.
ICX: The number of involuntary context switches. This means the thread is being preempted by a higher-priority runnable thread or has exhausted its time slice.
SCL: The number of system calls.
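A minimal usage sketch (the pid is a placeholder); adding -L reports each thread (LWP) separately, which helps when only some threads of a process are affected:

# prstat -mL -p <pid> 5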
To discuss this information further with Oracle experts and industry peers, we encourage you to review,
join or start a discussion in the My Oracle Support Community, Oracle Solaris Kernel Community.



References:


How to Analyze High CPU Utilization In Solaris [ID 1008930.1]
