Upon completion of this module, you should be able to:
1. differentiate watchdog resets, panics, and system hangs
2. differentiate hardware and software problems
3. provide examples of fatal and non-fatal error conditions
4. identify a comprehensive set of Solaris commands and utilities that are useful in fault analysis
5. describe the syntax, function, and relevance of each command or system file
6. use Solaris commands and files to determine system configuration and status information
7. solve workshop problems using Solaris utilities and system files
Error categories: software, hardware-corrected, recoverable, fatal, and critical
Error reporting mechanisms: bus errors, interrupts, and resets
Recoverable errors caused by hardware are usually signaled by a bus error posted to the requesting device and a specified interrupt, which could broadcast the error. Error recovery in such cases is normally handled by the trap routines, while error logging is done by the interrupt handler.
Critical errors require immediate attention, system shutdown, and power-off. They are signaled through a high-level broadcast interrupt whenever possible.
A fatal error is a hardware error after which proper system operation cannot be guaranteed. All fatal errors initiate a system watchdog reset. A parity error on a backplane is one example of a fatal error.
Bus errors are one of the mechanisms for error reporting on the system. A bus error is issued to the processor when it references a virtual or physical location and the access cannot be satisfied for hardware reasons. Some typical bus errors are:
Illegal address or internal hardware failure
Instruction fetch or data load
Direct virtual memory access (DVMA) operations on an SBus
Synchronous or asynchronous data store
Memory management unit (MMU) operations
System Watchdog Reset
When a fatal error is detected on a multiprocessor machine, a system watchdog reset is initiated. A system watchdog reset affects all CPUs and I/O devices. Writes in progress may be lost, but the state of main memory is not altered and continues to be refreshed after a system watchdog reset. In most cases, the system watchdog reset condition is hardware related.
The modinfo utility displays information about loaded kernel modules. With no options, it displays all loaded modules with their associated module identification number and module name.
# modinfo
The modload utility loads a kernel module into a running system.
# modload -p misc/obpsym
To load the module automatically at boot, add this line to the /etc/system file:
forceload: misc/obpsym
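Before rebooting, it can be useful to confirm that the forceload entry is actually present and not commented out. The following is a minimal sketch, not a standard Solaris tool: has_forceload is a hypothetical helper name, and it assumes a grep that understands POSIX character classes such as [[:space:]].

```shell
# Sketch: succeed if FILE contains an active "forceload: MODULE" entry.
# has_forceload is a hypothetical helper; FILE defaults to /etc/system.
has_forceload() {
    module=$1
    file=${2:-/etc/system}
    # The pattern is anchored at the start of the line, so an entry
    # that has been commented out with a leading "*" does not match.
    grep "^forceload:[[:space:]]*${module}[[:space:]]*\$" "$file" >/dev/null 2>&1
}

# Usage:
#   has_forceload misc/obpsym && echo "obpsym will be loaded at boot"
```

The helper only reads the file; it does not modify /etc/system.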
The modunload utility unloads a kernel module from a running system. It takes the module identification number reported by modinfo.
# modinfo | grep obpsym
# modunload -i 89
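The two steps above (find the module ID with modinfo, then pass it to modunload) can be combined so the ID is not copied by hand. This is a sketch under assumptions: module_id is a hypothetical helper, it assumes the default modinfo listing with the Id in column 1 and the module name in column 6, and it relies on an awk that supports -v (on older Solaris releases that is nawk rather than awk).

```shell
# Sketch: print the Id of a loaded module, reading modinfo output on stdin.
# module_id is a hypothetical helper. Assumed columns:
#   Id Loadaddr Size Info Rev ModuleName (description)
module_id() {
    awk -v name="$1" '$6 == name { print $1 }'
}

# Usage on a live system (as root):
#   id=$(modinfo | module_id obpsym)
#   [ -n "$id" ] && modunload -i "$id"
```

Matching on the exact module name avoids the false hits that a plain grep can produce when the name also appears inside another module's description.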
netstat -i: lists statistics per interface
netstat -r: lists the routing tables
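For fault analysis, the interesting part of the per-interface statistics is usually the error counters. As an illustration only, the filter below prints interfaces whose input or output error count is non-zero. iface_errors is a hypothetical helper, and the column order (Name, Mtu, Net/Dest, Address, Ipkts, Ierrs, Opkts, Oerrs) is an assumption that should be checked against the header line on the system in question.

```shell
# Sketch: list interfaces with non-zero error counters.
# Reads `netstat -i` output on stdin. iface_errors is a hypothetical helper.
# Assumed columns: 1=Name 5=Ipkts 6=Ierrs 7=Opkts 8=Oerrs
iface_errors() {
    # Requiring numeric fields skips header lines and blank lines.
    awk '$6 ~ /^[0-9]+$/ && $8 ~ /^[0-9]+$/ && ($6 + 0 > 0 || $8 + 0 > 0) {
        printf "%s ierrs=%s oerrs=%s\n", $1, $6, $8
    }'
}

# Usage on a live system:
#   netstat -i | iface_errors
```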
The truss utility, known as trace on Sun's Berkeley Software Distribution (BSD) based systems, traces system calls, library calls, and signal activity for the program passed to it as an argument on the command line. It is extremely helpful in determining how programs execute and in identifying points of failure in programs that return error conditions.
There are two main categories of errors which truss reports:
A system call error, often due to an invalid argument being passed to the system call. The man pages for the system calls are a helpful resource, as is the header file /usr/include/sys/errno.h.
A missing file error, which often shows up in open() system call statements. Usually the executing program needs to open a file that cannot be found, or whose contents are invalid or corrupt.
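Both categories show up in truss output as calls that end in an error code rather than a return value, so a simple filter narrows a long trace down to the failures. The sketch below assumes failed calls are reported with a suffix of the form "Err#2 ENOENT"; truss_failures is a hypothetical helper, and ./myprog and /tmp/truss.out are placeholders.

```shell
# Sketch: keep only the failed system calls from a truss trace on stdin.
# truss_failures is a hypothetical helper. It assumes failures end in
# "Err#<number> <NAME>", for example "Err#2 ENOENT".
truss_failures() {
    grep 'Err#[0-9]'
}

# Usage on a live system (truss writes to stderr unless -o is given):
#   truss -o /tmp/truss.out ./myprog
#   truss_failures < /tmp/truss.out
```

Not every failed call is a fault. Programs routinely probe for optional files, so the filtered list is a starting point for inspection rather than a list of bugs.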
An excerpt of the header file containing the main errors shown in the truss example is included here. This file can be examined online in the /usr/include/sys directory.
# cat /usr/include/sys/errno.h
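Rather than paging through the whole header, a single error name or number taken from a trace can be looked up directly. This is a rough sketch: errno_lookup is a hypothetical helper, it assumes entries of the form "#define NAME number /* text */", and it needs an awk that supports -v (nawk on older Solaris releases).

```shell
# Sketch: print the #define line for an error name or number.
# errno_lookup is a hypothetical helper; the header path defaults to the
# file named in the text and can be overridden with a second argument.
errno_lookup() {
    key=$1
    header=${2:-/usr/include/sys/errno.h}
    # Match either the symbolic name (field 2) or the numeric value (field 3).
    awk -v key="$key" '$1 == "#define" && ($2 == key || $3 == key)' "$header"
}

# Usage:
#   errno_lookup ENOENT
#   errno_lookup 2
```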