blob: dcbeec6a5b2fd38cd3eb0ebebda7c9374bfd5a1d [file] [log] [blame]
<html>
<head>
<title>Monitoring Jobs and Systems</title>
<link rel="stylesheet" type="text/css" href="help.css">
<script type="text/javascript" src="thumb.js"> </script>
<head>
<body>
<h1 id="top">Monitoring Jobs and Systems</h1>
<p>This section describes the features of PTP the enable the developer to monitor activity on target parallel machines, to monitor job status, and to control jobs.
It will cover the following topics:</p>
<ul>
<li><a href="#resMgr">Resource Managers view</a></li>
<li><a href="#output">Console view</a></li>
<li><a href="#persp">Parallel Runtime perspective</a></li>
<ul style="margin:0;">
<li><a href="#machines">Machines view</a></li>
<li><a href="#jobs">Jobs List view</a></li>
<li><a href="#legend">Icon Legend</a></li>
</ul>
<li><a href="#sysMonPersp">System Monitoring perspective</a> (new for 5.0)</li>
<ul style="margin:0;">
<li><a href="#sysMonView">System Monitor view</a></li>
<li><a href="#active">Active Jobs view</a></li>
<li><a href="#inactive">Inactive Jobs view</a></li>
</ul>
</ul>
<p>
PTP provides two perspectives for job and system monitoring: the <a href="#persp">Parallel Runtime perspective</a> and the
<a href="#sysMonPersp">System Monitoring perspective</a>. The Parallel Runtime perspective is the original perspective provided
by PTP and is used for most of the current resource managers. The System Monitoring perspective is new for PTP 5.0 and provides
scalable monitoring of large remote systems. It is currently only used by the PBS resource manager, but addition resource managers
will be transitioned over to this perspective in future releases.
</p>
<h2 id="resMgr">Resource Managers View</h2>
<p>The Resource Managers view shows all resource managers that have been configured, and is used to manage and control these resource managers.
This view is shared between the Parallel Runtime and System Monitoring perspectives.
Each resource manager has an icon and a name. The icon color indicates the current state of the resource manager. The following image shows two resource
managers, one that is <i>stopped</i> and one that is <i>running</i>. A stopped resource manager is know to the system, but is not providing any information
to PTP. A running resource manager is the normal state, and indicates that PTP is receiving information and can launch jobs using the resource manager.</p>
<table><tr>
<td><img src="images/05runtimeResMgrView.png"></td>
<td><img src="images/05ptpLegendResMgr.png"></td>
</tr></table>
<p>This view can also be used to create new resource managers, edit or remove resource managers, and control resource manager operation. Right-click in the
view to access these functions.</p>
<p><b>Note that if a resource manager is removed and re-added, the launch configurations using the original resource
manager must be changed to use the new one, even if it has the same name.</b></p>
<h2 id="output">Console View</h2>
Depending on the functionality of the resource manager, PTP can also display standard output and standard error from the parallel program in the
Console view. This view is also shared between the Parallel Runtime and System Monitoring perspectives.
Output is only displayed in this view if the <b>Display combined output in a console
view</b> option was selected in the job launch configuration (see <a href="03pLaunchConfig.html">Running Parallel Programs</a>). Output
from jobs visible in the System Monitoring perspective can also be shown by selecting the appropriate actions (more below).</p>
<p><img src="images/05ptpRuntimeConsoleView.png"></p>
<h2 id="persp">Parallel Runtime Perspective</h2>
<i>For use with all resource managers <b>except</b> PBS</i>
<p>
The Parallel Runtime perspective is used to monitor the status of target parallel systems and the parallel jobs that are running on these systems.
At least one resource manager must be active to see anything in the views. See <a href="02resMgrSetup.html">configuring resource managers</a> for
information on setting up resource managers and <a href="03pLaunchConfig.html">running parallel programs</a> for how to launch a parallel program.</p>
<p>The perspective provides two main views for monitoring systems and jobs: <a href="#machines">Machines view</a> and <a href="#jobs">Jobs List view</a>.
Each of these views will be discussed in more detail below.</p>
<p>To open the Parallel Runtime perspective,
select <b>Window &gt; Open Perspective &gt; Other ...</b> and choose <b>Parallel Runtime</b> from the list.</p>
<p><img src="images/05runtimePerspAnn.png"></p>
<h3 id="machines"><img src="images/05parallel_perspective.png">Machines View</h3>
The <b>Machines view</b> shows the status of all machines being controlled by running resource managers. The upper left-hand panel of this view shows a
collective list of all machines known by all the resource managers. A machine is represented by an icon and an address. The icon represents the state of the
machine, and the address is typically the hostname of the machine.</p>
<p>Selecting one of the machines in the upper left-hand panel will show the nodes of that machine in the upper right-hand panel of this view. Nodes are
represented by an icon only. The icon shows the state of the node. The following image shows a typical view:</p>
<p><img src="images/05runtimeMachinesView.png"></p>
<p>The left edge of the node panel displays the node number of the first node in the row. This is useful for quickly locating a particular node. Also, if
there are too many nodes to fit in the display the zoom buttons in the view toolbar can be used to zoom the display.</p>
<p>The machine and node icons indicate the state of each machine and node, as shown in the following image. There are icons representing most typical
states. There are also node states that indicate access to the nodes that could be controlled by a job scheduler (user exclusive, user shared, etc.)
These states are only used by certain types of resource managers.</p>
<p><img src="images/05ptpLegendMachines.png"></p>
<p>Placing the mouse over a node in the view will show information about that node, including the node number, in a tooltip popup.</p>
<p>Double-click on a node icon to display the more detailed information about the node in the lower two panes of the view.
The lower left-hand pane (Node Attributes) will show the detailed attributes of the node, and the lower right-hand pane (Process Info) will show the processes
that are running, or have recently run, on the node.</p>
<p><img src="images/05runtimeNodeDetails.png"></p>
<h3 id="jobs"><img src="images/05parallel_perspective.png">Jobs List View</h3>
This view shows the current status of jobs in the system. Pending, running, and completed jobs are shown. The actual jobs displayed in this
view are resource manager dependent, but will typically be the user's jobs that have been launched by PTP. Some resource managers may show jobs
that have been launched using other means, or jobs for all users on the system.</p>
<p><img src="images/05ptpRuntimeJobsView.png"></p>
<p>There are icons representing most job and process states. The following image shows the states that can be represented:</p>
<p><img src="images/05ptpLegendJobProcess.png"></p>
<p>Usually a job will terminate when it finishes executing. However the user can also terminate a job using the <b>terminate button</b> on the toolbar
of the Jobs List view.
This button will become enabled when a running job is selected. Clicking on the button will instruct the resource manager to terminate the job. If the
job is pending in a queue, then it will normally be removed from the queue.</p>
<h3 id="legend">Icon Legend</h3>
There are many different icons representing the state of the various components of the parallel system. If you need to identify a particular icon,
click on the legend icon <img src="images/05legendIcon.png"> in the toolbar. This will open a dialog that shows all the icons and their meanings.
An example is shown below.</p>
<p><img src="images/05ptpLegend.png"></p>
<h2 id="sysMonPersp">System Monitoring Perspective (new for 5.0)</h2>
<i>For use with PBS resource manager</i>
<p>
The System Monitoring perspective provides scalable job and system monitoring for large-scale systems. It is based on the
<a href="http://www2.fz-juelich.de/jsc/llview">LLview monitoring system</a>
but has been extended to support monitoring of any type of system. The System Monitoring perspective is currently only used for the PBS
resource manager, but will be extended to other resource managers in the future.
</p>
<p>
Like the Parallel Runtime perspective, at least one resource manager must be active to see anything in the views.
See <a href="02resMgrSetup.html">configuring resource managers</a> for
information on setting up resource managers and <a href="03pLaunchConfig.html">running parallel programs</a> for how to launch a parallel program.</p>
<p>To open the System Monitoring perspective,
select <b>Window &gt; Open Perspective &gt; Other ...</b> and choose <b>System Monitoring</b> from the list.</p>
<p><img src="images/05sysMonPersp.png"></p>
<p>The perspective provides four main views for monitoring systems and jobs: <b><a href="#resMgr">Resource Managers view</a></b>, <b>System Monitor view</b>,
<b>Active Jobs view</b>,
and <b>Inactive Jobs view</b>. The Resource Managers view is the same view used for the Parallel Runtime perspective. Each of the other
views will be discussed in more detail below.</p>
<h3 id="sysMonView">System Monitor View</h3>
The System Monitor view provides an overall view of the activity on the target remote system. The view tab will contain the name of the remote system
(as provided by the resource manager). The layout of this view will depend on the configuration of
the target system, but will generally consist of a number of boxes that represent aggregations of computing resources (such as racks). These boxes may in turn
contain other elements that represent the computing resources (such as nodes). The color of the boxes is used to indicate which jobs are running on the nodes.
<p>
The view currently supports two mouse actions. Hovering over an element in the view will display a tooltip box with information about that element, including
which jobs are associated with the element. Clicking on an element will highlight all associated elements (those with the same color) in the display. This
shows the user where a particular job is running on the system.
<p><img src="images/05sysMonView1.png"></p>
<h3 id="active">Active Jobs View</h3>
The Active Jobs view shows a list of all the jobs that are running on the system. The exact columns displayed in the view will depend on the capabilities of
the remote system. Each job in the table is assigned a color in the first column. This color corresponds to the colors displayed in the System Monitor view.
Clicking on a row in the table will highlight the row and also the location of the jobs in the System Monitor view.
<p><img src="images/05activeJobs1.png"></p>
<p>
Job actions are available by right-clicking on a job in the view. The actions available will depend on the type of job, its state, and the job owner.
<p><img src="images/05activeJobs3.png"></p>
<p>
Rows in the table can be sorted by clicking on the column heading. This will cycle though a sort sequence of "ascending", "descending", and none.
Columns can also be removed from the view by right-clicking on the column heading and unselecting the column name.
<p><img src="images/05activeJobs2.png"></p>
<h3 id="inactive">Inactive Jobs View</h3>
The Inactive Jobs view is essentially the same as the Active Jobs view, but displays jobs that are not currently running on the system. As these jobs don't
have associated nodes in the System Monitor view, they are not assigned a color.
<p>
Jobs that are launched by the user will initially appear in the Inactive Jobs view with status SUBMITTED. These jobs can be controlled (e.g. canceled)
by right clicking on the job and selecting an available action. <b>Refresh Job Status</b> can be used to get an immediate update of the job status rather
than waiting for the next update.
<p><img src="images/05inactiveJobs1.png"></p>
<p>
Once a job has finished executing, it will appear in this view with status COMPLETED. The stdout and stderr from the job can be displayed in
the <a href="#output">Console view</a> by right clicking on the job and
selecting the appropriate action. Completed jobs will remain in the view between Eclipse sessions, so you can leave Eclipse and return at a later time without
losing information about the jobs. If you wish to remove the job from the view (permanently), use the <b>Remove Job Entry</b> action.
<p><img src="images/05inactiveJobs2.png"></p>
<p><a href="#top">Back to Top</a> | <a href="toc.html">Back to Table of Contents</a>
</body>
</html>