<header>MON Help on Service Definitions</header>
<p>This is the second and last stage of MON configuration.
<p>Default values are shown for the mandatory services (marked in red). See the respective help topic below for more help on the service definitions.
<p>For <b>"mail.alert"</b>, ensure that sendmail is configured and that the <b>"sendmail"</b> daemon is started on the host machine.

<H3>Service Definitions</H3>

<P>
<DL COMPACT>
<DT><B>service</B><I> servicename</I>
<DD>
A service definition begins with the keyword
<B>service</B>
followed by a word which is the tag for this service.
<P>
The components of a service are an interval, a monitor, and
one or more time period definitions, as defined below.
<P>
If a service name of &quot;default&quot; is defined within a watch
group called &quot;default&quot; (see above), then the default/default
definition will be used for handling unknown mon traps.
<P>
<DT><B>interval</B><I> timeval</I>
<DD>
The keyword
<B>interval</B>
followed by a time value specifies how often
a monitor script will be triggered.
Time values are written as &quot;30s&quot;, &quot;5m&quot;, &quot;1h&quot;, or &quot;1d&quot;,
meaning 30 seconds, 5 minutes, 1 hour, or 1 day. The numeric portion
may be a fraction, such as &quot;1.5h&quot; for an hour and a half. This
format of a time specification will be referred to as
<I>timeval</I>.
<P>
<DT><B>traptimeout</B><I> timeval</I>
<DD>
This keyword takes the same time specification argument as
<B>interval</B>,
and makes the service expect a trap from an external source
at least that often; otherwise a failure will be registered. This is
used for a heartbeat-style service.
<P>
<DT><B>trapduration</B><I> timeval</I>
<DD>
If a trap is received, the status of the service the trap was delivered
to will normally remain constant. If
<B>trapduration</B>
is specified, the status of the service will remain in a failure
state for the duration specified by
<I>timeval</I>,
and then it will be reset to &quot;success&quot;.
<P>
<DT><B>randskew</B><I> timeval</I>
<DD>
Rather than schedule the monitor script to run at the start of each
interval, randomly adjust the interval specified by the
<B>interval</B>
parameter by plus or minus
<B>randskew</B>.
The skew value is specified in the same way as the
<B>interval</B>
parameter: &quot;30s&quot;, &quot;5m&quot;, etc.
For example, if
<B>interval</B>
is &quot;1m&quot; and
<B>randskew</B>
is &quot;5s&quot;, then
<I>mon</I>
will schedule the monitor script to run every
55 to 65 seconds.
The intent is to help distribute the load on the server when
many services are scheduled at the same intervals.
<P>
<DT><B>monitor</B><I> monitor-name [arg...]</I>
<DD>
The keyword
<B>monitor</B>
followed by a script name and arguments
specifies the monitor to run when the timer
expires. Shell-like quoting conventions are
followed when specifying the arguments to send
to the monitor script.
The script is invoked from the directory
given with the
<B>-s</B>
argument, and all following words are supplied
as arguments to the monitor program, followed by the
list of hosts in the group referred to by the current watch group.
If the monitor line ends with &quot;;;&quot; as a separate word,
the host groups are not appended to the argument list
when the program is invoked.
<P>
<DT><B>allow_empty_group</B>
<DD>
The
<B>allow_empty_group</B>
option will allow a monitor to be invoked even when the
hostgroup for that watch is empty because of
disabled hosts. The default behavior is not
to invoke the monitor when all hosts in a hostgroup
have been disabled.
<P>
<DT><B>description</B><I> descriptiontext</I>
<DD>
The text following
<B>description</B>
can be queried by client programs, and is passed to alerts and monitors via an
environment variable. It should contain a brief description of the
service, suitable for inclusion in an email or on a web page.
<P>
<DT><B>exclude_hosts</B><I> host [host...]</I>
<DD>
Any hosts listed after
<B>exclude_hosts</B>
will be excluded from the service check.
<P>
<DT><B>exclude_period</B><I> periodspec</I>
<DD>
Do not run a scheduled monitor during the time
identified by
<I>periodspec</I>.
<P>
<DT><B>depend</B><I> dependexpression</I>
<DD>
The
<B>depend</B>
keyword is used to specify a dependency expression, which
evaluates to either true or false, in the boolean sense.
Dependencies are actual Perl expressions, and must obey all syntactical
rules. The expressions are evaluated in their own package space so that
they do not accidentally cause unwanted side effects.
If a syntax error is found when evaluating the expression, it
is logged via syslog.
<P>
Before evaluation, the following substitutions on the expression occur:
phrases which look like &quot;group:service&quot; are substituted with the value
of the current operational status of that specified service. These
opstatus substitutions are computed recursively, so if service A
depends upon service B, and service B depends upon service C, then
service A depends upon service C. Successful operational statuses (which
evaluate to &quot;1&quot;) are &quot;STAT_OK&quot;, &quot;STAT_COLDSTART&quot;, &quot;STAT_WARMSTART&quot;, and
&quot;STAT_UNKNOWN&quot;.  The word &quot;SELF&quot; (in all caps) can be used for the group
(e.g. &quot;SELF:service&quot;), and is an abbreviation for the current watch group.
<P>
This feature can be used to control alerts for services which are
dependent on other services, e.g. an SMTP test which is dependent upon
the machine being ping-reachable.
<P>
<DT><B>dep_behavior</B><I> {a|m}</I>
<DD>
The evaluation of dependency graphs
can control the
suppression of either alert or monitor invocations.
<P>
<B>Alert suppression</B>.
If this option is set to &quot;a&quot;,
then the dependency expression
will be evaluated after the
monitor for the service exits or
after a trap is received.
An alert will only be sent
if the evaluation succeeds, meaning
that none of the nodes in the dependency
graph indicate failure.
<P>
<B>Monitor suppression</B>.
If it is set to &quot;m&quot;,
then the dependency expression will be evaluated
just before the monitor for the service is run.
If the evaluation succeeds, then the monitor
will be run. Otherwise, the monitor will not
be run and the status of the service will remain
the same.
<P>
</DL>
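<P>
The <I>timeval</I> format and the <B>randskew</B> arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code <I>mon</I> itself uses: the helper names <I>parse_timeval</I> and <I>skewed_interval</I> are hypothetical, the accepted syntax is limited to the forms shown in this help text, and a uniform random distribution is assumed for the skew.

```python
import random
import re

# Unit suffixes named in the help text: seconds, minutes, hours, days.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

# Assumption: a timeval is a non-negative number (optionally fractional)
# followed by exactly one unit letter, e.g. "30s" or "1.5h".
_TIMEVAL_RE = re.compile(r"(\d+(?:\.\d+)?)([smhd])")


def parse_timeval(text):
    """Convert a timeval such as "30s" or "1.5h" to seconds (hypothetical helper)."""
    match = _TIMEVAL_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a timeval: {text!r}")
    number, unit = match.groups()
    return float(number) * _UNIT_SECONDS[unit]


def skewed_interval(interval, randskew, rng=random):
    """Return the interval in seconds, adjusted by plus or minus randskew.

    Assumption: the adjustment is drawn uniformly from [-randskew, +randskew].
    """
    base = parse_timeval(interval)
    skew = parse_timeval(randskew)
    return base + rng.uniform(-skew, skew)


if __name__ == "__main__":
    print(parse_timeval("1.5h"))
    print(skewed_interval("1m", "5s"))
```

With <B>interval</B> &quot;1m&quot; and <B>randskew</B> &quot;5s&quot;, every value returned by the sketch falls between 55 and 65 seconds, matching the example given for <B>randskew</B>.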
<A NAME="lbAO">&nbsp;</A>
<H3>Period Definitions</H3>

<P>
Periods are used to define the conditions which
should allow alerts
to be delivered.
<P>
<DL COMPACT>
<DT><B>period</B><I> [label:] periodspec</I>
<DD>
A period groups one or more alarms and variables
which control how often an alert happens when there
is a failure.
The
<B>period</B>
keyword has two forms. The first
takes an argument which is a
period specification from Patrick Ryan's
Time::Period Perl 5 module. Refer to
&quot;perldoc Time::Period&quot; for more information.
<P>
The second form requires a label followed by a period specification, as
defined above. The label is a tag consisting of an alphabetic character
or underscore followed by zero or more alphanumerics or underscores
and ending with a colon. This
form allows multiple periods with the same period definition. One use
is to have a period definition which has no
<B>alertafter</B>
or
<B>alertevery</B>
parameters for a particular time period, and another
for the same time period with a different
set of alerts that does contain those
parameters.
<P>
<DT><B>alertevery</B><I> timeval</I>
<DD>
The
<B>alertevery</B>
keyword (within a
<B>period</B>
definition) takes the same type of argument as the
<B>interval</B>
variable, and limits the number of times an alert
is sent when the service continues to fail.
For example, if the value is &quot;1h&quot;, then
the alerts in the period section will
be triggered only once every hour. If the
<B>alertevery</B>
keyword is
omitted in a period entry, an alert will be sent
out every time a failure is detected. By default,
if the output of two successive failures changes,
then the alertevery interval is overridden.
If the word
&quot;summary&quot; is the last argument, then only the summary
output lines will be considered when comparing the
output of successive failures.
<P>
<DT><B>alertafter</B><I> num</I>
<DD>
<P>
<DT><B>alertafter</B><I> num timeval</I>
<DD>
The
<B>alertafter</B>
keyword (within a
<B>period</B>
section) has two forms: with only the &quot;num&quot;
argument, or with the &quot;num timeval&quot; arguments.
In the first form, an alert will only be invoked
after &quot;num&quot; consecutive failures.
<P>
In the second form,
the arguments are a positive integer followed by an interval,
as described by the
<B>interval</B>
variable above.
If these parameters are specified,
then the alerts for that period will only
be called after that many failures happen
within that interval. For example,
if
<B>alertafter</B>
is given the arguments &quot;3&nbsp;30m&quot;, then the alert will be called
if 3 failures happen within 30 minutes.
<P>
<DT><B>numalerts</B><I> num</I>
<DD>
<P>
This variable tells the server to call no more than
<I>num</I>
alerts during a
failure. The alert counter is kept on a per-period basis,
and is reset upon each success.
<P>
<DT><B>comp_alerts</B>
<DD>
<P>
If this option is specified, then upalerts will only be
called if a corresponding &quot;down&quot; alert has been called.
<P>
<DT><B>alert</B><I> alert [arg...]</I>
<DD>
A period may contain multiple alerts, which are triggered
upon failure of the service. An alert is specified with
the
<B>alert</B>
keyword, followed by an optional
<B>exit</B>
parameter, and arguments which are interpreted the same as
the
<B>monitor</B>
definition, but without the &quot;;;&quot; exception. The
<B>exit</B>
parameter takes the form of
<B>exit=x</B>
or
<B>exit=x-y</B>
and has the effect that the alert is only called if the
exit status of the monitor script falls within the range
of the
<B>exit</B>
parameter. If, for example, the alert line is
<I>alert exit=10-20 mail.alert mis</I>
then
<I>mail.alert</I>
will only be invoked with
<I>mis</I>
as its argument if the monitor
program's exit value is between 10 and 20. This feature
allows you to trigger different alerts at different
severity levels (for example, when free disk space drops from 8% to 3%).
<P>
See the
<B>ALERT PROGRAMS</B>
section above for a list of the parameters mon will pass
automatically to alert programs.
<P>
<DT><B>upalert</B><I> alert [arg...]</I>
<DD>
An
<B>upalert</B>
is the complement of an
<B>alert</B>.
An upalert is called when a service makes the state transition from
failure to success. The
<B>upalert</B>
script is called with
the same parameters as the
<B>alert</B>
script, with the addition of the
<B>-u</B>
parameter, which is simply used to let
an alert script know that it is being called
as an upalert. Multiple upalerts may be
specified for each period definition.
Please note that the default behavior is that
an upalert will be sent
regardless of whether any prior &quot;down&quot; alerts
were sent, since upalerts are triggered on a state
transition. Set the per-period
<B>comp_alerts</B>
option to pair upalerts with &quot;down&quot; alerts.
<P>
<DT><B>startupalert</B><I> alert [arg...]</I>
<DD>
A
<B>startupalert</B>
is only called when the
<B>mon</B>
server starts execution.
<P>
<DT><B>upalertafter</B><I> timeval</I>
<DD>
The
<B>upalertafter</B>
parameter is specified as a string that
follows the syntax of the
<B>interval</B>
parameter (&quot;30s&quot;, &quot;1m&quot;, etc.), and
controls the triggering of an
<B>upalert</B>.
If a service comes back up after
being down for a time greater than
or equal to the value of this option, an
<B>upalert</B>
will be called. Use this option to prevent
upalerts from being called because of &quot;blips&quot; (brief outages).
<P>
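Three of the period rules described in this section can be sketched in Python: the label syntax for <B>period</B>, the <B>exit=x</B> and <B>exit=x-y</B> range check for <B>alert</B>, and the &quot;num timeval&quot; form of <B>alertafter</B>. This is an illustrative sketch only, not how <I>mon</I> is implemented. The names <I>is_period_label</I>, <I>exit_matches</I> and <I>AlertAfterWindow</I> are hypothetical, the exit range is assumed to be inclusive at both ends, and the sketch does not model what happens to the failure count when the service succeeds.

```python
import re
from collections import deque

# Label syntax from the period description: an alphabetic character or
# underscore, then zero or more alphanumerics or underscores, then a colon.
_LABEL_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*:")


def is_period_label(word):
    """Return True if word looks like a period label such as "wd:"."""
    return _LABEL_RE.fullmatch(word) is not None


def exit_matches(spec, status):
    """Check a monitor exit status against "exit=x" or "exit=x-y".

    Assumption: both ends of the range are inclusive.
    """
    bounds = spec.split("=", 1)[1].split("-")
    low = int(bounds[0])
    high = int(bounds[-1])  # same as low for the single-value form
    return low <= status <= high


class AlertAfterWindow:
    """Sketch of "alertafter num timeval": num failures within a time window."""

    def __init__(self, num, window_seconds):
        self.num = num
        self.window = window_seconds
        self.failures = deque()  # timestamps of recent failures, oldest first

    def record_failure(self, now):
        """Record a failure at time now (seconds); return True if alerts may fire."""
        self.failures.append(now)
        # Drop failures that are older than the window.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        return len(self.failures) >= self.num


if __name__ == "__main__":
    window = AlertAfterWindow(3, 30 * 60)  # the "3 30m" example
    for timestamp in (0, 600, 1200):
        print(timestamp, window.record_failure(timestamp))
    print(exit_matches("exit=10-20", 15))
```

In the &quot;3&nbsp;30m&quot; case, failures at 0, 10 and 20 minutes allow the alert on the third failure, while failures spaced more than 15 minutes apart never put three inside one 30-minute window.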