<header>MON Help on Service Definitions</header>
<p>This is the second and last stage of MON configuration.
<p>Default values are shown for the mandatory services (marked in red). See the respective help topic below for more help on the service definitions.
<p>For <b>"mail.alert"</b>, ensure that sendmail is configured and that the <b>"sendmail"</b> daemon is started on the host machine.

<H3>Service Definitions</H3>

<P>
<DL COMPACT>
<DT><B>service</B><I> servicename</I>

<DD>
A service definition begins with the keyword
<B>service</B>

followed by a word which is the tag for this service.
<P>
The components of a service are an interval, a monitor, and
one or more time period definitions, as defined below.
<P>
If a service name of "default" is defined within a watch
group called "default" (see above), then the default/default
definition will be used for handling unknown mon traps.
<P>
<DT><B>interval</B><I> timeval</I>

<DD>
The keyword
<B>interval</B>

followed by a time value specifies how often
a monitor script will be triggered.
Time values are defined as "30s", "5m", "1h", or "1d",
meaning 30 seconds, 5 minutes, 1 hour, or 1 day. The numeric portion
may be a fraction, such as "1.5h" for an hour and a half. This
format of a time specification will be referred to as
<I>timeval</I>.

<P>
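The timeval format described above can be sketched as a small parser. This is an illustrative Python sketch, not mon's own code; the function name <I>parse_timeval</I> and the exact validation rules are assumptions made for the example.

```python
import re

# Seconds per unit suffix, following the units listed in the text above.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_timeval(text: str) -> float:
    """Convert a timeval such as "30s" or "1.5h" into seconds.

    Hypothetical helper: accepts a number (optionally fractional)
    followed by one of the suffixes s, m, h, or d.
    """
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"not a timeval: {text!r}")
    number, unit = match.groups()
    return float(number) * _UNITS[unit]
```

For instance, "1.5h" corresponds to 5400 seconds under this reading.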
<DT><B>traptimeout</B><I> timeval</I>

<DD>
This keyword takes the same time specification argument as
<B>interval</B>,

and makes the service expect a trap from an external source
at least that often; otherwise a failure will be registered. This is
used for a heartbeat-style service.
<P>
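The heartbeat check implied by <B>traptimeout</B> can be pictured as follows. This is a minimal Python sketch under stated assumptions: times are plain seconds, and the name <I>trap_overdue</I> is hypothetical rather than anything mon exposes.

```python
def trap_overdue(last_trap_time: float, now: float, traptimeout: float) -> bool:
    """Return True when no trap has arrived within the traptimeout window.

    All arguments are in seconds. A True result is the condition under
    which a heartbeat-style service would register a failure.
    """
    return (now - last_trap_time) > traptimeout
```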
<DT><B>trapduration</B><I> timeval</I>

<DD>
If a trap is received, the status of the service the trap was delivered
to will normally remain constant. If
<B>trapduration</B>

is specified, the status of the service will remain in a failure
state for the duration specified by
<I>timeval</I>,

and then it will be reset to "success".
<P>
<DT><B>randskew</B><I> timeval</I>

<DD>
Rather than schedule the monitor script to run at the start of each
interval, randomly adjust the interval specified by the
<B>interval</B>

parameter by plus-or-minus
<B>randskew</B>.

The skew value is specified in the same format as the
<B>interval</B>

parameter: "30s", "5m", etc.
For example, if
<B>interval</B>

is "1m" and
<B>randskew</B>

is "5s", then
<I>mon</I>

will schedule the monitor script to run every
55 to 65 seconds.
The intent is to help distribute the load on the server when
many services are scheduled at the same intervals.
<P>
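The effect of <B>randskew</B> on scheduling can be sketched in a few lines of Python. The uniform distribution and the function name <I>skewed_delay</I> are assumptions made for illustration; the text above only states that the interval is adjusted by plus-or-minus the skew.

```python
import random


def skewed_delay(interval: float, randskew: float, rng: random.Random) -> float:
    """Pick the next delay from interval +/- randskew, in seconds.

    Assumes a uniform draw over the allowed range; the real scheduler
    may choose differently.
    """
    return interval + rng.uniform(-randskew, randskew)


# With interval "1m" (60 s) and randskew "5s", delays fall in [55, 65].
rng = random.Random(0)
delays = [skewed_delay(60.0, 5.0, rng) for _ in range(5)]
```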
<DT><B>monitor</B><I> monitor-name [arg...]</I>

<DD>
The keyword
<B>monitor</B>

followed by a script name and arguments
specifies the monitor to run when the timer
expires. Shell-like quoting conventions are
followed when specifying the arguments to send
to the monitor script.
The script is invoked from the directory
given with the
<B>-s</B>

argument, and all following words are supplied
as arguments to the monitor program, followed by the
list of hosts in the group referred to by the current watch group.
If the monitor line ends with ";;" as a separate word,
the host groups are not appended to the argument list
when the program is invoked.
<P>
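The argument handling described above (shell-like quoting, hosts appended unless the line ends with ";;") can be approximated with Python's standard <I>shlex</I> module. This is a sketch only: <I>build_monitor_argv</I> and the monitor name "example.monitor" are hypothetical, and mon's own quoting rules may differ in detail from <I>shlex</I>.

```python
import shlex


def build_monitor_argv(monitor_line: str, hosts: list[str]) -> list[str]:
    """Build the argument vector for a monitor invocation.

    monitor_line is the text after the "monitor" keyword. Hosts from the
    watch group are appended unless the line ends with ";;" as its own word.
    """
    words = shlex.split(monitor_line)
    if words and words[-1] == ";;":
        return words[:-1]  # drop the marker, do not append hosts
    return words + hosts


argv = build_monitor_argv('example.monitor -m "two words"', ["host1", "host2"])
# argv == ['example.monitor', '-m', 'two words', 'host1', 'host2']
```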
<DT><B>allow_empty_group</B>

<DD>
The
<B>allow_empty_group</B>

option will allow a monitor to be invoked even when the
hostgroup for that watch is empty because of
disabled hosts. The default behavior is not
to invoke the monitor when all hosts in a hostgroup
have been disabled.
<P>
<DT><B>description</B><I> descriptiontext</I>

<DD>
The text following
<B>description</B>

can be queried by client programs and is passed to alerts and monitors via an
environment variable. It should contain a brief description of the
service, suitable for inclusion in an email or on a web page.
<P>
<DT><B>exclude_hosts</B><I> host [host...]</I>

<DD>
Any hosts listed after
<B>exclude_hosts</B>

will be excluded from the service check.
<P>
<DT><B>exclude_period</B><I> periodspec</I>

<DD>
Do not run a scheduled monitor during the time
identified by
<I>periodspec</I>.

<P>
<DT><B>depend</B><I> dependexpression</I>

<DD>
The
<B>depend</B>

keyword is used to specify a dependency expression, which
evaluates to either true or false, in the boolean sense.
Dependencies are actual Perl expressions, and must obey all syntactical
rules. The expressions are evaluated in their own package space
to avoid unwanted side effects.
If a syntax error is found when evaluating the expression, it
is logged via syslog.
<P>
Before evaluation, the following substitutions on the expression occur:
phrases which look like "group:service" are substituted with the value
of the current operational status of that specified service. These
opstatus substitutions are computed recursively, so if service A
depends upon service B, and service B depends upon service C, then
service A depends upon service C. Successful operational statuses (which
evaluate to "1") are "STAT_OK", "STAT_COLDSTART", "STAT_WARMSTART", and
"STAT_UNKNOWN". The word "SELF" (in all caps) can be used for the group
(e.g. "SELF:service"), and is an abbreviation for the current watch group.
<P>
This feature can be used to control alerts for services which are
dependent on other services, e.g. an SMTP test which is dependent upon
the machine being ping-reachable.
<P>
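The substitution step that precedes evaluation can be illustrated in Python. This sketch covers only the textual replacement of "group:service" phrases; it does not evaluate the resulting Perl expression and does not follow dependencies recursively. The function name, the token pattern, the status name "STAT_FAIL" used in the example, and the choice of "0" for non-successful statuses are all assumptions.

```python
import re

# Statuses the text above lists as successful (they evaluate to "1").
OK_STATUSES = {"STAT_OK", "STAT_COLDSTART", "STAT_WARMSTART", "STAT_UNKNOWN"}


def substitute_opstatus(expression: str, current_group: str,
                        opstatus: dict[tuple[str, str], str]) -> str:
    """Replace each "group:service" phrase with "1" or "0".

    "SELF" is expanded to current_group. Any status outside OK_STATUSES
    is assumed to map to "0".
    """
    def replace(match: re.Match) -> str:
        group, service = match.group(1), match.group(2)
        if group == "SELF":
            group = current_group
        return "1" if opstatus[(group, service)] in OK_STATUSES else "0"

    return re.sub(r"\b([\w.-]+):([\w.-]+)\b", replace, expression)


status = {("servers", "ping"): "STAT_OK", ("routers", "ping"): "STAT_FAIL"}
result = substitute_opstatus("SELF:ping && routers:ping", "servers", status)
# result == "1 && 0"
```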
<DT><B>dep_behavior</B><I> {a|m}</I>

<DD>
The evaluation of dependency graphs
can control the
suppression of either alert or monitor invocations.
<P>
<B>Alert suppression</B>.

If this option is set to "a",
then the dependency expression
will be evaluated after the
monitor for the service exits or
after a trap is received.
An alert will only be sent
if the evaluation succeeds, meaning
that none of the nodes in the dependency
graph indicates failure.
<P>
<B>Monitor suppression</B>.

If it is set to "m",
then the dependency expression will be evaluated
before the monitor for the service is about to run.
If the evaluation succeeds, then the monitor
will be run. Otherwise, the monitor will not
be run and the status of the service will remain
the same.
<P>
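The two modes can be summarized as a pair of small predicates. This is an illustrative Python sketch; the function names are hypothetical, and <I>dependencies_ok</I> stands for the result of evaluating the dependency expression.

```python
def monitor_may_run(dep_behavior: str, dependencies_ok: bool) -> bool:
    """With "m", a failed dependency evaluation suppresses the monitor."""
    return dep_behavior != "m" or dependencies_ok


def alert_may_be_sent(dep_behavior: str, dependencies_ok: bool) -> bool:
    """With "a", a failed dependency evaluation suppresses the alert."""
    return dep_behavior != "a" or dependencies_ok
```

Under this reading, "a" never stops the monitor from running, and "m" never blocks an alert on its own, since the monitor that would trigger it is simply not run.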
</DL>
<A NAME="lbAO"> </A>
<H3>Period Definitions</H3>

<P>
Periods are used to define the conditions which
should allow alerts
to be delivered.
<P>
<DL COMPACT>
<DT><B>period</B><I> [label:] periodspec</I>

<DD>
A period groups one or more alarms and variables
which control how often an alert happens when there
is a failure.
The
<B>period</B>

keyword has two forms. The first
takes an argument which is a
period specification from Patrick Ryan's
Time::Period Perl 5 module. Refer to
"perldoc Time::Period" for more information.
<P>
The second form requires a label followed by a period specification, as
defined above. The label is a tag consisting of an alphabetic character
or underscore followed by zero or more alphanumerics or underscores
and ending with a colon. This
form allows multiple periods with the same period definition. One use
is to have a period definition which has no
<B>alertafter</B>

or
<B>alertevery</B>

parameters for a particular time period, and another
for the same time period with a different
set of alerts that does contain those
parameters.
<P>
<DT><B>alertevery</B><I> timeval</I>

<DD>
The
<B>alertevery</B>

keyword (within a
<B>period</B>

definition) takes the same type of argument as the
<B>interval</B>

variable, and limits the number of times an alert
is sent when the service continues to fail.
For example, if the value is "1h", then
the alerts in the period section will be
triggered only once every hour. If the
<B>alertevery</B>

keyword is
omitted in a period entry, an alert will be sent
out every time a failure is detected. By default,
if the output of two successive failures changes,
then the alertevery interval is overridden.
If the word
"summary" is the last argument, then only the summary
output lines will be considered when comparing the
output of successive failures.
<P>
<DT><B>alertafter</B><I> num</I>

<DT><B>alertafter</B><I> num timeval</I>

<DD>
The
<B>alertafter</B>

keyword (within a
<B>period</B>

section) has two forms: with only the "num"
argument, or with the "num timeval" arguments.
In the first form, an alert will only be invoked
after "num" consecutive failures.
<P>
In the second form,
the arguments are a positive integer followed by an interval,
as described by the
<B>interval</B>

variable above.
If these parameters are specified,
then the alerts for that period will only
be called after that many failures happen
within that interval. For example,
if
<B>alertafter</B>

is given the arguments "3 30m", then the alert will be called
if 3 failures happen within 30 minutes.
<P>
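The two forms of <B>alertafter</B> can be illustrated with a short Python sketch. The function names are hypothetical, times are in seconds, and counting failures whose age is at most <I>timeval</I> is an assumption about how the window boundary is treated.

```python
def alertafter_count(consecutive_failures: int, num: int) -> bool:
    """First form: alert only after num consecutive failures."""
    return consecutive_failures >= num


def alertafter_window(failure_times: list[float], now: float,
                      num: int, timeval: float) -> bool:
    """Second form: alert once num failures fall within the last timeval seconds."""
    recent = [t for t in failure_times if now - t <= timeval]
    return len(recent) >= num


# "3 30m": three failures at 0, 10, and 25 minutes fall inside one 30-minute window.
fires = alertafter_window([0.0, 600.0, 1500.0], now=1500.0, num=3, timeval=1800.0)
```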
<DT><B>numalerts</B><I> num</I>

<DD>
<P>
This variable tells the server to call no more than
<I>num</I>

alerts during a
failure. The alert counter is kept on a per-period basis,
and is reset upon each success.
<P>
<DT><B>comp_alerts</B>

<DD>
<P>
If this option is specified, then upalerts will only be
called if a corresponding "down" alert has been called.
<P>
<DT><B>alert</B><I> alert [arg...]</I>

<DD>
A period may contain multiple alerts, which are triggered
upon failure of the service. An alert is specified with
the
<B>alert</B>

keyword, followed by an optional
<B>exit</B>

parameter, and arguments which are interpreted the same as
the
<B>monitor</B>

definition, but without the ";;" exception. The
<B>exit</B>

parameter takes the form of
<B>exit=x</B>

or
<B>exit=x-y</B>

and has the effect that the alert is only called if the
exit status of the monitor script falls within the range
of the
<B>exit</B>

parameter. If, for example, the alert line is
<I>alert exit=10-20 mail.alert mis</I>

then
<I>mail.alert</I>

will only be invoked with
<I>mis</I>

as its argument if the monitor
program's exit value is between 10 and 20. This feature
allows you to trigger different alerts at different
severity levels (like when free disk space goes from 8% to 3%).
<P>
See the
<B>ALERT PROGRAMS</B>

section above for a list of the parameters mon will pass
automatically to alert programs.
<P>
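The <B>exit</B> parameter check can be sketched in Python as follows. The function name <I>exit_matches</I> is hypothetical, and treating both ends of the range as inclusive is an assumption based on the "between 10 and 20" wording above.

```python
import re


def exit_matches(exit_param: str, status: int) -> bool:
    """Check a monitor exit status against "exit=x" or "exit=x-y".

    Both bounds are assumed to be inclusive.
    """
    match = re.fullmatch(r"exit=(\d+)(?:-(\d+))?", exit_param)
    if match is None:
        raise ValueError(f"not an exit parameter: {exit_param!r}")
    low = int(match.group(1))
    high = int(match.group(2)) if match.group(2) is not None else low
    return low <= status <= high
```

With "exit=10-20", an exit status of 15 would trigger the alert and an exit status of 3 would not.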
<DT><B>upalert</B><I> alert [arg...]</I>

<DD>
An
<B>upalert</B>

is the complement of an
<B>alert</B>.

An upalert is called when a service makes the state transition from
failure to success. The
<B>upalert</B>

script is called with
the same parameters as the
<B>alert</B>

script, with the addition of the
<B>-u</B>

parameter, which simply lets
an alert script know that it is being called
as an upalert. Multiple upalerts may be
specified for each period definition.
Please note that the default behavior is that
an upalert will be sent
regardless of whether any prior "down" alerts
were sent, since upalerts are triggered on a state
transition. Set the per-period
<B>comp_alerts</B>

option to pair upalerts with "down" alerts.
<P>
<DT><B>startupalert</B><I> alert [arg...]</I>

<DD>
A
<B>startupalert</B>

is only called when the
<B>mon</B>

server starts execution.
<P>
<DT><B>upalertafter</B><I> timeval</I>

<DD>
The
<B>upalertafter</B>

parameter is specified as a string that
follows the syntax of the
<B>interval</B>

parameter ("30s", "1m", etc.), and
controls the triggering of an
<B>upalert</B>.

If a service comes back up after
being down for a time greater than
or equal to the value of this option, an
<B>upalert</B>

will be called. Use this option to prevent
upalerts from being called because of "blips" (brief outages).
<P>
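The interaction between <B>upalertafter</B> and <B>comp_alerts</B> can be sketched as one predicate. This Python example is an illustration under stated assumptions: the function name is hypothetical, durations are in seconds, and combining the two options in this way is inferred from their separate descriptions above.

```python
def should_send_upalert(downtime: float, upalertafter: float = 0.0,
                        comp_alerts: bool = False,
                        down_alert_sent: bool = False) -> bool:
    """Decide whether a failure-to-success transition triggers an upalert."""
    if downtime < upalertafter:
        return False  # outage was a "blip", shorter than upalertafter
    if comp_alerts and not down_alert_sent:
        return False  # comp_alerts pairs upalerts with prior "down" alerts
    return True
```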