Essential Network Monitoring Metrics

Bandwidth Utilization

Bandwidth utilization shows how much of your link capacity is in use. It's the most commonly tracked metric, but it is often misunderstood.

Inbound (Rx)

Traffic coming into the interface. For uplinks, this is typically download traffic from the internet or upstream network.

Outbound (Tx)

Traffic leaving the interface. For server connections, this is often higher than inbound due to response data.
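Utilization is usually derived from interface octet counters rather than read directly. The following is a minimal sketch, assuming two samples of an octet counter (for example the SNMP ifHCInOctets or ifHCOutOctets objects) taken a known number of seconds apart. The function name and parameters are illustrative, not part of any library.

```python
def utilization_percent(octets_prev, octets_now, interval_s, link_bps, counter_bits=64):
    """Percent of link capacity used between two octet-counter samples.

    Hypothetical helper: link_bps is the link speed in bits per second,
    counter_bits is the width of the counter being sampled.
    """
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    delta = octets_now - octets_prev
    if delta < 0:
        # Counter wrapped past its maximum; assume exactly one wrap.
        delta += 2 ** counter_bits
    bits = delta * 8  # octets to bits
    return 100.0 * bits / (interval_s * link_bps)
```

Run it once with the Rx counter and once with the Tx counter, since each direction of a full-duplex link has its own capacity.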

Alert Thresholds

Warning: 70% utilization sustained for 5 minutes
Critical: 90% utilization sustained for 2 minutes

Don't alert on brief spikes. A link hitting 100% for 30 seconds during a backup window is normal. Sustained high utilization indicates capacity problems.
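One way to express "sustained for N minutes" is to require every recent sample in the window to be at or above the threshold. This sketch assumes utilization samples arrive at a fixed polling interval (30 seconds here, an arbitrary choice) and uses the warning and critical values listed above. The function names are hypothetical.

```python
def sustained_breach(samples, threshold, duration_s, interval_s):
    """True if the newest samples stay at or above threshold for duration_s."""
    needed = max(1, -(-duration_s // interval_s))  # ceiling division
    if len(samples) < needed:
        return False  # not enough history to call it sustained
    return all(s >= threshold for s in samples[-needed:])


def classify_utilization(samples, interval_s=30):
    """Map a list of utilization percentages (oldest first) to a severity."""
    if sustained_breach(samples, 90, 120, interval_s):   # 90% for 2 minutes
        return "critical"
    if sustained_breach(samples, 70, 300, interval_s):   # 70% for 5 minutes
        return "warning"
    return "ok"
```

With these settings a single 100% sample surrounded by normal readings stays "ok", which matches the backup-window case described above.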

Interface Errors

Error counters reveal physical layer problems that don't always cause visible outages but degrade performance.

CRC Errors

Severity: Critical

Frames that fail the frame check sequence (FCS) on receipt. Usually indicates cable problems, bad SFPs, or duplex mismatches. Any sustained CRC errors need investigation.

Input Errors

Severity: Warning

Packets received with errors. Can include CRC errors, runts, giants, and other malformed frames. Track the trend, not the absolute count.

Output Drops

Severity: Warning

Packets dropped because the output queue overflowed, which indicates congestion. Occasional drops are common during traffic bursts, but sustained drops mean undersized buffers or links.

Tip: Monitor error rates (errors per minute), not cumulative counters. A cumulative counter only increases until it is cleared or the device reboots, so its raw value can't show when a problem started or when it resolved.
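Converting a cumulative counter into a per-minute rate only needs two samples and the time between them. A minimal sketch, assuming the counter goes backwards only when the device reboots or the counters are cleared (the function name is illustrative):

```python
def errors_per_minute(prev_count, curr_count, interval_s):
    """Per-minute error rate from two cumulative counter samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr_count - prev_count
    if delta < 0:
        # Counter restarted from zero (reboot or manual clear), so only
        # the errors counted since the restart are known.
        delta = curr_count
    return delta * 60.0 / interval_s
```

A rate that falls back to zero shows the problem has stopped, something the raw counter value never shows.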

Packet Loss

Packet loss directly impacts user experience. Even 1% loss can make VoIP calls sound terrible and cause TCP retransmissions that slow applications.

Loss Rate	Impact
< 0.1%	Normal, not noticeable
0.1% - 1%	VoIP quality degradation, minor slowdowns
1% - 5%	Significant performance impact, user complaints
> 5%	Severe degradation, connections may fail

Measure packet loss with synthetic probes, such as ICMP ping or dedicated monitoring packets. SNMP counters show interface-level discards but don't capture end-to-end loss.
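Given the number of probes sent and the number of replies received, the loss rate and its place in the table above can be computed as follows. This is a sketch only: the function names and impact labels are made up for illustration, and the boundary values (exactly 1% or 5%) are assigned to the lower band as an assumption.

```python
def loss_percent(sent, received):
    """Percentage of probes that received no reply."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent


def loss_impact(pct):
    """Map a loss percentage to the impact bands in the table above."""
    if pct < 0.1:
        return "normal"
    if pct <= 1:
        return "voip-degradation"
    if pct <= 5:
        return "significant"
    return "severe"
```

For example, 998 replies to 1000 probes is 0.2% loss, which falls in the VoIP degradation band.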

Latency

Latency measures the time for packets to travel between points. Low latency matters for real-time applications and user-perceived responsiveness.

Latency	Typical Scope
< 50ms	LAN / Local
50-150ms	Regional WAN
> 150ms	Intercontinental

Track both average latency and variance (jitter). High jitter causes problems even when average latency is acceptable. VoIP and video are especially sensitive to jitter.

# Key latency metrics
rtt_avg: 45ms     # Average round-trip time
rtt_min: 42ms     # Best case
rtt_max: 68ms     # Worst case
jitter: 12ms      # Variation between samples
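The metrics above can be derived from a series of round-trip time samples. In this sketch jitter is taken as the mean absolute difference between consecutive samples, which is one of several definitions in use (RTP, for instance, uses a smoothed estimator), so treat that choice as an assumption. The function name is hypothetical.

```python
from statistics import mean


def latency_summary(rtts_ms):
    """Summarize RTT samples (milliseconds, in the order they were measured)."""
    if len(rtts_ms) < 2:
        raise ValueError("need at least two samples to compute jitter")
    # Absolute change between each sample and the one before it.
    diffs = [abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:])]
    return {
        "rtt_avg": mean(rtts_ms),
        "rtt_min": min(rtts_ms),
        "rtt_max": max(rtts_ms),
        "jitter": mean(diffs),
    }
```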

Device Health Metrics

Beyond interface metrics, monitor the health of the device itself.

CPU Utilization

Sustained high CPU can cause packet drops and slow management response.

Alert: > 80% for 5 minutes

Memory Usage

Low free memory can cause routing table issues and management plane problems.

Alert: > 90% utilization

Temperature

High temps indicate cooling failures or environmental issues.

Alert: Vendor-specific thresholds

Uptime

Unexpected reboots indicate crashes or power issues.

Alert: Uptime reset unexpectedly
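An uptime reset can be detected by comparing consecutive uptime readings. The sketch below assumes the reading is SNMP sysUpTime, which is reported in hundredths of a second in a 32-bit value and therefore wraps after roughly 497 days. A wrap is told apart from a reboot by checking whether the wrapped difference matches the wall-clock time between polls. The function name and the 60-second tolerance are illustrative.

```python
WRAP_TICKS = 2 ** 32  # assumes a 32-bit counter of 1/100 s ticks


def uptime_reset(prev_ticks, curr_ticks, elapsed_s, tolerance_s=60):
    """True if uptime went backwards and a counter wrap does not explain it."""
    if curr_ticks >= prev_ticks:
        return False  # uptime advanced normally
    # If the counter merely wrapped, the wrapped delta equals the
    # wall-clock time between the two polls (within tolerance).
    wrapped_delta_s = (curr_ticks + WRAP_TICKS - prev_ticks) / 100
    return abs(wrapped_delta_s - elapsed_s) > tolerance_s
```

Whether a detected reset was unexpected still has to be decided elsewhere, for example by checking it against a maintenance calendar.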

Setting Meaningful Thresholds

Threshold values depend on your environment. A 70% utilization alert makes sense for a 10 Gbps uplink but is too sensitive for a 100 Mbps branch office link that regularly bursts.

1. Baseline first: Collect 2-4 weeks of data before setting alerts. Understand normal patterns.
2. Use percentiles: Alert when the current value exceeds the 95th percentile of historical data.
3. Require duration: Don't alert on instantaneous spikes. Require the condition to persist.
4. Review regularly: Thresholds that made sense last year may not fit current traffic patterns.
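The percentile and duration steps can be combined in a few lines. This sketch uses the nearest-rank percentile method (other methods interpolate and give slightly different values) and treats "persist" as every recent sample being above the threshold. Both choices are assumptions, and the function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def percentile_alert(baseline, recent, pct=95):
    """True if every recent sample exceeds the baseline percentile."""
    if not recent:
        return False
    threshold = percentile(baseline, pct)
    return all(v > threshold for v in recent)
```

By definition about 5% of normal samples sit above the 95th percentile, which is why a single reading over the threshold is not enough to alert on.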