Essential Network Metrics Every Admin Should Track
Bandwidth, packet loss, latency, and errors. Which metrics matter and how to set meaningful alert thresholds.
Bandwidth Utilization
Bandwidth utilization shows how much of your link capacity is in use. It's the most common metric but often misunderstood.
Inbound (Rx)
Traffic coming into the interface. For uplinks, this is typically download traffic from the internet or upstream network.
Outbound (Tx)
Traffic leaving the interface. For server connections, this is often higher than inbound due to response data.
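As a rough sketch, utilization in either direction can be derived from two readings of an interface's cumulative octet counter. The function name, its parameters, and the sample values below are illustrative assumptions rather than part of any particular monitoring tool, and counter wraparound is ignored for brevity.

```python
def utilization_percent(octets_start, octets_end, interval_seconds, link_speed_bps):
    """Percent of link capacity used between two octet-counter readings."""
    if interval_seconds <= 0 or link_speed_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    # Counters report bytes (octets); link speed is in bits per second.
    bits_transferred = (octets_end - octets_start) * 8
    return 100.0 * bits_transferred / (interval_seconds * link_speed_bps)


# Hypothetical readings taken 60 seconds apart on a 1 Gbps link.
rx = utilization_percent(0, 3_750_000_000, 60, 1_000_000_000)
tx = utilization_percent(0, 1_500_000_000, 60, 1_000_000_000)
print(f"Rx {rx:.1f}%  Tx {tx:.1f}%")
```

Rx and Tx are computed separately because a full-duplex link has independent capacity in each direction.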
Alert Thresholds
- Warning: 70% utilization sustained for 5 minutes
- Critical: 90% utilization sustained for 2 minutes
Don't alert on brief spikes. A link hitting 100% for 30 seconds during a backup window is normal. Sustained high utilization indicates capacity problems.
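One way to express "sustained for N minutes" is to require that every sample in the trailing window meets the threshold. The sketch below assumes samples arrive as (timestamp, utilization) pairs; the function name and the sample data are hypothetical.

```python
def sustained_above(samples, threshold, duration_seconds):
    """Return True if utilization stayed at or above threshold for the whole window.

    samples: list of (timestamp_seconds, utilization_percent), oldest first.
    """
    if not samples:
        return False
    window_start = samples[-1][0] - duration_seconds
    # Without history covering the full window, we cannot call it sustained.
    if samples[0][0] > window_start:
        return False
    window = [value for ts, value in samples if ts >= window_start]
    return all(value >= threshold for value in window)


# Hypothetical 30-second samples: one brief spike vs. a consistently busy link.
spike = [(t, 100 if t == 30 else 40) for t in range(0, 301, 30)]
busy = [(t, 82) for t in range(0, 301, 30)]

print(sustained_above(spike, 70, 300))  # brief spike: no warning
print(sustained_above(busy, 70, 300))   # warning: 70% for 5 minutes
print(sustained_above(busy, 90, 120))   # not critical: below 90%
```

A single sample below the threshold resets the condition, which is what keeps a 30-second backup burst from paging anyone.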
Interface Errors
Error counters reveal physical layer problems that don't always cause visible outages but degrade performance.
CRC Errors
Severity: Critical. Frame check failures. Usually indicates cable problems, bad SFPs, or duplex mismatches. Any sustained CRC errors need investigation.
Input Errors
Severity: Warning. Packets received with errors. Can include CRC, runts, giants, and other malformed frames. Track the trend, not absolute count.
Output Drops
Severity: Warning. Packets dropped due to queue overflow. Indicates congestion. Common during traffic bursts but sustained drops mean undersized buffers or links.
Tip: Monitor error rates (errors per minute), not cumulative counters. Counters only go up, making it hard to see when problems resolve.
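Turning a cumulative counter into a rate only needs two readings and the time between them. This is a minimal sketch: the function name and the 32-bit default are assumptions, and the wrap handling assumes the counter rolled over rather than being reset by a reboot.

```python
def errors_per_minute(prev_count, curr_count, interval_seconds, counter_bits=32):
    """Convert two cumulative error-counter readings into errors per minute."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    delta = curr_count - prev_count
    if delta < 0:
        # Assumption: the counter wrapped past its maximum value.
        # A device reboot also resets counters and would need separate handling.
        delta += 2 ** counter_bits
    return delta * 60.0 / interval_seconds


# Hypothetical readings taken 5 minutes apart.
print(errors_per_minute(1000, 1030, 300))        # 6.0 errors per minute
# Hypothetical 32-bit counter that wrapped between polls 60 seconds apart.
print(errors_per_minute(2 ** 32 - 10, 5, 60))    # 15.0 errors per minute
```

A rate that falls back to zero shows a problem has cleared, which the raw counter never does.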
Packet Loss
Packet loss directly impacts user experience. Even 1% loss can make VoIP calls sound terrible and cause TCP retransmissions that slow applications.
| Loss Rate | Impact |
|---|---|
| < 0.1% | Normal, not noticeable |
| 0.1% - 1% | VoIP quality degradation, minor slowdowns |
| 1% - 5% | Significant performance impact, user complaints |
| > 5% | Severe degradation, connections may fail |
Measure packet loss with synthetic probes such as ICMP ping or dedicated monitoring packets. SNMP counters show interface-level discards, but they don't capture end-to-end loss.
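Given probe counts from whatever tool sends the probes, the loss rate and its band in the table above can be worked out as follows. This is a sketch: the function names are hypothetical, and treating the upper bound of each band as inclusive is an assumption, since the table leaves the exact boundaries open.

```python
def loss_percent(probes_sent, probes_received):
    """Percentage of probes that never came back."""
    if probes_sent <= 0:
        raise ValueError("probes_sent must be positive")
    return 100.0 * (probes_sent - probes_received) / probes_sent


def classify_loss(percent):
    """Map a loss percentage onto the impact bands from the table."""
    if percent < 0.1:
        return "Normal, not noticeable"
    if percent <= 1:
        return "VoIP quality degradation, minor slowdowns"
    if percent <= 5:
        return "Significant performance impact, user complaints"
    return "Severe degradation, connections may fail"


# Hypothetical probe run: 1000 sent, 997 answered.
loss = loss_percent(1000, 997)
print(f"{loss:.1f}% loss: {classify_loss(loss)}")
```

Larger probe counts give a more trustworthy figure; with only 10 probes, a single lost packet already reads as 10% loss.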
Latency
Latency measures the time for packets to travel between points. Low latency matters for real-time applications and user-perceived responsiveness.
Track both average latency and its variation (jitter). High jitter causes problems even when average latency is acceptable. VoIP and video are especially sensitive to jitter.
    # Key latency metrics
    rtt_avg: 45ms   # Average round-trip time
    rtt_min: 42ms   # Best case
    rtt_max: 68ms   # Worst case
    jitter: 12ms    # Variation between samples
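These four values can be derived from a list of round-trip samples. The sketch below takes jitter to be the mean absolute difference between consecutive samples, which is one common definition and an assumption here; other tools use a smoothed estimate or the standard deviation. The function name and sample values are hypothetical.

```python
from statistics import mean


def latency_summary(rtt_ms):
    """Summarize round-trip samples (in milliseconds, in arrival order)."""
    if len(rtt_ms) < 2:
        raise ValueError("need at least two samples to compute jitter")
    # Jitter here: average change between one sample and the next.
    deltas = [abs(later - earlier) for earlier, later in zip(rtt_ms, rtt_ms[1:])]
    return {
        "rtt_avg": mean(rtt_ms),
        "rtt_min": min(rtt_ms),
        "rtt_max": max(rtt_ms),
        "jitter": mean(deltas),
    }


# Hypothetical probe results in milliseconds.
summary = latency_summary([42, 50, 44, 46])
for name, value in summary.items():
    print(f"{name}: {value:.1f}ms")
```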
Device Health Metrics
Beyond interface metrics, monitor the health of the device itself.
CPU Utilization
Sustained high CPU can cause packet drops and slow management response.
Memory Usage
Low memory causes routing table issues and management plane problems.
Temperature
High temps indicate cooling failures or environmental issues.
Uptime
Unexpected reboots indicate crashes or power issues.
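Reboots can be spotted by comparing consecutive uptime readings: if uptime went down, the device restarted in between. This is a minimal sketch with hypothetical names and values. It does not account for uptime counters that wrap on their own after very long uptimes.

```python
def detect_reboot(prev_uptime_seconds, curr_uptime_seconds, polled_at_epoch):
    """Return the estimated boot time (epoch seconds) if a reboot occurred, else None."""
    if curr_uptime_seconds >= prev_uptime_seconds:
        return None
    # Uptime dropped, so the device came back up curr_uptime_seconds ago.
    return polled_at_epoch - curr_uptime_seconds


# Hypothetical polls: 10 days of uptime, then only 2 minutes on the next poll.
print(detect_reboot(864_000, 120, 1_700_000_000))      # estimated boot time
print(detect_reboot(864_000, 864_300, 1_700_000_000))  # None: no reboot
```

Comparing the estimated boot time against planned maintenance windows separates expected restarts from crashes.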
Setting Meaningful Thresholds
Threshold values depend on your environment. A 70% utilization alert makes sense for a 10 Gbps uplink but is too sensitive for a 100 Mbps branch office link that regularly bursts.
1. Baseline first: Collect 2-4 weeks of data before setting alerts. Understand normal patterns.
2. Use percentiles: Alert when the current value exceeds the 95th percentile of historical data.
3. Require duration: Don't alert on instantaneous spikes. Require the condition to persist.
4. Review regularly: Thresholds that made sense last year may not fit current traffic patterns.
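The percentile step can be sketched as follows, using the nearest-rank method on baseline samples. The function name and the baseline data are hypothetical, and nearest-rank is one of several percentile definitions; interpolating methods give slightly different values.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no baseline data")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical baseline: 100 utilization samples ranging from 1% to 100%.
baseline = list(range(1, 101))
threshold = percentile(baseline, 95)

current_value = 97  # hypothetical latest reading
print(f"threshold={threshold}, exceeded={current_value > threshold}")
```

In practice this check would be combined with a duration requirement so a single reading above the threshold does not fire an alert.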