One of the biggest misconceptions in networking is using a traceroute to conclude that your communication with a server has high latency. On Windows, traceroute is invoked as tracert.
Many people believe that when they see high latency, such as 250 ms or more, on a single hop of a traceroute, the device at that hop must be responsible for the degraded network performance, when in fact that is often far from the truth.
First, let's look at how ping works.
Ping is an application based on the ICMP protocol. It sends echo request packets to a destination, expects echo replies, and calculates the RTT (round-trip time) from the moment a request is sent to the moment the reply is received. On a LAN you can generally trust what ping reports, unless you know that devices in the transit path deprioritize ICMP relative to mission-critical TCP/UDP traffic. That is very common in networks carrying unified communications, meaning voice and data on the same network: QoS policies are put in place to ensure voice and other mission-critical traffic is prioritized over ICMP, which indirectly inflates the RTT reported by an ICMP ping test.
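For illustration, here is a minimal Python sketch of what ping does under the hood: it sends a single ICMP echo request over a raw socket and times the reply. The raw socket requires root/administrator privileges, and the hostname, payload, and timeout are arbitrary choices for the example, not anything ping itself mandates.

# Minimal sketch of ping: send one ICMP echo request and time the echo reply.
# Requires root/administrator for the raw socket.
import socket, struct, time, os

def checksum(data: bytes) -> int:
    # Standard one's-complement Internet checksum used by ICMP.
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    total = (total >> 16) + (total & 0xFFFF)
    total += total >> 16
    return ~total & 0xFFFF

def icmp_rtt(host: str, timeout: float = 2.0) -> float:
    dest = socket.gethostbyname(host)
    sock = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
    sock.settimeout(timeout)
    ident = os.getpid() & 0xFFFF
    payload = b"ping-rtt-demo"
    # ICMP echo request: type 8, code 0, checksum, identifier, sequence 1.
    header = struct.pack("!BBHHH", 8, 0, 0, ident, 1)
    header = struct.pack("!BBHHH", 8, 0, checksum(header + payload), ident, 1)
    start = time.monotonic()
    sock.sendto(header + payload, (dest, 0))
    sock.recvfrom(1024)                 # echo reply (or a timeout exception)
    return (time.monotonic() - start) * 1000.0

if __name__ == "__main__":
    print("RTT: %.1f ms" % icmp_rtt("www.google.com"))

Note that this simple sketch times whatever ICMP packet arrives next; the real ping utility also matches the identifier and sequence number in the reply before accepting it.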
Traceroute is another method commonly used by technicians and engineers to diagnose latency in the transit path; however, any engineer who has studied how traceroute works knows that its per-hop latency figures are easily misleading.
Traceroute works in a manner similar to ping, but it manipulates the TTL field so that each successive hop in the transit path responds with an ICMP Time Exceeded (TTL expired) packet. This gives you the ability to determine which network devices the packets traverse.
When you dig deeper into the operation of traceroute, you will see that it sends three probe packets for each successive hop by default, unless you specify otherwise. Each probe indirectly measures the latency between the source and the device where the TTL expires; this latency calculation is a byproduct of its true intended purpose. Keep in mind that even if you send probes to a device five hops away, random latency spikes at any of the four devices before it can make the fifth hop look like it has high latency.
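To make the TTL mechanics concrete, here is a rough Python sketch of the idea (not the real traceroute implementation): it sends UDP probes with increasing TTL values and times the ICMP Time Exceeded replies. The UDP port, probe count, and timeout are assumptions chosen to mirror the defaults discussed above, and the raw ICMP receive socket requires root privileges.

# Rough sketch of the TTL trick traceroute relies on: send probes with
# TTL = 1, 2, 3, ... and time the ICMP "time exceeded" each router returns
# when it decrements the TTL to zero. Three probes per hop mirrors the default.
import socket, time

def trace(host: str, max_hops: int = 30, probes: int = 3, timeout: float = 2.0):
    dest = socket.gethostbyname(host)
    recv = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
    recv.settimeout(timeout)
    port = 33434                        # traditional traceroute UDP port
    for ttl in range(1, max_hops + 1):
        results = []
        for _ in range(probes):
            send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
            send.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
            start = time.monotonic()
            send.sendto(b"", (dest, port))
            try:
                # Either a time-exceeded from a transit hop or a
                # port-unreachable from the destination itself.
                _, addr = recv.recvfrom(512)
                rtt = (time.monotonic() - start) * 1000.0
                results.append((addr[0], round(rtt, 1)))
            except socket.timeout:
                results.append(("*", None))
            send.close()
        print(ttl, results)
        if results and results[0][0] == dest:
            break                       # reached the destination
    recv.close()

if __name__ == "__main__":
    trace("www.google.com")

Because every probe for hop N still has to pass through hops 1 through N-1, any queuing or scheduling delay at an earlier device is baked into the RTT reported for the later hop.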
Also note that any control plane policing (CoPP) policy enforced on a device in the transit path can rate-limit or delay the ICMP messages that device's control plane has to answer. ICMP replies and TTL-expired messages are process switched (handled by the CPU) on most devices, whereas transit TCP/UDP traffic is forwarded in hardware.
Below is an example of a traceroute on Windows 7:
C:\>tracert www.google.com -d

Tracing route to www.google.com [74.125.225.113]
over a maximum of 30 hops:

  1     1 ms    <1 ms    <1 ms  10.100.38.2
  2     1 ms     1 ms    <1 ms  209.51.231.145
  3     5 ms     4 ms     3 ms  64.65.234.204
  4     7 ms     7 ms     7 ms  64.69.98.140
  5    29 ms    29 ms    29 ms  64.69.97.217
  6    30 ms    29 ms    29 ms  64.69.97.219
  7    31 ms    31 ms    32 ms  128.242.186.161
  8    30 ms    30 ms    29 ms  129.250.197.146
  9    30 ms    29 ms    30 ms  209.85.254.120
 10    33 ms    30 ms    30 ms  209.85.240.150
 11    29 ms    30 ms    29 ms  74.125.225.113

Trace complete.

C:\>
You can see from the traceroute shown above that there are three probes per hop between the source and destination, and that latency does not appear until traffic traverses 64.69.97.217.
The whole point of this blog is to teach you how to interpret such data. Just because you see a spike in latency at the fifth hop does not mean the fifth hop is causing latency. It can easily mean that the control plane of the device at the fifth hop is under marginal load and its processor does not respond to ICMP immediately because other processes have priority.
Just because you see potential latency with traceroute, you should never assume it is an accurate representation of latency for TCP/UDP traffic, because ICMP and TCP/UDP traffic are treated completely differently by a router's control and forwarding planes.
Most ISPs use control plane policing (CoPP) to prevent ICMP floods from overwhelming a device's control plane. This type of flood-prevention mechanism can also skew the data in a traceroute.
Shown below is a simple CoPP policy that can result in skewed traceroute data.
!
class-map match-all Catch-All-IP
 match access-group 124
class-map match-all Management
 match access-group 121
class-map match-all Normal
 match access-group 122
class-map match-all Undesirable
 match access-group 123
class-map match-all Routing
 match access-group 120
!
policy-map RTR_CoPP
 class Undesirable
  police 8000 1500 1500 conform-action drop exceed-action drop
 class Routing
  police 1000000 50000 50000 conform-action transmit exceed-action transmit
 class Management
  police 100000 20000 20000 conform-action transmit exceed-action drop
 class Normal
  police 50000 5000 5000 conform-action transmit exceed-action drop
 class Catch-All-IP
  police 50000 5000 5000 conform-action transmit exceed-action drop
 class class-default
  police 8000 1500 1500 conform-action transmit exceed-action transmit
!
access-list 120 permit tcp any gt 1024 10.0.1.0 0.0.0.255 eq bgp
access-list 120 permit tcp any eq bgp 10.0.1.0 0.0.0.255 gt 1024 established
access-list 120 permit tcp any gt 1024 10.0.1.0 0.0.0.255 eq 639
access-list 120 permit tcp any eq 639 10.0.1.0 0.0.0.255 gt 1024 established
access-list 120 permit tcp any 10.0.1.0 0.0.0.255 eq 646
access-list 120 permit udp any 10.0.1.0 0.0.0.255 eq 646
access-list 120 permit ospf any 10.0.1.0 0.0.0.255
access-list 120 permit ospf any host 224.0.0.5
access-list 120 permit ospf any host 224.0.0.6
access-list 120 permit eigrp any 10.0.1.0 0.0.0.255
access-list 120 permit eigrp any host 224.0.0.10
access-list 121 permit tcp 10.0.2.0 0.0.0.255 10.0.1.0 0.0.0.255 eq telnet
access-list 121 permit tcp 10.0.2.0 0.0.0.255 eq telnet 10.0.1.0 0.0.0.255 established
access-list 121 permit tcp 10.0.2.0 0.0.0.255 10.0.1.0 0.0.0.255 eq 22
access-list 121 permit tcp 10.0.2.0 0.0.0.255 eq 22 10.0.1.0 0.0.0.255 established
access-list 121 permit udp 10.0.2.0 0.0.0.255 10.0.1.0 0.0.0.255 eq snmp
access-list 121 permit tcp 10.0.2.0 0.0.0.255 10.0.1.0 0.0.0.255 eq www
access-list 121 permit udp 10.0.2.0 0.0.0.255 10.0.1.0 0.0.0.255 eq 443
access-list 121 permit tcp 10.0.2.0 0.0.0.255 10.0.1.0 0.0.0.255 eq ftp
access-list 121 permit tcp 10.0.2.0 0.0.0.255 10.0.1.0 0.0.0.255 eq ftp-data
access-list 121 permit udp 10.0.2.0 0.0.0.255 10.0.1.0 0.0.0.255 eq syslog
access-list 121 permit udp 10.0.3.0 0.0.0.255 eq domain 10.0.1.0 0.0.0.255
access-list 121 permit udp 10.0.4.0 0.0.0.255 10.0.1.0 0.0.0.255 eq ntp
access-list 122 permit icmp any 10.0.1.0 0.0.0.255 echo
access-list 122 permit icmp any 10.0.1.0 0.0.0.255 echo-reply
access-list 122 permit icmp any 10.0.1.0 0.0.0.255 ttl-exceeded
access-list 122 permit icmp any 10.0.1.0 0.0.0.255 packet-too-big
access-list 122 permit icmp any 10.0.1.0 0.0.0.255 port-unreachable
access-list 122 permit icmp any 10.0.1.0 0.0.0.255 unreachable
access-list 122 permit pim any any
access-list 122 permit udp any any eq pim-auto-rp
access-list 122 permit igmp any any
access-list 122 permit gre any any
access-list 123 permit icmp any any fragments
access-list 123 permit udp any any fragments
access-list 123 permit tcp any any fragments
access-list 123 permit ip any any fragments
access-list 123 permit udp any any eq 1434
access-list 123 permit tcp any any eq 639 rst
access-list 123 permit tcp any any eq bgp rst
access-list 124 permit tcp any any
access-list 124 permit udp any any
access-list 124 permit icmp any any
access-list 124 permit ip any any
!
control-plane
 service-policy input RTR_CoPP
!
If you examine the CoPP policy in detail, you will notice that ICMP destined to the control plane is rate-limited to 50,000 bps, as shown below, with a burst allowance of 5,000 bytes. Traffic that conforms to the policy is transmitted; traffic that exceeds it is dropped.
 class Catch-All-IP
  police 50000 5000 5000 conform-action transmit exceed-action drop
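To put that policer in perspective, here is a back-of-the-envelope calculation. The rate and burst values come from the class above; the 100-byte ICMP packet size is an assumption for illustration.

# Rough view of the policer above: how many ICMP packets per second fit
# under a 50,000 bps rate before the exceed-action starts dropping them?
RATE_BPS  = 50_000          # police rate in bits per second (from the class)
BURST_B   = 5_000           # burst size in bytes (from the class)
PKT_BYTES = 100             # assumed size of one ICMP echo / TTL-exceeded packet

pkts_per_sec = RATE_BPS / (PKT_BYTES * 8)
burst_pkts   = BURST_B / PKT_BYTES

print(f"Sustained: ~{pkts_per_sec:.0f} ICMP packets/s conform to the policer")
print(f"Burst:     ~{burst_pkts:.0f} packets can arrive back-to-back before drops")

At an assumed 100 bytes per packet that is only about 62 packets per second, with a 50-packet burst bucket. Probe traffic beyond that is simply dropped, which shows up as loss or inflated latency in a traceroute even though forwarded traffic is unaffected.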
With this in mind, you should always use traceroute for its intended purpose, which is to determine the route traffic takes through the transit path; the latency shown for each per-hop probe is to be taken with a grain of salt when public devices are involved.
The intended purpose of sending three probes per hop is to reveal whether traffic traverses multiple routed paths due to route engineering, not to measure the latency three times.
I will conclude this blog with the pathping command. This Windows command is similar to traceroute, but it combines traceroute with ping to give you a better understanding of latency in the transit path.
Pathping works by first performing a traceroute to the destination, then using ICMP to ping each hop in the transit path 100 times. This measures latency and loss between the source and each hop via ICMP echo. But remember what I said earlier: you cannot rely on ICMP when public devices are involved. You can run into cases where ICMP pings destined to one hop in the transit path drop 40% of the traffic while the next hop shows a 100% success rate. This is due to CoPP.
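For illustration, here is a rough Python sketch of that second phase: ping each known hop repeatedly and summarize loss and average RTT. The hop addresses are illustrative, the count of 100 mirrors pathping's default, and the sketch simply wraps the operating system's ping command rather than reimplementing ICMP.

# Rough sketch of pathping's second phase: given a hop list from a traceroute,
# ping each hop repeatedly and report per-hop loss and average RTT.
import subprocess, platform, re, statistics

HOPS  = ["10.100.38.2", "209.51.231.145", "64.65.234.204"]   # example hop IPs
COUNT = 100                                                   # pathping default

def probe(host):
    """Return one RTT in ms using the system ping, or None on loss/timeout."""
    flag = "-n" if platform.system() == "Windows" else "-c"
    out = subprocess.run(["ping", flag, "1", host],
                         capture_output=True, text=True).stdout
    m = re.search(r"time[=<]([\d.]+)\s*ms", out)
    return float(m.group(1)) if m else None

for hop in HOPS:
    rtts = [r for r in (probe(hop) for _ in range(COUNT)) if r is not None]
    loss = 100 * (COUNT - len(rtts)) / COUNT
    avg  = statistics.mean(rtts) if rtts else float("nan")
    print(f"{hop:<16} loss={loss:5.1f}%  avg={avg:6.1f} ms")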
Pathping is generally a much better tool for diagnosing latency from a specific source to a destination with a relative degree of accuracy. Note that I said relative: latency is always relative to your location on the network.
Shown below is an example of pathping at work:
C:\>pathping www.google.com -n

Tracing route to www.google.com [74.125.225.116]
over a maximum of 30 hops:
  0  10.100.38.162
  1  10.100.38.2
  2  209.51.231.145
  3  64.65.234.204
  4  64.69.98.171
  5  64.69.99.238
  6  165.121.238.178
  7  64.214.141.253
  8  67.16.132.174
  9  72.14.218.13
 10  72.14.238.232
 11  72.14.236.206
 12  216.239.46.215
 13  72.14.237.132
 14  209.85.240.150
 15  74.125.225.116

Computing statistics for 375 seconds...
            Source to Here   This Node/Link
Hop  RTT    Lost/Sent = Pct  Lost/Sent = Pct  Address
  0                                           10.100.38.162
                                0/ 100 =  0%   |
  1    1ms     0/ 100 =  0%     0/ 100 =  0%  10.100.38.2
                                0/ 100 =  0%   |
  2    0ms     0/ 100 =  0%     0/ 100 =  0%  209.51.231.145
                                0/ 100 =  0%   |
  3    4ms     0/ 100 =  0%     0/ 100 =  0%  64.65.234.204
                                0/ 100 =  0%   |
  4    6ms     0/ 100 =  0%     0/ 100 =  0%  64.69.98.171
                                0/ 100 =  0%   |
  5   22ms     0/ 100 =  0%     0/ 100 =  0%  64.69.99.238
                                0/ 100 =  0%   |
  6   10ms     0/ 100 =  0%     0/ 100 =  0%  165.121.238.178
                                0/ 100 =  0%   |
  7   34ms     0/ 100 =  0%     0/ 100 =  0%  64.214.141.253
                                0/ 100 =  0%   |
  8   37ms     0/ 100 =  0%     0/ 100 =  0%  67.16.132.174
                                0/ 100 =  0%   |
  9   35ms     0/ 100 =  0%     0/ 100 =  0%  72.14.218.13
                                0/ 100 =  0%   |
 10  ---     100/ 100 =100%   100/ 100 =100%  72.14.238.232
                                0/ 100 =  0%   |
 11  ---     100/ 100 =100%   100/ 100 =100%  72.14.236.206
                                0/ 100 =  0%   |
 12  ---     100/ 100 =100%   100/ 100 =100%  216.239.46.215
                                0/ 100 =  0%   |
 13  ---     100/ 100 =100%   100/ 100 =100%  72.14.237.132
                                0/ 100 =  0%   |
 14  ---     100/ 100 =100%   100/ 100 =100%  209.85.240.150
                                0/ 100 =  0%   |
 15   36ms     0/ 100 =  0%     0/ 100 =  0%  74.125.225.116

Trace complete.

C:\>
As you can see from the pathping shown above, some hops in the transit path completely drop ICMP. You can also notice that the latency to hop 5 is higher than the latency to hop 6. This suggests that either control plane policing is in use on 64.69.99.238 or the process utilization on hop 5 is relatively higher.
You should know that there are other tools out there that are extremely useful when diagnosing latency-related problems. Most of these tools rely on ICMP, so your decision to trust them should be based on your understanding of the transit path. One of these tools is PingPlotter. There are also several useful tools included in the SolarWinds Engineer's Toolset; however, that toolset is extremely expensive. You can download a trial of the SolarWinds Engineer's Toolset and check it out.
The most accurate tools depend on TCP; however, since TCP is a connection-oriented protocol, both the source and destination must be willing to participate in the testing. Some tools are hardware based, such as the Fluke Networks EtherScope, which costs several thousand dollars.
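A simple way to approximate TCP-based latency measurement is to time the TCP three-way handshake. Below is a minimal Python sketch that does this, assuming the target is reachable on port 443 and willing to accept the connection; the host, port, and sample count are arbitrary choices for the example.

# Time the TCP three-way handshake as a proxy for application-path latency.
import socket, time, statistics

def tcp_connect_ms(host: str, port: int = 443, timeout: float = 3.0) -> float:
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass                     # connection established: SYN / SYN-ACK / ACK done
    return (time.monotonic() - start) * 1000.0

if __name__ == "__main__":
    samples = [tcp_connect_ms("www.google.com") for _ in range(5)]
    print("TCP connect RTT: min %.1f ms / avg %.1f ms / max %.1f ms"
          % (min(samples), statistics.mean(samples), max(samples)))

Because the SYN/SYN-ACK exchange is forwarded like any other TCP traffic, this number tracks what applications actually experience far better than an ICMP probe that may be punted to a rate-limited control plane.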
So, in conclusion, your decision to trust and use data from ICMP-based troubleshooting should be based on your relative understanding of the transit path. You should never take a traceroute that shows high latency and declare it a network issue just because hop 7 has latency greater than 250 ms. That is no different than a doctor telling you your spleen is the cause of your headaches without any factual basis.
If you do not have clear factual data when diagnosing a problem and you blame the network because of a traceroute, you may very well be missing the root cause of the problem entirely. Think of it as getting tunnel vision when sh!t hits the fan, management is expecting answers, and the first thing you notice is high latency on a traceroute. Without completely understanding traceroute, you may be fixating on an issue that is not really an issue at all.