* removed for unloading on a specific person, and apparently misunderstand what the hell the post meant *
Updated by: Cary Wiedemann at Oct 07 00:01
Hello again David,
First, please let me again apologize for the delay in communicating our ongoing efforts to stamp out this trouble completely. Please be assured that we have been and will continue to actively work on this issue until it is no longer service impacting.
Additionally, I have just read over this entire case from the beginning and agree that the support you've received has been dismal. I assure you that the service you've received on this ticket is far from our usual prompt and thorough solutions. The trouble in this ticket is caused by the multitude of perplexing factors that this individual case presents, namely intermittent networking issues from certain locations at specific times. This is certainly not an excuse for the way this ticket has been handled, but please understand that the myriad of symptoms and the involvement of our core networking equipment have resulted in a greatly delayed resolution.
For the past few weeks I have been monitoring our service reliability at the "NYIIX" internet exchange in New York City. This has been the principal focus of the initial investigation as it appeared that our connection from New York, NY to McLean, VA (where your server is located) was being saturated. The traceroutes that both you and your users have provided seemed to confirm this. However, my week-long ping tests to the ISC's (Internet Systems Consortium) NYIIX housed router showed not a single packet exceeded 50ms of latency for the entirety of the test (over one week and 600,000 packets). As this pipe is the same one shared by all other NYIIX peering it seems that something larger may be occurring.
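As a rough illustration of the check described above (did any reply exceed 50 ms, and were any packets lost), here is a minimal Python sketch. The function name `summarize_rtts` and the sample values are hypothetical; only the 50 ms threshold comes from the ticket, and this is not the tool that was actually used for the week-long test.

```python
def summarize_rtts(rtts_ms, threshold_ms=50.0):
    """Summarize ping round-trip times in milliseconds.

    A value of None stands for a probe that received no reply.
    """
    answered = [r for r in rtts_ms if r is not None]
    return {
        "sent": len(rtts_ms),
        "lost": len(rtts_ms) - len(answered),
        # Replies slower than the threshold (50 ms in the ticket)
        "over_threshold": sum(1 for r in answered if r > threshold_ms),
        "max_ms": max(answered) if answered else None,
    }


# Hypothetical samples, not data from the ticket
sample = [12.1, 11.8, None, 13.0, 48.9, 12.4]
print(summarize_rtts(sample))
```

A run with zero entries over the threshold and zero lost packets corresponds to the clean result reported for the NYIIX router.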
I have also configured a "ping monitor" on your server and have the results currently being sent to my personal email account (and by extension my BlackBerry) so that I can catch any other troubles in real time and investigate what may be occurring. Thus far this monitor has produced no alerts and has been configured since 10/1. You can see this monitor and configure additional monitors by visiting the "Server Monitor" section of your myCP control panel.
I have just also gone back through the 'SolapTraces@yahoo.com' email box and the two relevant forum threads in order to re-analyze any and all traceroutes that were provided. Out of the numerous traceroutes provided only a small fraction were both usable and showed the trouble "in action." I will reproduce several of these below:
This traceroute is among the most damning as it clearly shows a spike at our NYIIX/DCA2 hop and fluctuating latency after that point:
1 47 ms 47 ms 47 ms 192.168.1.99
2 58 ms 57 ms 9 ms tkueur1.fi.elisa.net [85.156.192.1]
3 56 ms 53 ms 54 ms ge1-1-2.tkutur-p1.fi.elisa.net [139.97.9.18]
4 56 ms 67 ms 56 ms so5-3-0.helpa-p1.fi.elisa.net [139.97.29.137]
5 55 ms 56 ms 55 ms ae2.heltli-gw1.fi.elisa.net [139.97.6.246]
6 55 ms 55 ms 55 ms ae1-10.bbr2.hel1.fi.eunetip.net [213.192.191.49]
7 165 ms 164 ms 118 ms so2-3-0-0.bbr1.nyc1.us.eunetip.net [213.192.191.174]
8 790 ms 911 ms 968 ms nyc1.ge11-2.core2.dca2.hopone.net [66.36.224.209]
9 170 ms 169 ms 224 ms vl3.msfc1.distb2.dca2.hopone.net [66.36.224.245]
10 216 ms 172 ms 236 ms 66.36.240.92
This next traceroute appears to be very similar to the previous, but upon closer inspection the connection "all the way through" to your server seems to be optimal. With an actual sustained network event, all hops after a problematic router will carry the latency of the affected hop, along with any new delays incurred. This traceroute seems to suggest that the delay was incurred by how long it took the router to process the UDP ping (traceroute ping) response as opposed to how long it took to forward the packet. If the latency was actually 800ms on hop #7, hops 8 and 9 should show a minimum of 800ms:
1
2 80 ms 79 ms 84 ms cr1.cmpri.uk.easynet.net [87.87.251.186]
3 75 ms 76 ms 74 ms 80.238.46.161
4 92 ms 95 ms 82 ms bu4.er10.txlon.ov.easynet.net [89.200.135.142]
5 81 ms 82 ms 81 ms bu4.gr10.bllon.uk.easynet.net [89.200.135.143]
6 149 ms 149 ms 150 ms te0-0-0-1.gr10.bwnyc.us.easynet.net [87.86.77.105]
7 * 807 ms 803 ms nyc1.ge11-2.core2.dca2.hopone.net [66.36.224.209]
8 157 ms 155 ms 157 ms vl2.msfc1.distb1.dca2.hopone.net [66.36.224.228]
9 155 ms 157 ms 155 ms 66.36.240.92
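The reasoning above (a real forwarding delay must be carried by every later hop) can be sketched in Python. This is only an illustration under assumptions: the function names and the `factor=3.0` cutoff are hypothetical, and the parser only handles the three-probe output format shown in this ticket. A hop is flagged when its best reply time is far above the best reply time of some later hop, which suggests the router was slow to answer the probe rather than slow to forward traffic.

```python
TRACE = """\
1
2 80 ms 79 ms 84 ms cr1.cmpri.uk.easynet.net [87.87.251.186]
3 75 ms 76 ms 74 ms 80.238.46.161
4 92 ms 95 ms 82 ms bu4.er10.txlon.ov.easynet.net [89.200.135.142]
5 81 ms 82 ms 81 ms bu4.gr10.bllon.uk.easynet.net [89.200.135.143]
6 149 ms 149 ms 150 ms te0-0-0-1.gr10.bwnyc.us.easynet.net [87.86.77.105]
7 * 807 ms 803 ms nyc1.ge11-2.core2.dca2.hopone.net [66.36.224.209]
8 157 ms 155 ms 157 ms vl2.msfc1.distb1.dca2.hopone.net [66.36.224.228]
9 155 ms 157 ms 155 ms 66.36.240.92
"""


def parse_hop(line):
    """Parse one hop line into (hop, [rtt or None, ...], host), or None."""
    tokens = line.split()
    if len(tokens) < 2 or not tokens[0].isdigit():
        return None  # blank, incomplete, or non-hop line
    rtts, i = [], 1
    while i < len(tokens) and len(rtts) < 3:
        if tokens[i] == "*":  # probe got no reply
            rtts.append(None)
            i += 1
        elif i + 1 < len(tokens) and tokens[i + 1] == "ms":
            rtts.append(float(tokens[i].lstrip("<")))
            i += 2
        else:
            break
    if not rtts:
        return None
    return int(tokens[0]), rtts, " ".join(tokens[i:])


def parse_trace(text):
    """Parse every usable hop line in a block of traceroute output."""
    return [h for h in map(parse_hop, text.splitlines()) if h is not None]


def best_rtt(rtts):
    """Lowest answered round-trip time for a hop, or None if all were lost."""
    answered = [r for r in rtts if r is not None]
    return min(answered) if answered else None


def response_only_spikes(hops, factor=3.0):
    """Hops whose latency is NOT carried forward by the hops after them."""
    flagged = []
    for idx, (hop, rtts, host) in enumerate(hops):
        here = best_rtt(rtts)
        later = [best_rtt(r) for _, r, _ in hops[idx + 1:]]
        later = [x for x in later if x is not None]
        if here is None or not later:
            continue
        # A sustained delay would keep every later hop at or above `here`.
        if here > factor * min(later):
            flagged.append((hop, host, here, min(later)))
    return flagged


flagged = response_only_spikes(parse_trace(TRACE))
for hop, host, here, later in flagged:
    print(f"hop {hop} ({host}): {here:.0f} ms, but a later hop answers in {later:.0f} ms")
```

For the trace above only hop 7 is flagged, which matches the reading given in the text: the roughly 800 ms figure does not show up on hops 8 and 9.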
The next traceroute is nearly identical to the one above:
1 1 ms 1 ms 1 ms 192.168.1.3
2 3 ms 1 ms 1 ms 192.168.0.1
3 27 ms 39 ms 26 ms fe0-0-c5.BG.YU.yubc.net [212.124.160.37]
4 26 ms 26 ms 32 ms ge-0-2-0-0-j0.BG.YU.yubc.net [212.124.160.62]
5 * 68 ms 29 ms YUBC-M10.telekom.yu [195.178.34.21]
6 29 ms 30 ms 35 ms 212.200.232.57
7 29 ms 23 ms 23 ms 212.200.227.249
8 83 ms 37 ms 31 ms PO9-0.bud-001-access-300.interoute.net [84.233.170.165]
9 132 ms 142 ms * xe-3-1-0-0.bud-001-score-1-re0.interoute.net [84.233.147.93]
10 131 ms 154 ms 129 ms ae2-0.prg-001-score-2-re0.interoute.net [84.233.138.213]
11 899 ms 132 ms 187 ms ae0-0.prg-001-score-1-re0.interoute.net [84.233.138.205]
12 131 ms 128 ms 141 ms ae2-0.fra-006-score-2-re0.interoute.net [84.233.138.210]
13 148 ms 129 ms 133 ms ae0-0.fra-006-score-1-re0.interoute.net [84.233.207.93]
14 131 ms 128 ms 135 ms ae1-0.ams-koo-score-2-re0.interoute.net [84.233.190.49]
15 128 ms 128 ms 132 ms ae0-0.ams-koo-score-1-re0.interoute.net [84.233.190.1]
16 133 ms 128 ms 130 ms ae1-0.lon-001-score-1-re0.interoute.net [84.233.190.58]
17 131 ms 127 ms 133 ms Gi0-0-0.lon-001-access-2.interoute.net [84.233.218.162]
18 133 ms 129 ms 127 ms PO6-0.nyc-002-access-1.interoute.net [212.23.43.149]
19 131 ms 129 ms 129 ms Gi7-0.nyc-002-access-3.interoute.net [212.23.43.138]
20 781 ms 777 ms 780 ms nyc1.ge11-2.core2.dca2.hopone.net [66.36.224.209]
21 134 ms 134 ms 135 ms vl3.msfc1.distb1.dca2.hopone.net [66.36.224.244]
22 138 ms 135 ms 135 ms 66.36.240.92
As is this one:
1 2 ms 1 ms 2 ms 192.168.1.1
2 5 ms 2 ms 2 ms 213.101.209.65
3 17 ms 4 ms 3 ms htg0-ncore-1.gigabiteth1-4.swip.net [130.244.205.125]
4 2 ms 2 ms 2 ms htg0-core-1.tengigabiteth1-0-0.swip.net [130.244.52.129]
5 2 ms 2 ms 2 ms kst-core-1.tengigabiteth8-0-0.swip.net [130.244.218.154]
6 8 ms 8 ms 8 ms gbg-core-1.pos8-0-0.swip.net [130.244.39.142]
7 25 ms 23 ms 23 ms 130.244.205.150
8 98 ms 99 ms 98 ms nyc9-core-1.pos8-0-0.swip.net [130.244.218.214]
9 777 ms 752 ms 746 ms nyc1.ge11-2.core2.dca2.hopone.net [66.36.224.209]
10 107 ms 105 ms 105 ms vl2.msfc1.distb2.dca2.hopone.net [66.36.224.229]
11 106 ms 106 ms 105 ms 66.36.240.92
By far the most telling and interesting traceroute I have seen thus far has to be this last one:
1 14 ms 9 ms 8 ms 10.42.64.1
2 22 ms 19 ms 11 ms osr01sand-v15.network.virginmedia.net [62.30.254.161]
3 17 ms 18 ms 18 ms osr02wolv-tenge74.network.virginmedia.net [62.30.254.77]
4 21 ms 15 ms 21 ms win-bb-b-ge-300-0.network.virginmedia.net [195.182.178.69]
5 21 ms 21 ms 20 ms gfd-bb-a-so-120-0.network.virginmedia.net [212.43.162.205]
6 24 ms 25 ms 21 ms gfd-bb-b-ae0-0.network.virginmedia.net [213.105.172.6]
7 19 ms 17 ms 21 ms redb-ic-1-as0-0.network.virginmedia.net [62.253.185.78]
8 114 ms 106 ms 108 ms ge1-1-9.core1.iad1.hopone.net [66.36.224.129]
9 757 ms 757 ms 769 ms ge11-1.core2.dca2.hopone.net [66.36.224.53]
10 124 ms 117 ms 105 ms vl3.msfc1.distb1.dca2.hopone.net [66.36.224.244]
11 110 ms 116 ms 105 ms 66.36.240.92
In this particular traceroute the packets enter our network at the hop "ge1-1-9.core1.iad1.hopone.net" which is a completely separate router in Ashburn, VA. From here they are sent (approximately 20 miles) to our router in McLean, VA. The trip from our router in Ashburn, VA to McLean, VA still seems to produce latency identically large to that which comes over the NYIIX connection, but this particular path doesn't go through New York at all.
The affected hop in this example is ge11-1.core2.dca2.hopone.net, and the affected hop in all of the other examples is nyc1.ge11-2.core2.dca2.hopone.net.
The "ge11" in this string means "card #11" in router core2.dca2.hopone.net, which happens to be a 3x Gigabit Ethernet card. It is possible that there is something physically wrong with this card that only manifests itself when certain other conditions are met. I have already started running advanced diagnostics on this aspect and hope to have more information by tomorrow.
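To make the observation concrete, the following Python sketch pulls the router name and card number out of the two affected hostnames. The function name `card_of` and the label pattern are assumptions made for illustration: the ticket only states that "ge11" means card #11, so the digits after the hyphen are deliberately ignored rather than interpreted.

```python
import re

# Assumed label shape: letters, a card number, then one or more "-<digits>" groups
IFACE_RE = re.compile(r"^([a-z]+)(\d+)((?:-\d+)+)$")


def card_of(hostname):
    """Return (router, interface_prefix, card_number), or None if no label matches."""
    labels = hostname.lower().split(".")
    for i, label in enumerate(labels):
        m = IFACE_RE.match(label)
        if m:
            # Everything after the interface label is taken as the router name
            return ".".join(labels[i + 1:]), m.group(1), int(m.group(2))
    return None


for name in ("ge11-1.core2.dca2.hopone.net",
             "nyc1.ge11-2.core2.dca2.hopone.net"):
    print(name, "->", card_of(name))
```

Both affected hostnames resolve to the same router and the same card number, which is the common factor pointed out above.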
Please note that every affected traceroute seems to use private peering contacts instead of global tier 1 providers. For example, all traces from Level3, GLBX, or other tier 1 transit providers which hand off directly to our core routers are unaffected. The specific peering sessions which seem to be affected by this are with:
easynet.net
eunetip.net
virginmedia.net
interoute.net
swip.net
If we cannot quickly resolve this trouble we can certainly start removing peering sessions from our NYIIX based router. This will cause a small increase in latency (as the routing will no longer connect directly into our network but rather force traffic to a global tier 1 transit provider) but should completely eliminate the instability and packet loss.
Before proceeding further, however, I would like to give an opportunity for the advanced diagnostics to run on our Cisco router. I should have more detail tomorrow.
Unfortunately the other trouble you've experienced (drops from the East Coast, etc.) seems to be completely unrelated to this "European peering" trouble and will need to be investigated separately.
Again please let me apologize for the delay in sending this response. We have been trying many different methods to alleviate your trouble without taking extreme measures, but it is now obvious that extreme measures will be required.
This ticket will certainly be left open and another update will be coming shortly.
As always if you have any further questions or requests please don't hesitate to ask.
Thanks!
- Cary
Updated by: Customer at Oct 02 19:14
Cary
Any additional news?
Updated by: Customer at Sep 29 01:15
Cary
Thanks for the update, and I just reread the entire matter from the beginning.
I am honestly hearing nothing but endless grief on my side from the more vocal users, and the frustration has leaked onto the help ticket and unfortunately onto you as well. So fwiw I'm a bit red faced that I didn't keep a wall between them and the issue ticket in my discussions with you. My sincere apologies for anything that made your job harder.
The kind words are still meant, believe it or not, and I want to personally thank you for the efforts expended on one little server, in one little rack, in one room, of your company.
*sigh*
The complaints have expanded to include packet loss, and a "bounce" in ping times for all users, including those in North America. Users are seeing ping times fluctuate randomly, for example from 40 to 180 ms for 30 to 60 seconds at a time, and this sometimes goes on for hours. Those with ping times as low as 10 ms (holy crap 10 ms!) cannot connect to the server, or time out.
I am aware that many times the issue is on their end with their ISP and packet shaping. I'm also aware that 100ms pings are far from bad coming out of Seattle to VA. In addition the CPU usage was spiking the other night while I was idling on the server watching performance... and this should not be happening given past experiences with the applications and services running on the server. While no major update went live recently I cannot swear to any minor changes, and will look into the matter further on my end.
I'm letting you know this in the spirit of sharing information and full disclosure.
I look forward to your diagnosis.
David