on 14-Jul-2011 16:59
Having worked on a few large Lync deployments recently, I have realized that there is still a lot of confusion around properly architecting the network for load balancing Lync Edge Servers. Guidance on this subject has changed from OCS 2007 to OCS 2007 R2 and now to Lync Server 2010, and it's important that care is taken while planning the design. It's also important to know that although a certain architecture may seem to work, it could be very far from best practice. I'll explain what I mean by that below.
The main purpose of Edge Services is to allow remote users (corporate, anonymous, federated, and so on) to communicate with other external and internal users, and vice versa. If you're looking to extend your Lync deployment to support communication with federated partners, public IM services, remote users and such, then you'll want to make sure you deploy your Edge Servers properly.
This post will discuss some requirements and best practices for deploying Edge Servers, and then we'll go into some suggested architectures. For this discussion, let's assume that there are 3 device types within your DMZ: your firewall, your BIG-IP LTM, and your Lync Edge Server farm.
Requirement 1: Your Edge Servers need at least 2 network interfaces: one or more dedicated to the external network, and one dedicated to the internal network. The external and internal interfaces need to be on separate IP networks.
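As a quick sanity check on Requirement 1, here is a minimal Python sketch that tests whether the external and internal interfaces sit on separate IP networks. The function name and all addresses are hypothetical (documentation and RFC 1918 ranges), not values from a real deployment.

```python
import ipaddress

def interfaces_on_separate_networks(external_ifaces, internal_iface):
    """Return True if no external interface shares an IP network with the internal one."""
    internal_net = ipaddress.ip_interface(internal_iface).network
    return all(
        not ipaddress.ip_interface(ext).network.overlaps(internal_net)
        for ext in external_ifaces
    )

# Hypothetical addressing: one external IP per service, one internal IP.
external = ["203.0.113.11/24", "203.0.113.12/24", "203.0.113.13/24"]

print(interfaces_on_separate_networks(external, "10.10.20.11/24"))   # True
print(interfaces_on_separate_networks(external, "203.0.113.50/24"))  # False
```

The second call fails the check because the "internal" address lands in the same /24 as the external interfaces.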
The Edge Server will host 3 separate external services: Access, Web Conferencing, and Audio/Video (A/V). If you plan on exposing all 3 services to remote users, you can either use one IP for all 3 services on each server and differentiate them by TCP/UDP port, or use a separate IP for each service and stick with standard ports.
Best Practice: This is more preference than best practice, but I like to use 3 separate IPs for these services. With alternative ports/port mapping, you can consolidate to a single IP, but unless you have a very specific reason for doing so, it's best to stick with 3 separate IPs. You do burn more IPs by doing this, but you'll have to use non-standard ports for certain services if you use a single IP, and this could lead to issues with network devices that expect certain traffic types on certain ports. Plus, troubleshooting, traffic statistics, and logging are all cleaner if you are using 3 separate IPs.
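To make the trade-off concrete, the sketch below models each service as a listener on an (IP, protocol, port) socket and reports any socket claimed by more than one service. Everything here is illustrative: the helper name, the addresses, and especially the remapped ports (5061 and 444) are assumptions chosen for the example, so check the values your own topology uses.

```python
from collections import defaultdict

def find_collisions(listeners):
    """Return {(ip, proto, port): [services]} for sockets claimed by more than one service."""
    claimed = defaultdict(list)
    for service, ip, proto, port in listeners:
        claimed[(ip, proto, port)].append(service)
    return {sock: svcs for sock, svcs in claimed.items() if len(svcs) > 1}

# Three IPs: every service can keep the standard TCP 443.
three_ips = [
    ("access",  "203.0.113.11", "tcp", 443),
    ("webconf", "203.0.113.12", "tcp", 443),
    ("av",      "203.0.113.13", "tcp", 443),
]

# One IP with the ports left unchanged: all three services want the same socket.
one_ip_same_ports = [(svc, "203.0.113.11", proto, port)
                     for svc, _ip, proto, port in three_ips]

# One IP with non-standard ports (assumed values, for illustration only).
one_ip_remapped = [
    ("access",  "203.0.113.11", "tcp", 5061),
    ("webconf", "203.0.113.11", "tcp", 444),
    ("av",      "203.0.113.11", "tcp", 443),
]

print(find_collisions(three_ips))          # {}
print(find_collisions(one_ip_same_ports))  # {('203.0.113.11', 'tcp', 443): ['access', 'webconf', 'av']}
print(find_collisions(one_ip_remapped))    # {}
```

The single-IP layout only works once two of the three services move off 443, which is exactly the non-standard-port cost described above.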
Requirement 2: Traffic that is load balanced to the Lync Edge servers needs to return through the load balancer. In other words, if the hardware load balancer sends traffic to an Edge Server, the return traffic from that Edge Server needs to flow back through the load balancer. There are 2 common ways to ensure that return traffic flows through the load balancer. You can…
So there are your two options, which I will refer to as Routing and SNATting. With Routing, your Edge Server will rely on its routing table to route the return traffic out through the load balancer. No obscuring of the source IP address will happen on the load balancer, but you will have to make sure your default gateway & routing tables are correct. With SNATting, you can ensure return traffic goes back through the load balancer and not have to worry about the routing table to take care of this. The drawback to SNATting is that the load balancer will obscure the source IP of the packet as it passes through the load balancer.
I will explain below why the SNAT idea is less than ideal, primarily for A/V traffic.
Best Practice: You can SNAT traffic to the Web Conferencing and Access services on the Edge Server, but do not SNAT traffic to the A/V Edge Service. By obscuring the client's IP address with SNAT, you limit the ability of the A/V service to connect clients directly to each other, and this matters when clients try to set up peer-to-peer communication, such as a phone call. When using SNAT, the A/V service will not see the client's true IP, so the likelihood of the Edge Server being able to orchestrate the 2 clients to communicate directly with each other is reduced to nil. You'll force the A/V service to use its fallback method, in which the P2P traffic has to use the A/V server as a proxy between the 2 clients. This 'proxy' fallback mode will still happen from time to time even when you're not SNATting at the BIG-IP (for example, multiparty calls will always use 'proxy'), but it's best to minimize the times that users have to rely on it. So even though SNATting connections to the A/V Edge Service will seem to work, it is far from desirable from a network perspective!
FYI - Every load balanced service in a Lync Environment (including Lync FE's, Directors, etc) can be SNAT'ed except for the A/V Edge Service.
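Here is a toy Python model of the difference between Routing and SNATting as seen from the Edge Server. It is only a sketch of the address rewriting, not of how a BIG-IP is configured; the `Packet` type, the function name, the service labels and the addresses are all made up for illustration.

```python
from typing import NamedTuple, Optional

class Packet(NamedTuple):
    src: str
    dst: str

# Per the guidance above: any load balanced Lync service may be SNAT'ed except A/V Edge.
NO_SNAT_SERVICES = {"av"}

def deliver_to_edge(packet: Packet, service: str, edge_ip: str,
                    snat_ip: Optional[str] = None) -> Packet:
    """Return the packet as the Edge Server receives it after load balancing."""
    if snat_ip is not None and service in NO_SNAT_SERVICES:
        raise ValueError(f"do not SNAT traffic to the {service} service")
    # The destination is always rewritten from the VIP to the chosen Edge Server.
    # The source is rewritten only when SNAT is in use.
    return Packet(src=snat_ip or packet.src, dst=edge_ip)

# Routing: the A/V service still sees the client's real address.
routed = deliver_to_edge(Packet("198.51.100.25", "203.0.113.13"), "av", "203.0.113.23")
print(routed.src)   # 198.51.100.25

# SNATting: the Access service sees the load balancer's SNAT address instead.
snatted = deliver_to_edge(Packet("198.51.100.25", "203.0.113.11"), "access",
                          "203.0.113.21", snat_ip="203.0.113.5")
print(snatted.src)  # 203.0.113.5
```

In the routed case the Edge Server must send its reply toward 198.51.100.25 via the load balancer, which is why the default gateway matters; in the SNAT case the reply goes back to the SNAT address automatically.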
Requirement 3: Certain connections will need to be load balanced to the Edge Services, while certain connections will need to be made directly to those Edge Services.
Best Practice: Make sure clients can connect to the Virtual IP(s) that load balance the Edge Services, and also directly to the Edge Servers themselves. Typically users will hit the load balancer on their first incoming connection and get load balanced, but if a user gets invited to a media session that has started on an Edge Server, the invite they receive will point them directly to that server. NAT awareness was built into Lync 2010 to help in environments where Edge Servers are deployed behind NATs. With NAT awareness enabled, Edge Servers will refer clients to their respective NAT address so that users are routed in correctly.
Do I need to use routable IPs on the external interface of my Edge Servers? Microsoft says you do, and I would recommend doing so if you can. I have worked on deployments where non-routable IPs were used (leveraging NATs to allow direct access) and have not run into any issues. Just be sure that the Edge Servers are aware of their NAT address.
Best Practice: Suggested Deployment "DNAT in, SNAT out" on the Load Balancer
"DNAT in, SNAT out" was derived from discussions with a certain MSFT engineer who helped me build this guidance. I'd love to give him credit (he knows Lync networking better than anyone I have ever talked to!!), but if I named this person, his/her phone would never stop ringing for architecture guidance!! Back to the subject: DNAT is destination NAT and SNAT is source NAT. If you keep to "DNAT in, SNAT out" for external-side Lync Edge traffic, your deployment will work! It sums it up very well!
So you're ready to architect your Edge Server deployment. Let's take all the information from above and build a deployment. Keep these things in mind...
External Side of the Edge Servers
-Plan for VIPs on your BIG-IP to load balance the 3 external services that your Edge Server Provides (Access, WebConferencing, A/V)
-Plan for direct (non-load balanced) access to your Edge Servers by external clients
-Plan a method to allow Edge Servers to make outbound connections (forwarding VIP or SNAT on BIG-IP)
-Point the Edge Server's Default Gateway to the Self IP of the BIG-IP
-Point the BIG-IP's Default Gateway to the Router
-Do not SNAT traffic to the A/V Services on the Edge Servers
If you use non-routable IPs on the external interfaces of the Edge Servers, create a NAT on the BIG-IP for each Edge Server. Make sure the Edge Servers are aware of these NAT addresses so they can hand them out to clients who need to connect directly to an Edge Server.
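The external-side checklist above can be sketched as a small plan validator. This is just one way to write the rules down so they can be checked before you build anything; the dictionary keys, function name and addresses are all hypothetical and do not correspond to any BIG-IP or Lync API.

```python
EXTERNAL_SERVICES = ("access", "webconf", "av")

def check_external_side(plan):
    """Return a list of checklist violations for a deployment plan; empty means none found."""
    problems = []
    vips = plan.get("vips", {})
    for svc in EXTERNAL_SERVICES:
        if svc not in vips:
            problems.append(f"no VIP planned for {svc}")
    if vips.get("av", {}).get("snat"):
        problems.append("SNAT is enabled on the A/V VIP")
    if not plan.get("direct_access"):
        problems.append("no direct (non-load balanced) access to the Edge Servers")
    if plan.get("outbound") not in ("forwarding_vip", "snat"):
        problems.append("no outbound method (forwarding VIP or SNAT)")
    edge_gw = plan.get("edge_gateway")
    if edge_gw is None or edge_gw != plan.get("bigip_self_ip"):
        problems.append("Edge default gateway is not the BIG-IP self IP")
    bigip_gw = plan.get("bigip_gateway")
    if bigip_gw is None or bigip_gw != plan.get("router_ip"):
        problems.append("BIG-IP default gateway is not the router")
    # With non-routable external IPs, every Edge Server needs a NAT address.
    if not plan.get("routable_external_ips", True):
        nat_map = plan.get("nat_map", {})
        for edge_ip in plan.get("edge_external_ips", []):
            if edge_ip not in nat_map:
                problems.append(f"no NAT address for {edge_ip}")
    return problems

plan = {
    "vips": {"access": {"snat": True}, "webconf": {"snat": True}, "av": {"snat": False}},
    "direct_access": True,
    "outbound": "forwarding_vip",
    "edge_gateway": "203.0.113.2",
    "bigip_self_ip": "203.0.113.2",
    "bigip_gateway": "203.0.113.1",
    "router_ip": "203.0.113.1",
    "routable_external_ips": True,
}
print(check_external_side(plan))  # []

# Same plan, but with SNAT switched on for A/V.
bad = dict(plan, vips=dict(plan["vips"], av={"snat": True}))
print(check_external_side(bad))   # ['SNAT is enabled on the A/V VIP']
```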
Internal Side of the Edge Servers
-Plan for VIPs on your BIG-IP to load balance ports 443, 3478, 5061, and 5062 on the internal interfaces of your Edge Servers
-Plan for direct (non-load balanced) access to your Edge Servers
-Make sure your Edge Servers have routes to the internal network(s)
-You can SNAT traffic to the internal interface of the Edge Servers
I'll leave you with an example of a fully supported configuration (i.e. using routable IP Addresses all around). Keep in mind, this is not the only way to architect this, but if you have the available public IP address space, this will work.
Wow… so much for a short post. I welcome any and all feedback, and I promise to update this post with new information as it comes in. I'll also augment this post with more details & deployments as I find time to write them up, so check back for updates. This may even end up as a guide some day!
Version 1.0 date 7/14/2011
Version 1.1 date 2/15/2011 - Fixed a few typos. Fixed some heinous formatting
I have Lync 2010 Edge Servers load balanced using the F5 LTM and have poor audio quality for external users. I have been stepping through the iApp for Lync to see if there is anything missing, with no luck. Your guide has been useful, however I am not sure what a DNAT is?
Can you confirm whether using the Lync iApp follows best practice, or is there something I should be adjusting, e.g. the load balancing method?
Dest Network 0.0.0.0%9, mask 0.0.0.0, service: all ports, all protocols, enabled on DMZ_VLAN only, Source addr translation None, Protocol Profile FastL4_Loosinit_LooseClose
Where the above FastL4_Looseinit_Loosclose profile has the settings; Parent : fastL4, Reset on Timeout: Enabled, Idle Timeout: Immediate, Loose Init: Enabled, Loose Close: Enabled
We then set the default gateway on the Edge server to the floating Self IP assigned to the F5 DMZ, and it all worked nicely.
Couple of gotchas we had on the way.
Error 1: Incorrectly setting the iApp to say Edge Internal Route is via the BIG-IP, when in fact the Edge routes directly to the internal VLAN via its internal interface, is bad. We could do external to external A/V calls, and internal to internal A/V, but internal to external or vice versa would not connect at all. We were getting the message “Call failed to establish due to a media connectivity failure when one endpoint is internal and the other is remote”.
Error 2: Found A/V calls to internal would work for 5 seconds, and then the session would drop after 35 seconds. Audio inbound would work for the full 35 seconds, and audio outbound would work for 5 seconds only before terminating with Network Failure. Error in logs was “Call terminated on a mid-call media failure where one endpoint is internal and the other is remote”. This was due to our initial setup of the wildcard VS to route outbound, which only accepted traffic from the primary IPs on the Edge Servers. Once we expanded the Source IP to include the entire Edge Server external IP range, traffic flowed with no problem.
Hope this helps someone else. Cheers
Thanks for the info...Great stuff!!! Can you elaborate more on what you mean by, "Best Practice: Suggested Deployment "DNAT in, SNAT out" on the Load Balancer." Thanks
We have implemented more or less successfully, but see a couple of things that are not in line with this guide: