Wednesday, June 25, 2008

Differential Analysis - WDS & DHCP Separation

This post outlines the issues and partial resolution that Tom and I uncovered while removing DHCP from a Windows Deployment Services (WDS) system and moving it to a separate system. The post is rather lengthy, so fair warning if you're seeking a complete solution to this problem: we haven't found one yet. There is a bulleted list of our take-aways and thoughts so far at the end of this post. The title includes "differential analysis" because Tom and I compared the functional states of two working environments with the non-functional state of our broken system to try to determine a solution.

A month ago, Ron and Tom set up an Active Directory domain to demonstrate the capabilities of WDS. A few weeks ago, Kristian and I added Server 2008 clustering capabilities to the AD environment. We will elaborate on this environment in a future post.

Yesterday, Tom and I wanted to move the DHCP service from the WDS server to the cluster to provide highly available DHCP. So we had two servers: one running WDS + DHCP (hereafter referred to as the WDS Server) and another running only DHCP (hereafter referred to as the DHCP Server). The goal was to split DHCP and WDS, so we copied the DHCP options from the WDS Server, shown in the picture below, to the fresh new DHCP Server.

Working Options from WDS Server



Options on the DHCP Server

We rebooted a workstation whose operating system had been deployed from our WDS Server prior to our WDS & DHCP split. The workstation churned along at the PXE screen and then displayed the following PXE error message:

PXE-E55 Proxy DHCP Service did not reply to request on port 4011


Uh oh. We called it a day.


Today, Tom and I revisited the problem by attaching some hubs to our imaging infrastructure and playing the packet capture game. The WDS server is 10.150.150.1 and the DHCP server is 10.150.150.23 -- the following DHCP scope options were configured when the issue was occurring.

Capture of the problem

The packet capture above shows the problem. The workstation going through the PXE process grabs an IP from the DHCP server and then sends a DHCP discover to port 4011 of the DHCP server. (Note that the error we receive on the workstation mentions port 4011.) Then, the DHCP server replies with an ICMP port unreachable message -- an active rejection of the packet.

So, when we noticed this, we knew the problem was going to be getting the workstation to send that second DHCP discover to the WDS server on port 4011 rather than back at the DHCP server. We captured the traffic for a working DHCP + WDS transaction thinking we could compare the working setup with our target setup.


Capture of the working DHCP+WDS transaction

So, we tried mucking with some settings on both the DHCP and WDS servers, based on the differences between the DHCP ACKs in the working capture (packet #31 in the capture of the working DHCP+WDS transaction) and the non-working capture (packet #23 in the capture of the problem). No combination of configuration changes led to a different error or to a success. The settings we messed with include DHCP Option 54 (Server Identifier), the "Do not listen on port 67" setting, and DHCP Option 66, which we changed to a non-existent IP address in the working environment to see if the change would break the system.
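The comparison we did by eye in the packet captures can be sketched in a few lines of Python. This is only an illustration of the idea: the function name `diff_options` is made up for this post, and the option values below are placeholders, not the actual contents of our captures.

```python
def diff_options(working, broken):
    """Return {code: (working_value, broken_value)} for each option that differs.

    A value of None means the option was absent from that DHCP ACK.
    """
    diffs = {}
    for code in sorted(set(working) | set(broken)):
        w, b = working.get(code), broken.get(code)
        if w != b:
            diffs[code] = (w, b)
    return diffs


# Placeholder values for illustration only; NOT copied from our captures.
working_ack = {54: "10.150.150.1", 60: "PXEClient", 66: "10.150.150.1"}
broken_ack = {54: "10.150.150.23", 60: "PXEClient", 66: "10.150.150.1"}

for code, (w, b) in diff_options(working_ack, broken_ack).items():
    print(f"option {code}: working={w!r} broken={b!r}")
```

Anything this kind of diff reports is a candidate setting to experiment with; anything it does not report can be ruled out as the cause.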

So we started searching Google some more and came across this Microsoft page. Microsoft tells us, "Important: Microsoft does not support the use of these options on a DHCP server to redirect PXE clients." Well, thanks, but no thanks.

Then we remembered that we have a working pxelinux environment. The pxelinux configuration files are served up by Microsoft's TFTPD and DHCP is offered by Microsoft's DHCP 2003 service. Further, the DHCP and TFTP servers are separate! (oh, and IT WORKS)

We decided to set up another capture session, this time monitoring our working pxelinux environment.


Capture of the working pxelinux DHCP+TFTP transaction

Then, Tom expanded the DHCP ACK and noticed DHCP option 43 was used!

DHCP option 43


So, Tom updated the DHCP server settings in our WDS environment accordingly.

Updated DHCP options (working!)

And, voila! The workstation in the WDS environment now directs TFTP GETs to the WDS server right after the DHCP transaction. Cool.


Capture of working target setup

So, from our experiment and our working pxelinux environment, it appears that when DHCP option 43 is present with a value of 010400000000FF, a PXE client immediately sends a TFTP GET to the server named in DHCP option 66 for the file named in DHCP option 67.

We wanted to make sure, so we changed DHCP option 66 to a non-existent IP address, and the workstation failed with the message: PXE-E11 ARP Timeout. A capture of this event showed that the workstation received an address and then sent ARP requests for the non-existent IP address. This further supported our claim about DHCP option 43.

Capture of ARP Timeout

Re-inspection of the expanded DHCP option 43 in Wireshark shows the sub-option PXE mtftp IP with an all-zero value: 01 is the sub-option number, 04 is the length, 00000000 is the address (0.0.0.0), and FF marks the end. We're somewhat confused about what this sub-option means, although we've already hypothesized and demonstrated what it accomplishes in our PXE environment. A simple Google search for "PXE Specification" finds a document that might explain what this stuff means.
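To make the byte layout of option 43 concrete, here is a minimal Python sketch that splits the hex string into its sub-options. It assumes the usual tag/length/value encoding with FF as the end marker; the function name `parse_option43` and the one-entry name table are our own, for illustration.

```python
import ipaddress

# Only the sub-option seen in our capture is named here.
PXE_SUBOPTION_NAMES = {1: "PXE mtftp IP"}


def parse_option43(hex_value):
    """Split a hex-encoded option 43 payload into (tag, length, value) tuples."""
    data = bytes.fromhex(hex_value)
    suboptions = []
    i = 0
    while i < len(data):
        tag = data[i]
        if tag == 0xFF:  # end marker
            break
        if i + 1 >= len(data):
            raise ValueError("truncated sub-option header")
        length = data[i + 1]
        value = data[i + 2 : i + 2 + length]
        if len(value) != length:
            raise ValueError("truncated sub-option value")
        suboptions.append((tag, length, value))
        i += 2 + length
    return suboptions


for tag, length, value in parse_option43("010400000000FF"):
    name = PXE_SUBOPTION_NAMES.get(tag, "unknown")
    shown = ipaddress.IPv4Address(value) if length == 4 else value.hex()
    print(f"sub-option {tag} ({name}), length {length}: {shown}")
```

For our value this yields a single sub-option 1 carrying the address 0.0.0.0, which matches what Wireshark displays.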

So, we tried to actually boot into PE 2, but it failed with the message:

WDSClient: There is a problem initializing WDS mode

Suck. The clear difference between our target environment and the working WDS environment is that, in our target environment, a second DHCP request/ACK doesn't occur. In the working environment, the ACK in this second exchange contains DHCP option 252, Proxy Autodiscovery. A few more captures of the working WDS environment showed that this value changes per DORA/RA scenario.


DHCP Option 252

It looks like we'll have to do some more digging into how WDS dynamically creates BCD files, etc. Expect another post regarding our end environment in the future.

Remaining thoughts:
  • Do we lose any functionality by removing DHCP from the WDS server and implementing it elsewhere?
    • Are there automatic changes to Option 67 by the WDS server?
    • Are there other lost functions we don't know about or can't think of now?
      • Probably
  • The target ending architecture includes WDS outside of the high availability cluster.
    • Can we distribute WDS across the cluster nodes, and use network load balancing to make TFTP via WDS highly available in a similar sense as clustered high availability?
      • We shall see...
      • Could this solve our dynamic BCD creation issues?
Lessons of the day:
  • Differential analysis -- the comparison of system states -- to solve problems is strong and effective. Not only can it be used in cryptanalysis or other math-oriented problem solving situations, it can be used in system administration. Thankfully, RIT's ANSA degree program taught us how to read packet captures.
  • Sitebooks are great!
    • We had documentation about this DHCP option 43 for our pxelinux environment, but we didn't look at it. In the old documentation, we should have sought to understand what the option accomplished for our pxelinux environment.
    • This post is a sitebook!
Procedures:
  • To detach DHCP from your WDS server, you need the following DHCP options defined in the new DHCP service
    • Predefined Option 43 - 010400000000FF
    • Custom-made Option 60 - String - PXEClient
    • Predefined Option 66 - IP or Hostname of the WDS Server
    • Predefined Option 67 - filename in WDS for architecture ( in our case it was boot\x86\pxeboot.com )
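For reference, the four options in the list above can be sketched as they would appear on the wire, using the standard one-byte code, one-byte length, then value layout for DHCP options. This is only an illustration of the encoding, not a description of how Microsoft's DHCP service stores its settings; `encode_option` is a name invented for this post, and the server address and boot file are the ones from our environment.

```python
def encode_option(code, value):
    """Encode one DHCP option as code byte + length byte + value bytes."""
    if isinstance(value, str):
        value = value.encode("ascii")
    if not 0 < code < 255:
        raise ValueError("option code must be between 1 and 254")
    if len(value) > 255:
        raise ValueError("option value is too long for a one-byte length")
    return bytes([code, len(value)]) + value


# Values from our environment; substitute your own WDS server and boot file.
options = [
    (43, bytes.fromhex("010400000000FF")),  # vendor-specific information
    (60, "PXEClient"),                      # class identifier
    (66, "10.150.150.1"),                   # IP or hostname of the WDS server
    (67, "boot\\x86\\pxeboot.com"),         # boot file name
]

blob = b"".join(encode_option(code, value) for code, value in options)
print(blob.hex())
```

Comparing this byte string against the options section of a DHCP ACK in a capture is a quick way to confirm that the new DHCP server is handing out what you configured.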

4 comments:

c0re said...

hey guys!
u missed some documentation:
u need options 60 only if u use DHCP+WDS on same server, but if they are on different servers u need to use only 66 and 67 options, leaving 60 option unset

that's working in my invirement

cheers, c0re

skyman said...

option 60 means check local dhcp server for tftp...don't use if ris/wds and dhcp are on different servers

Urobe's Memoiren said...

hey guys!
Just want to say:
THANK YOU!!!!!

wish you all the best
Edi Pfisterer/Austria

PS: for me, its working fine WITH option 60 (PXE-Client)

Jon Anderson said...

I had this same problem. It was fixed by removing option 60 from the DHCP server, rather than adding additional configurations.