Last Thursday evening and into Friday morning I worked on a server that had a hard drive crash. Another annoying example where a mirrored drive did no good at all. At any rate, I spent most of the night restoring from an image backup.
The next morning I had calls that some users could not log in to the domain. On one computer I unjoined and rejoined the computer to the domain and it worked, but that did not work on the others I tried. No user account could log in to these machines. I logged in as local admin and could see Kerberos errors in the event logs.
This environment has two domain controllers and both are Server 2003. The server that crashed was the secondary domain controller. On that server I could see a variety of error messages in the logs and it was incredibly slow. The master domain controller complained that the name of the secondary controller was invalid.
I didn't record all of the errors I found but here is the one that stood out and ultimately led me to the fix.
Note: So as to protect the identity of my client, let's refer to the Master Domain Controller as PARENT and the Secondary Domain Controller as CHILD.
Had this error on CHILD:
Event Type: Error
Event Source: NTDS Replication
Event Category: Replication
Event ID: 1864
Time: 2:51:22 AM
User: NT AUTHORITY\ANONYMOUS LOGON
This is the replication status for the following directory partition on the local domain controller.
Ran this command on both servers: repadmin /showreps
On CHILD it said authentication failure (meaning it could not authenticate to PARENT).
On PARENT it said "Target Principal Name is Incorrect" (meaning it could not authenticate the CHILD server).
Ran this command on CHILD:
netdom resetpwd /server:PARENT /userd:customerdomain\administrator /passwordd:adminpassword
This resets the machine account password for CHILD, both locally and on PARENT. The Kerberos keys the two controllers use to authenticate each other are derived from that password, so once it matches on both sides they can start talking to each other again.
I kept rerunning this command every few minutes on both servers: repadmin /showreps
After about 10 minutes they both began talking to each other again and all errors disappeared. After rebooting the client PCs that were having trouble we could then log in to the domain normally.
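
If you would rather not retype repadmin /showreps by hand while waiting, the check can be scripted. What follows is only a rough sketch in Python, not something I ran during this incident. The function names and the two-minute interval are made up for illustration, and the failure phrases are simply the two messages quoted above; I am assuming repadmin prints them in roughly that form, so adjust the list to whatever your output actually shows.

```python
import subprocess
import time

# Failure phrases taken from the messages quoted in this post.
# Assumption: repadmin prints them in roughly this wording.
FAILURE_MARKERS = ("target principal name is incorrect", "authentication failure")


def run_showreps():
    """Run repadmin /showreps on the local server and return its text output."""
    result = subprocess.run(
        ["repadmin", "/showreps"], capture_output=True, text=True
    )
    return result.stdout + result.stderr


def looks_healthy(output, markers=FAILURE_MARKERS):
    """Return True when none of the known failure phrases appear in the output.

    This is a crude text match: empty or unrelated output also counts as
    healthy, so read the real output yourself before trusting the result.
    """
    lowered = output.lower()
    return not any(marker in lowered for marker in markers)


def wait_for_replication(fetch=run_showreps, interval_seconds=120,
                         max_attempts=10, sleep=time.sleep):
    """Poll until the output looks clean.

    Returns the attempt number that succeeded, or None if every attempt
    still showed one of the failure phrases.
    """
    for attempt in range(1, max_attempts + 1):
        if looks_healthy(fetch()):
            return attempt
        if attempt < max_attempts:
            sleep(interval_seconds)
    return None
```

Calling wait_for_replication() on each controller would check every two minutes for up to ten tries. The fetch and sleep parameters are only there so the loop can be tried out with canned text on a machine that does not have repadmin installed.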