- Why PKI Disaster Recovery is Different?
- What Fails When a CA Goes Down?
- Backup Scope: What Must be Protected
- Emergency Procedure: CRL Re-Signing
- RTO and RPO by PKI Component
- Failover Architectures
- DR Testing: Runbook and Schedule
- DR Considerations During CA Migration
- Choosing the Right DR Model
- How Encryption Consulting Can Help?
- Conclusion
Most organizations invest significant effort in building a PKI: designing the Certificate Authority (CA) hierarchy, securing root keys in an HSM, standing up OCSP responders, and configuring certificate templates. Far fewer invest the same rigor in planning what happens when that PKI fails.
A CA failure is not a hypothetical edge case. Hardware dies, data centers flood, and ransomware does not skip security infrastructure. When a CA goes offline, the consequences cascade fast: new certificates cannot be issued, VPN tunnels fail to establish, web servers serve untrusted certificates, and, critically, Certificate Revocation Lists (CRLs) stop being refreshed. Once a CRL expires, every application that performs revocation checking will start rejecting valid certificates, bringing down systems that have nothing to do with the CA itself.
This guide covers everything from backup scope and emergency procedures to failover architectures and DR testing runbooks, with practical guidance for both self-managed and managed PKI environments.
Why PKI Disaster Recovery is Different?
PKI disaster recovery has a property that most other DR scenarios do not: the clock is already running before you know something is wrong.
When a CA fails, it can no longer sign or publish CRLs. But CRLs already published remain valid for their configured validity period, typically 7 days for a base CRL and 24 hours for a delta CRL. This means that from the moment a CA goes offline, you have a fixed window to either restore the CA or use emergency procedures before applications begin failing.
Consider two scenarios:
Scenario A: Your issuing CA publishes a base CRL every 7 days and a delta CRL every 24 hours. The CA went offline 2 hours after publishing its latest delta CRL and 2 days after publishing its latest base CRL. You have approximately 22 hours before the delta CRL expires and relying parties can no longer validate it, and 5 days before the base CRL expires.
Scenario B: Your issuing CA publishes a base CRL every 7 days and no delta CRL. The base CRL was published 4 days ago. You have 3 days before certificate validation begins failing enterprise-wide.
The design of your CRL publication schedule directly determines your disaster recovery time budget. This is not just a configuration detail; it is a DR decision.
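The arithmetic behind these time budgets can be sketched in a few lines of Python. This is a minimal illustration, assuming a CRL expires exactly one validity period after it was published; it ignores any overlap period your CA may add. The function name `crl_time_budget` and the example timestamp are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def crl_time_budget(published: datetime, validity: timedelta, now: datetime) -> timedelta:
    """Return the time left before a CRL expires (negative if it already has)."""
    expires = published + validity
    return expires - now


# Arbitrary example moment at which the outage is being assessed.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)

# A delta CRL valid for 24 hours and published 2 hours ago leaves 22 hours.
scenario_a_delta = crl_time_budget(now - timedelta(hours=2), timedelta(hours=24), now)

# Scenario B: a base CRL valid for 7 days and published 4 days ago leaves 3 days.
scenario_b = crl_time_budget(now - timedelta(days=4), timedelta(days=7), now)
```

A negative result means the CRL has already expired and relying parties that enforce revocation checking may already be failing.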
The second unique property of PKI DR is key custody. Recovering a CA without the original private key is impossible. The backup of that key, and the ceremony around accessing it, must be planned and tested before you ever need it.
What Fails When a CA Goes Down?
Before designing a recovery strategy, it helps to understand exactly what breaks and in what order:
Immediately (minutes):
- New certificate issuance stops
- OCSP responders that rely on the CA’s live database stop returning authoritative responses
- Any enrollment services (NDES, web enrollment, SCEP) become unavailable
Within hours to days (depending on CRL validity window):
- Delta CRLs expire: applications performing revocation checking using delta CRLs begin failing
- Base CRLs expire: all relying-party applications that check revocation begin rejecting certificates
Long-term:
- The CA certificate itself may approach expiry if recovery is delayed for an extended period
- Trust chains for cross-certified or subordinate CAs are disrupted
Understanding this timeline lets you triage correctly. An OCSP outage without a corresponding CA failure is urgent but not catastrophic; a CA failure with an expired delta CRL is an immediate incident requiring all hands on deck.
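One way to make that triage rule explicit in an on-call tool is sketched below. The severity labels and the function name `triage` are hypothetical, and the intermediate "clock-running" level is an assumption added for illustration; only the two cases described above come from the timeline itself.

```python
from datetime import timedelta
from typing import Optional


def triage(ca_online: bool, ocsp_online: bool,
           delta_crl_remaining: Optional[timedelta]) -> str:
    """Map component state to a severity label (labels are illustrative)."""
    # None means no delta CRL is published at all.
    delta_expired = (delta_crl_remaining is not None
                     and delta_crl_remaining <= timedelta(0))
    if not ca_online and delta_expired:
        return "immediate"        # CA down and a CRL has already expired
    if not ca_online:
        return "clock-running"    # CA down, published CRLs still valid
    if not ocsp_online:
        return "urgent"           # OCSP outage without a CA failure
    return "normal"
```

For example, `triage(False, True, timedelta(hours=-1))` returns `"immediate"`, while `triage(True, False, timedelta(hours=20))` returns `"urgent"`.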
Backup Scope: What Must be Protected
A CA backup is only useful if it contains everything needed to restore the CA to a different server. The following components must all be included:
Root and Issuing CA Private Keys
The private key is the most critical and the most dangerous artifact in your PKI. It must be protected at rest using an HSM or, at minimum, a password-protected PKCS#12 archive stored in a physically secured, access-controlled location. Losing the private key means the CA can never be recovered; a new CA hierarchy must be built from scratch.
For organizations using an HSM, follow your vendor’s key backup and restoration procedures (Luna, nShield, and Utimaco each have specific HSM key backup workflows). A backed-up HSM key should be stored in a separate physical location from the primary HSM.
Certificate Database
The CA database contains a record of every certificate ever issued or revoked. Without it, you lose the complete issuance history and cannot reconstruct revocation information. For ADCS-based CAs, this is the Jet/ESE database located at:
%SystemRoot%\System32\CertLog
Registry Configuration
The CA’s configuration — including CRL settings, certificate extensions, validity periods, and publication points — is stored in the Windows registry under:
HKLM\SYSTEM\CurrentControlSet\Services\CertSvc\Configuration
Export this as a .reg file with every backup.
CAPolicy.inf
This file defines the CA’s certificate policy during installation. While it is not required to restore a running CA, it is essential documentation for rebuilding a CA if the private key is compromised.
Certificate Templates
In an Active Directory environment, certificate templates are stored in the AD Configuration partition, not on the CA server itself. However, any custom template definitions should be documented per-attribute so they can be recreated if Active Directory itself needs to be rebuilt.
CRL Files
Back up the current base CRL and delta CRL files. In an emergency, these files can be republished to a new distribution point or used as the basis for a CRL re-sign operation.
Encryption Consulting’s CA Backup Script automates this entire scope in a single scheduled PowerShell execution, including log truncation and event logging.
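For teams scripting this themselves, the core of the backup scope on an ADCS CA can be sketched as a list of commands. This is a minimal sketch, not the vendor script mentioned above: the function name, subdirectory names, and output file name are hypothetical, and the code only builds the command lines rather than running them. It assumes a software-protected key; with an HSM-backed key, `certutil -backupKey` does not export the key material and the HSM vendor's procedure applies instead.

```python
from pathlib import PureWindowsPath
from typing import List

# Registry path given in the Registry Configuration section above.
CERTSVC_CONFIG_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\CertSvc\Configuration"


def build_backup_commands(backup_dir: str) -> List[List[str]]:
    """Return argv lists for backing up the CA database, key, and registry."""
    root = PureWindowsPath(backup_dir)
    return [
        # Certificate database; the target directory is expected to be empty.
        ["certutil", "-backupDB", str(root / "db")],
        # CA certificate and private key as PKCS#12 (prompts for a password).
        ["certutil", "-backupKey", str(root / "key")],
        # CA configuration from the registry, overwriting an existing export.
        ["reg", "export", CERTSVC_CONFIG_KEY, str(root / "certsvc-config.reg"), "/y"],
    ]


# CAPolicy.inf and the current CRL files are ordinary files and can be
# copied into the same backup directory with any file-copy tool.
commands = build_backup_commands(r"D:\CABackup")
```

Each entry could be passed to `subprocess.run(cmd, check=True)` on the CA server from an elevated session; treat the exact switches as something to confirm against `certutil -?` on your Windows Server version.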
Emergency Procedure: CRL Re-Signing
CRL re-signing is the single most important emergency technique every PKI administrator should know and the one most commonly missing from DR runbooks.
When a CA fails and a CRL is approaching expiry, you do not necessarily need to restore the full CA to prevent a revocation outage. If you have a backup of the CA’s private key, you can use that key to re-sign an existing CRL and extend its validity period. This buys you additional time (hours or days) to complete the full CA restoration without impacting certificate validation enterprise-wide.
When to use it: When a CA is offline and a CRL is within its overlap window or has already expired.
What you need:
- A backup of the CA’s public/private key pair (PKCS#12 or HSM-backed key material)
- The most recent CRL file from your distribution point or backup
High-level steps:
- Import the CA key pair to a secure, temporary workstation
- Use certutil -sign (ADCS) or the equivalent in your PKI platform to re-sign the CRL with an extended validity period
- Publish the re-signed CRL to all configured distribution points (HTTP, LDAP)
- Monitor OCSP and CRL-checking applications to confirm they accept the renewed CRL
- Proceed with full CA restoration in parallel
CRL re-signing is not a substitute for CA recovery; it is a bridge that prevents a revocation outage while recovery is underway.
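The re-signing step can be sketched as a small helper that assembles the `certutil -sign` command line. This is a hedged sketch: the function name and file names are hypothetical, and the `now+dd:hh` validity argument reflects certutil's documented syntax as commonly used, so confirm it with `certutil -sign -?` on the workstation before relying on it. The helper only builds the command; it does not execute it.

```python
from typing import List


def build_crl_resign_command(old_crl: str, new_crl: str,
                             days: int, hours: int = 0) -> List[str]:
    """Build a certutil command that re-signs old_crl with a new validity period."""
    if days < 0 or not 0 <= hours < 24:
        raise ValueError("days must be >= 0 and hours must be in 0..23")
    if days == 0 and hours == 0:
        raise ValueError("the extension must be longer than zero")
    # "now+dd:hh" starts validity at the current time and runs for dd days, hh hours.
    validity = f"now+{days}:{hours:02d}"
    return ["certutil", "-sign", old_crl, new_crl, validity]


# Example: extend an issuing CA's CRL by 3 days (file names are placeholders).
cmd = build_crl_resign_command("IssuingCA.crl", "IssuingCA-resigned.crl", days=3)
```

The CA certificate and private key must already be available in the machine's certificate store (or through the HSM's provider) on the workstation where the command runs, which is why the key import step comes first.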
RTO and RPO by PKI Component
Not all PKI components have the same recovery requirements. Defining RTO (Recovery Time Objective) and RPO (Recovery Point Objective) per component helps prioritize effort and investment.
| Component | Typical RTO Target | Typical RPO Target | Notes |
|---|---|---|---|
| OCSP Responder | < 15 minutes | Near-zero | High-impact immediately; consider active-active OCSP |
| Issuing CA | < 1 hour | Last backup (daily or more) | CRL window determines criticality |
| Root CA | < 4 hours | Last backup | Offline Root CA adds ceremony time |
| CA Database | Restored with its CA (see Issuing CA and Root CA rows) | Last backup | Daily or continuous replication |
| Certificate Templates | Low urgency | Documented configuration | Stored in AD; recoverable if AD is healthy |
For organizations with a formal SLA or regulatory requirement, these targets should be documented in a PKI-specific Business Continuity Plan and reviewed annually.
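Targets like these are easier to review when recovery drill results are compared against them mechanically. The sketch below encodes the three time-based RTO targets from the table and reports any component whose measured recovery time did not come in under its target; the function name and the drill figures are hypothetical.

```python
from datetime import timedelta
from typing import Dict

# RTO targets taken from the table above (each is a "less than" target).
RTO_TARGETS: Dict[str, timedelta] = {
    "OCSP Responder": timedelta(minutes=15),
    "Issuing CA": timedelta(hours=1),
    "Root CA": timedelta(hours=4),
}


def rto_misses(measured: Dict[str, timedelta]) -> Dict[str, timedelta]:
    """Return the overrun for each component that failed to beat its RTO."""
    return {
        component: elapsed - RTO_TARGETS[component]
        for component, elapsed in measured.items()
        if component in RTO_TARGETS and elapsed >= RTO_TARGETS[component]
    }


# Hypothetical drill: OCSP recovered in 10 minutes, the issuing CA in 95.
drill = {"OCSP Responder": timedelta(minutes=10), "Issuing CA": timedelta(minutes=95)}
misses = rto_misses(drill)  # only the issuing CA overran, by 35 minutes
```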
Failover Architectures
Active-Passive DR CAs
The most common pattern for enterprise PKI is an active-passive deployment: a primary issuing CA handles all certificate issuance during normal operations, while one or more DR issuing CAs sit in a standby state, pre-configured and tested but not issuing certificates.
Key design principles for this architecture:
- DR CAs should be pre-installed and configured with the same certificate templates, CRL distribution points, and AIA extensions as the primary. The cost of standing up a DR CA during an incident, under pressure, is disproportionately high.
- Configuration synchronization must be explicit. When templates or policies change on the primary, those changes must be replicated to DR CAs. A common failure mode is discovering that a DR CA has a configuration that diverged from production six months ago.
- Test failover regularly. Issue a test certificate from the DR CA at least quarterly to confirm it is operational. A DR CA that has never been tested is not a DR CA — it is a liability.
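Configuration drift between the primary and a DR CA can be caught with a simple comparison of exported settings. The sketch below assumes each CA's settings have already been collected into a flat dictionary (for example, parsed from a registry export); how that dictionary is produced is outside the sketch, and the function name and sample values are illustrative.

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"  # marker for a setting present on only one CA


def config_drift(primary: Dict[str, Any], dr: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {setting: (primary_value, dr_value)} for every setting that differs."""
    drift = {}
    for name in sorted(primary.keys() | dr.keys()):
        p_val = primary.get(name, MISSING)
        d_val = dr.get(name, MISSING)
        if p_val != d_val:
            drift[name] = (p_val, d_val)
    return drift


# Illustrative settings: the DR CA still has an older delta CRL interval
# and lacks a template that was added to the primary later.
primary_cfg = {"CRLPeriodUnits": 7, "CRLDeltaPeriodUnits": 1, "Template:WebServerV2": "enabled"}
dr_cfg = {"CRLPeriodUnits": 7, "CRLDeltaPeriodUnits": 2}
drift = config_drift(primary_cfg, dr_cfg)
```

An empty result is the condition to assert in a scheduled check; anything else is a finding to resolve before the DR CA is needed.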
Active-Active Issuing CAs
For organizations with high certificate issuance volumes or strict availability SLAs, active-active is preferable. In this model, two or more issuing CAs share the load through DNS round-robin or a load balancer, and either can independently issue certificates.
This approach requires careful database strategy: each CA maintains its own certificate database, which means issuance history is split across CAs. CRL publication must account for this by having each CA publish its own CRL, and OCSP responders must be configured to query both.
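Because each CA keeps its own database, a revocation lookup has to be keyed by issuer as well as serial number. The sketch below illustrates that point with in-memory sets standing in for each CA's revocation data; the CA names, serial numbers, and function name are hypothetical.

```python
from typing import Dict, Set


def is_revoked(issuer: str, serial: int, revoked_by_ca: Dict[str, Set[int]]) -> bool:
    """Check a certificate against the revocation data of the CA that issued it."""
    if issuer not in revoked_by_ca:
        # No data for this issuer is not the same as "not revoked".
        raise KeyError(f"no revocation data loaded for issuer {issuer!r}")
    return serial in revoked_by_ca[issuer]


# Two active-active issuing CAs, each with its own revocation list.
revoked = {
    "Issuing CA 1": {0x1A2B, 0x3C4D},
    "Issuing CA 2": {0x5E6F},
}
```

Serial numbers are only unique per issuer, so `is_revoked("Issuing CA 2", 0x1A2B, revoked)` is `False` even though the same serial is revoked under the other CA.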
Offline Root CA Considerations
The Root CA in most enterprise PKI designs is kept offline, powered down, and physically secured when not in use. This dramatically reduces the attack surface but introduces ceremony considerations for DR operations.
A properly designed offline Root CA DR procedure includes:
- Two-person integrity (M-of-N key access): Use Shamir’s Secret Sharing or HSM-enforced quorum controls (e.g., “M out of N custodians must be present”) so that no single individual can access the Root CA private key
- Physical key ceremony documentation: Every time the Root CA is powered on, the steps taken should be recorded, witnessed, and archived for audit purposes
- Tested restoration path: At least once per year, walk through the full Root CA restoration procedure on isolated hardware to confirm that the offline media, key backups, and documentation are sufficient to rebuild the CA
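To show what an M-of-N scheme means in practice, here is a minimal sketch of Shamir's Secret Sharing over a prime field. It is an illustration of the idea only, not something to use for real Root CA key material: production ceremonies should rely on HSM-enforced quorum features or an audited implementation. The function names and the choice of prime are assumptions of this sketch.

```python
import secrets
from typing import List, Tuple

PRIME = 2**127 - 1  # a Mersenne prime; the secret must be smaller than this


def split_secret(secret: int, threshold: int, shares: int) -> List[Tuple[int, int]]:
    """Split secret into `shares` points; any `threshold` of them recover it."""
    if not 1 <= threshold <= shares:
        raise ValueError("need 1 <= threshold <= shares")
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range for the chosen prime")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(points: List[Tuple[int, int]]) -> int:
    """Recover the secret by Lagrange interpolation at x = 0."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total


# 3-of-5 example: five custodians, any three can reconstruct the value.
custodian_shares = split_secret(123456789, threshold=3, shares=5)
```

Any three of the five shares passed to `recover_secret` return the original value; fewer than three do not determine it. The modular inverse via `pow(den, -1, PRIME)` requires Python 3.8 or later.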
DR Testing: Runbook and Schedule
A DR plan that has never been tested is a hypothesis. The following schedule provides a practical baseline for PKI DR testing.
Monthly
- Verify that automated CA backups completed successfully and that backup files are accessible
- Confirm CRL and delta CRL publication are current on all distribution points
- Confirm OCSP responders are healthy and returning responses within SLA
- Review CA event logs for any errors or warnings
Quarterly
- Perform a full CA restore on an isolated test environment using the most recent backup
- Issue a test certificate from each DR/standby CA
- Confirm that the configuration between the primary and DR CAs is synchronized
- Validate that the CRL re-signing procedure can be executed successfully (using a non-production key)
Annually
- Execute a full failover simulation: take the primary issuing CA offline and confirm that the DR CA takes over without manual intervention
- Walk through the offline Root CA restoration ceremony
- Review and update the DR runbook for any infrastructure changes
- Review RTO/RPO targets against business requirements and adjust if needed
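A schedule like this only helps if missed cycles are noticed. The sketch below flags which test tiers are overdue given the date each was last completed; the day counts used for "monthly", "quarterly", and "annually" are approximations chosen for this sketch, and the function name and dates are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List

# Approximate cadence for each tier of the schedule above, in days.
CADENCE_DAYS = {"monthly": 30, "quarterly": 91, "annually": 365}


def overdue_tiers(last_run: Dict[str, date], today: date) -> List[str]:
    """Return the tiers never run or last run longer ago than their cadence."""
    overdue = []
    for tier, days in CADENCE_DAYS.items():
        last = last_run.get(tier)
        if last is None or today - last > timedelta(days=days):
            overdue.append(tier)
    return overdue


# Example: monthly checks are current, the quarterly restore is late,
# and the annual failover simulation has never been recorded.
history = {"monthly": date(2024, 5, 20), "quarterly": date(2024, 1, 15)}
late = overdue_tiers(history, today=date(2024, 6, 1))
```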
DR Considerations During CA Migration
CA migration introduces a temporary window where DR posture is degraded. The old CA may be decommissioned before the new CA’s DR environment is fully stood up. Specific mitigations:
- Do not decommission the old CA before the new CA’s DR CA is operational. The old CA should remain as a fallback until the new CA hierarchy has been proven stable.
- Ensure CRL continuity during migration. If CDP/AIA points are changing, the old CRL distribution points must remain accessible for the lifetime of any certificates issued by the old CA.
- Back up the old CA before any migration step. A backup taken immediately before migration begins is the most important recovery artifact you have.
- Define rollback criteria in advance. Agree on specific conditions that would trigger a rollback to the old CA before the migration begins, not during it.
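The CRL continuity requirement implies a concrete date: the old distribution points can only be retired after the last certificate issued by the old CA has expired. A minimal sketch of that calculation follows, assuming the expiry dates have been exported from the old CA's database; the function name and sample dates are hypothetical.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional


def earliest_cdp_retirement(not_after_dates: Iterable[datetime]) -> Optional[datetime]:
    """Return the latest expiry among issued certificates, or None if there are none."""
    dates = list(not_after_dates)
    return max(dates) if dates else None


# Expiry dates of certificates still outstanding from the old CA (examples).
outstanding = [
    datetime(2025, 3, 1, tzinfo=timezone.utc),
    datetime(2026, 11, 15, tzinfo=timezone.utc),
    datetime(2025, 9, 30, tzinfo=timezone.utc),
]
retire_after = earliest_cdp_retirement(outstanding)  # 2026-11-15
```

Until that date, the old CDP URLs need to keep serving a valid CRL, even if the old CA itself has been decommissioned.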
Choosing the Right DR Model
Self-Managed PKI
If you operate PKI in-house, the guidance in this article gives you the building blocks of a DR program. The minimum viable posture is:
- A scheduled, tested CA backup process covering all components listed above
- At least one pre-configured DR issuing CA
- A documented CRL re-signing procedure with the key material to execute it
- A quarterly test cycle
Managed PKI (PKIaaS)
For organizations that want DR guarantees without the operational overhead, a managed PKI service shifts the DR responsibility to the provider. Encryption Consulting’s PKI-as-a-Service includes proactive monitoring, active incident response, and a dedicated team available for DR scenarios — as demonstrated in our PKIaaS success story, where all global outage incidents were resolved within one hour.
Certificate Lifecycle Management
Regardless of whether PKI is self-managed or outsourced, having real-time visibility into your certificate inventory is a DR multiplier. CertSecure Manager provides a centralized, always-current view of every certificate in your environment — so that when recovery begins, you know exactly what was issued, what is expiring, and what services are at risk, without having to reconstruct that picture from a cold backup.
How Encryption Consulting Can Help?
Encryption Consulting’s PKI team designs and implements DR-ready PKI architectures that align with your organization’s RTO/RPO requirements and compliance obligations. Whether you are building DR into an existing PKI, planning a migration, or looking to offload the operational complexity entirely with PKIaaS, our team brings hands-on expertise across ADCS, cloud PKI, and multi-vendor environments.
- PKI Services: End-to-end PKI design, implementation, and DR planning
- PKI-as-a-Service: Fully managed PKI with built-in DR and 24/7 monitoring
- CertSecure Manager: Certificate lifecycle management with real-time inventory and expiry alerting
- CA Backup Script: Free PowerShell utility for automating comprehensive CA backups
- HSM-as-a-Service: High-assurance key protection with DR-ready HSM deployment
Contact us to discuss your PKI DR requirements, or request a demo to see CertSecure Manager in action.
Conclusion
PKI disaster recovery is not a one-time configuration task — it is an ongoing operational discipline. The key principles:
- Design CRL validity windows with DR in mind. The window between your last published CRL and its expiry is your recovery time budget.
- Protect the private key above all else. Every other component can be reconstructed; the key cannot.
- Know the CRL re-signing procedure before you need it. It is the most valuable emergency technique in PKI DR and the least commonly practiced.
- Test your DR. A DR CA that has never issued a test certificate is an untested assumption.
- Plan DR into every migration. A migration is a period of elevated risk; DR posture must be maintained throughout.
