Team Leader - Nutanix Technology Champion - Nutanix NTC Storyteller

Julien DUMUR
Infrastructure in a Nutshell

Let’s be honest: shutting down a complete Nutanix cluster is always a bit stressful. Even after 15 years in the business. Why? Because even with the best HCI technology on the market, cutting the power on an IT infrastructure is never trivial.

I’ve seen too many “cowboys” pull the plug or perform a brutal “Shutdown” via IPMI, thinking data resiliency would handle the rest. Spoiler alert: this often ends with Level 3 Nutanix support on the line to recover corrupt Cassandra metadata or with the loss of one or more disks.

This guide is my lifeline to ensure my cluster restarts without issues. No GUI, no Prism Element for the critical steps. We open the terminal, connect via SSH, and do it properly.

Phase 1: Health Checks

Before even thinking about stopping a single VM, you must ensure the cluster is capable of stopping (and more importantly, restarting). If your cluster is already suffering, shutting it down is not always a good option.

1.1 SSH Connection to the CVM

Open your favorite terminal (PuTTY works just fine) and connect via SSH to the cluster’s virtual IP address (Cluster VIP) with the user nutanix.

1.2 Nutanix Cluster Check (NCC)

To ensure the cluster is healthy, run a full NCC check:

ncc health_checks run_all

My advice: Don’t just skim through the report. If you have a “FAIL” on Cassandra, Zookeeper, or Metadata, STOP. Fix it before shutting down. A warning about a full disk or an old NTP alert is acceptable. But data integrity is non-negotiable.

1.3 Resiliency Verification

The Prism dashboard is pretty; it tells you “Data Resiliency Status: OK”. That’s good, but it’s not precise enough for a total shutdown. I want to know if my data is truly synchronized, right now.

Type this command and read the output carefully:

ncli cluster get-domain-fault-tolerance-status type=node

What you need to see: A line indicating Current Fault Tolerance: 1 (for RF2) or 2 (for RF3), matching your replication factor.

If you see a state indicating a rebuild in progress, do not shut down the cluster; wait for the rebuild to finish.
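If you want to script this gate before a maintenance run, here is a minimal sketch. The FT_LINE variable is an illustrative sample of the output line, not captured from a real cluster; on a CVM you would feed the actual output of ncli cluster get-domain-fault-tolerance-status type=node into the same parsing.

```shell
#!/usr/bin/env bash
# Sketch: abort the shutdown procedure if fault tolerance is below target.
# FT_LINE is a hypothetical sample; on a CVM, capture the real command output.
FT_LINE="    Current Fault Tolerance : 1"

# Extract the numeric value after the colon, stripping whitespace.
ft=$(echo "$FT_LINE" | awk -F: '/Current Fault Tolerance/ {gsub(/[[:space:]]/, "", $2); print $2}')

if [ "$ft" -lt 1 ]; then
  echo "Fault tolerance is $ft: do NOT shut down"
  exit 1
fi
echo "Fault tolerance OK ($ft)"
```

Adjust the awk pattern if your AOS version formats the line differently.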

Phase 2: Shutting Down Workloads

Once the cluster is validated as healthy, we move on to the virtual machines. The classic mistake is rushing to stop the nodes; besides, the cluster stop command will be refused if virtual machines are still running on the cluster.

2.1 The Battle Order

Start by shutting down your test/dev environments, then application servers, and finally databases. It’s common sense, but it’s always good to be reminded.

Once all production machines are off, you can now shut down the remaining “tooling” VMs of your infrastructure: AD, DNS, firewalls…
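If you have many VMs, the battle order itself can be scripted. A minimal sketch, with hypothetical VM names: on AHV you would replace the echo with an acli vm.shutdown call (graceful guest shutdown), which is left commented here because it requires a live cluster.

```shell
#!/usr/bin/env bash
# Ordered shutdown pass: dev first, then apps, then databases, then tooling.
# The VM names below are hypothetical placeholders.
for vm in dev-vm01 app-vm01 db-vm01 dns-vm01; do
  echo "shutdown requested: $vm"
  # acli vm.shutdown "$vm"   # uncomment on a real AHV cluster
done
```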

2.2 Managing Prism Central

Connect to Prism Central via SSH with the nutanix account, then run the stop command:

cluster stop

Wait for the PCVM services to stop and verify that the cluster is indeed stopped:

cluster status

If all services are stopped and the cluster status is “stop”, we can now proceed to shut down the PCVM:

sudo shutdown -h now

Phase 3: Stopping Nutanix Services (“Cluster Stop”)

Your VMs and Prism Central are off. Your hosts are running nothing but CVMs (Controller VMs). This is the critical moment. We never perform an OS shutdown of the CVMs without first stopping the cluster services properly.

Why? Because a brutal shutdown of CVMs can lead to data corruption or metadata inconsistencies that might require support intervention.

3.1 Stopping the Cluster

Reconnect to your Nutanix cluster VIP and simply type:

cluster stop

The system will ask for confirmation before launching operations. Type Y.

This command orders each CVM to stop its services in a precise order. The Stargate service (which handles storage I/O) ensures everything is “flushed” to disk before shutting down.

You will see lines scrolling by indicating the stop of Zeus, Scavenger, Cassandra, etc. Be patient. Depending on the cluster size, this can take 2 to 5 minutes.

3.2 Verification

Once the operation is complete, check the actual state of services:

cluster status

What you need to see: A list of services for each CVM. They must all be in the DOWN state, with the potential exception of the Genesis service which may remain UP; this is normal.

If you see other services still UP, wait a minute and run the check again. Do not proceed until the cluster is logically fully stopped.
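This check, too, can be scripted. A minimal sketch: the STATUS variable holds an illustrative excerpt of the service list (the exact cluster status layout varies by AOS version); on a CVM you would pipe the real command output instead.

```shell
#!/usr/bin/env bash
# Illustrative excerpt; on a CVM you would use: STATUS=$(cluster status)
STATUS='Zeus DOWN
Cassandra DOWN
Stargate DOWN
Genesis UP'

# Count services still UP, ignoring Genesis (which may legitimately stay up).
still_up=$(echo "$STATUS" | grep -v 'Genesis' | grep -c 'UP' || true)

if [ "$still_up" -eq 0 ]; then
  echo "Cluster is logically stopped: safe to shut down the CVMs."
else
  echo "$still_up service(s) still UP: wait and re-check."
fi
```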

Phase 4: Shutting Down CVMs and Physical Nodes

We are at the end of the tunnel. The cluster is logically stopped. Only empty shells remain: the CVMs (which are Linux VMs, let’s not forget) and the hypervisors.

4.1 Stopping CVMs

You must now connect to each CVM individually (via its IP, no longer via the VIP) and run the shutdown command.

The official command:

cvm_shutdown -P now

The cvm_shutdown command contains specific hooks to notify the hypervisor. Repeat the operation on each node of the cluster.

4.2 Stopping Hypervisors

Once the CVMs are off, connect to your hosts (via SSH or IPMI) and on each of them type the following command:

shutdown -h now

The Expert Nugget: The Automation Script ⚡

Do you have a 16-node cluster and don’t feel like connecting 32 times (16 CVMs + 16 hosts)? I get it.

Here is a script to run from any CVM in the cluster that will shut down all CVMs, then all AHV hosts.

⚠️ WARNING: This script asks no questions. Ensure you have validated Phase 3 (cluster stop) before launching this, otherwise, a crash is guaranteed.

The “Kill Switch” Script (For AHV)

From a CVM, this script retrieves the IPs of other CVMs and hosts, then sends the shutdown order in sequence.

for svmip in $(svmips); do ssh -q nutanix@$svmip "sudo /usr/sbin/shutdown +1 ; hostname"; done
for hostip in $(hostips); do ssh -q root@$hostip "/usr/sbin/shutdown +3 ; hostname"; done
  • The first command orders the shutdown of the CVMs after a one-minute delay.
  • The second command orders the shutdown of the hosts after a three-minute delay.

Once you have launched the commands, you will lose connection after one minute. You can then monitor the shutdown of your nodes from their respective IPMI interfaces.

Phase 5: Powering Back Up (Cold Boot)

The maintenance period is over. What do we do? Press ON and pray? No, we follow the reverse order.

  1. Physical Network: Turn on your Top-of-Rack switches first. If the network isn’t there, the nodes won’t see each other upon booting.
  2. IPMI / Physical: Turn on the physical nodes.
  3. Patience: AHV will boot, then automatically start the CVM.
    • Tip: Don’t touch anything for 10 minutes. Let the CVMs form the cluster.
  4. Starting the Cluster: Connect via SSH to a CVM. Verify that all CVMs are up (svmips should list them all), then run: cluster start
  5. Verify that the cluster has started properly with the command: cluster status
  6. Starting Workloads: Once the cluster is UP, power on the PCVM first, then your VMs (infrastructure first, applications second).
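The wait between steps 4 and 5 can be automated with a polling loop. A minimal sketch, assuming cluster status still prints DOWN lines while services are starting (verify the layout on your AOS version):

```shell
#!/usr/bin/env bash
# Poll until "cluster status" no longer reports any DOWN service.
# On a machine without the Nutanix CLI, the loop exits immediately.
attempts=0
while cluster status 2>/dev/null | grep -q 'DOWN'; do
  attempts=$((attempts + 1))
  echo "services still starting (check #$attempts), sleeping 30s..."
  sleep 30
done
echo "no DOWN services reported: cluster is up"
```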

Conclusion

Shutting down a Nutanix cluster is not complicated, but it demands strict sequencing and doesn’t forgive impatience. If you follow these steps, you’ll sleep soundly during the power outage.
