Team Leader - Nutanix Technology Champion - Nutanix NTC Storyteller

Julien DUMUR
Infrastructure in a Nutshell

Those who’ve been following the blog for a while surely remember my “Maxi Best-Of Nutanix CLI” series. I loved writing it, and obviously, it saved the day for some of you more than once. But with the announced end of SSH connections on Nutanix clusters, the days of firing up PuTTY to SSH into a CVM (Controller VM) and typing our ncli or acli commands as we please are over. Today, with the rise of Zero Trust architectures and STIG (Security Technical Implementation Guides) compliance requirements from the US DoD, locking down low-level access is no longer a paranoid option: it’s the production standard.

Why is the API becoming the only master on board?

Let’s be very clear: the REST API is the admin’s essential new Swiss Army knife.

It started with the progressive hardening of our infrastructures, notably with cluster lockdown, but the verdict officially came down in late January 2026: Nutanix announced the End of Support Life (EOSL) of “Bash Shell Access”. The reason is obvious: leaving direct, unrestricted Bash access to the underlying OS (whether on AOS, AHV, or even Prism Central) no longer makes any sense when it comes to guaranteeing security, auditability, and long-term support.

Here is the timeline I found in this document:

  • From AOS 7.0 to 7.5, we started seeing login warnings, an informational alert announcing the deactivation of SSH password authentication, and options to disable SSH manually.
  • Starting with the next major NCI release, Bash will be disabled by default. In its place: an ultra-restricted “SSH Service Menu” that lets you run a handful of acli/ncli commands for basic troubleshooting.
  • And by late 2026? The “SSH Service Menu” is in place, the Bash shell is disabled, and it can only be reactivated in “Support-Only” mode via a temporary token provided while handling a ticket with Nutanix support.

The only official, tracked, and complete entry point to interact with the infrastructure via command line is now the API.

Whether it’s the Prism Element API (v2.0) on an isolated cluster, or the Prism Central API (v3/v4) for multi-cluster management, there’s no escaping it: you have to dive in.

The API toolkit

Before we can blast our clusters with API requests, we need to gear up.

Day to day, I switch between two go-to tools: curl when I have a Linux/WSL terminal at hand (fast, raw, scriptable), and Postman when I need to explore the Nutanix APIs visually.

If your workstation isn’t ready yet, I’ll point you directly to the dedicated article I’ve already written on this topic: Configure your Windows PC to query Nutanix APIs (WSL & Postman).

However, even when well-equipped, I regularly see admins tearing their hair out over their first Prism Element requests because of two crucial details.

The SSL certificate: By default, a Prism Element cluster uses a self-signed certificate. If you send a standard request, it will be flatly rejected. The field reflex? Always add the -k (or --insecure) flag to your curl commands, and remember to disable the SSL certificate verification option in Postman's settings.

Authentication: The Prism Element v2.0 API relies on Basic Auth. Avoid passing your credentials in plain text in the URL of your request (like https://admin:MySuperPass@IP...). It inevitably ends up in plain text in the history or logs. In Postman, use environment variables!
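On the curl side, the same hygiene applies. Here is a minimal sketch of the pattern I use (the IP and credentials are placeholder values, and the final call is left commented so you can review it before firing it at a real cluster):

```shell
# Placeholder values: adapt them to your cluster. Prompting with read -s keeps
# the password out of your shell history.
export PRISM_IP="192.168.84.10"
export PRISM_USER="admin"
export PRISM_PASSWORD="ChangeMe"   # better: read -s -p "Password: " PRISM_PASSWORD

# -k skips self-signed certificate verification, -u sends Basic Auth.
url="https://${PRISM_IP}:9440/api/nutanix/v2.0/cluster"
echo "Target endpoint: ${url}"
# curl -k -u "${PRISM_USER}:${PRISM_PASSWORD}" "${url}"
```

Nothing sensitive ever lands in the URL, so neither your history nor a proxy log ends up holding your password.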

Exploring the API with the built-in REST API Explorer

How many times have you banged your head against the wall looking for the right syntax in a 500-page PDF documentation? With Nutanix, forget about that. The best documentation isn’t on the support portal; it’s directly embedded in your cluster.

Prism Element natively integrates an interface called REST API Explorer. To access it, it’s very easy: log in to your cluster’s web interface, click on your username at the top right, then select REST API Explorer. You can also type the URL directly: https://<CVM_IP>:9440/api/nutanix/v2/api_explorer/index.html.

The true power of this Swagger isn’t just listing the endpoints (GET /cluster, POST /vms…). It’s its ability to code for you! Fill in the required fields in the interface and click the “Try it out!” button. Not only does the interface execute the request and display the raw JSON response, but above all, it generates the complete and perfectly formatted curl command. It’s the ultimate hack to save time and avoid syntax errors.

Conclusion

I won’t hide it from you, moving from CLI to REST API took a little effort to adapt. At first, I fumbled around, I grumbled at a malformed header or a temperamental JSON. But once you’ve crossed that milestone, it becomes almost natural. The API opens the doors to large-scale automation and continuous integration. However, you unfortunately won’t be able to find equivalents for every single CLI command…

In the next article of this series, we’re going to get to the heart of the matter. Enough theory, we’ll tackle practice with our first concrete use case: The complete health check of a cluster.

Read More
OpenClaw on Nutanix AHV

If you read my previous article detailing the architecture and the technical stack I chose to deploy OpenClaw, you already know why I decided to run this solution on my Nutanix AHV cluster. Today, we’re getting practical! I will show you, step by step, how to deploy your own instance on a freshly installed Ubuntu virtual machine.

Before diving in, here is a quick reminder of my setup. I provisioned a VM on Nutanix AHV with:

  • 8 vCPUs
  • 32 GB of RAM
  • 250 GB of storage
  • an NVIDIA Tesla P4 graphics card in PCI Passthrough

💡 Why favor full Passthrough over vGPU (virtual GPU)? Quite simply to guarantee near “bare-metal” inference performance. By giving our VM direct and exclusive access to the physical hardware, we completely eliminate the overhead (latency) associated with the virtualization layer.

Let’s start the deployment.

Preparing the Ubuntu VM: System and NVIDIA Drivers

The very first step is to prepare the ground to deploy our AI.

Ubuntu 24.04: Operating System Update

This is a rule I apply every single time I deploy a new operating system. As soon as I connect via SSH, I make sure all packages are up to date to avoid future security flaws or dependency conflicts.

sudo apt update && sudo apt upgrade -y

GPU: Installing NVIDIA Drivers

For OpenClaw to harness the computing power of my Tesla P4, the operating system must be able to communicate with it properly. Here are the commands to run to install the drivers (you can access a more detailed guide on the blog):

sudo apt install nvidia-driver-535-server -y
sudo reboot

Once the machine has rebooted, we log back in and type the command to verify that our GPU is properly detected and ready to work:

nvidia-smi

Node.js and OpenClaw

Installing Node.js 22

OpenClaw is built on Node.js. To ensure we have a recent and efficient runtime environment (here version 22), we add the official NodeSource repository before launching the installation:

curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install -y nodejs

Basic OpenClaw Deployment

Now that NodeJS is in place, we move on to installing OpenClaw. A simple curl script provided by the developers does the heavy lifting:

curl -fsSL https://openclaw.ai/install.sh | bash

Once the installation is complete, the system automatically launches the configuration wizard for your instance. I will detail this step as well as the creation of API keys (Discord, Telegram, etc.) in a future blog post.

A small tweak is required right after the OpenClaw installation if we want to use the “openclaw” commands without friction: we need to add the local installation directory to our PATH environment variable (remember to adapt the username if you are not using administrateur):

export PATH="/home/administrateur/.npm-global/bin:$PATH"

💡 Why this step? It's an excellent security practice that I highly recommend. By putting ~/.npm-global/bin on the PATH, we avoid installing global NPM packages with root (sudo) privileges. This significantly reduces the attack surface and spares you the eternal Linux permission conflicts!
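For a persistent setup, the steps above can be sketched like this (the paths assume the default ~/.npm-global prefix from the install; adapt them if yours differs):

```shell
# One-time setup of a user-level npm prefix (path assumed from the install above).
NPM_PREFIX="$HOME/.npm-global"
mkdir -p "$NPM_PREFIX/bin"
# npm config set prefix "$NPM_PREFIX"   # run this once on the real VM

# Persist the PATH change for future shells, then apply it to the current one.
grep -qs '\.npm-global/bin' "$HOME/.bashrc" || \
  echo 'export PATH="$HOME/.npm-global/bin:$PATH"' >> "$HOME/.bashrc"
export PATH="$NPM_PREFIX/bin:$PATH"
```

With the line persisted in ~/.bashrc, the openclaw commands survive a reboot or a new SSH session.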

Cleanly Exposing OpenClaw with Caddy

By default, the OpenClaw web interface listens on port 18789. Instead of hitting that port directly, I always prefer to put a reverse proxy in front of my applications. For this lab, I went with Caddy.

sudo apt install -y caddy

💡 Why Caddy rather than Apache or Nginx? Because Caddy is formidably efficient. Where Nginx sometimes requires long configuration blocks for simple proxying, Caddy does the same job in literally three lines of code, all while being ultra-lightweight.

We edit its configuration file:

sudo vi /etc/caddy/Caddyfile

And we replace the entire content with the following instructions (replace the IP with the one of your VM, in my case 192.168.84.134):

192.168.84.134 {
    reverse_proxy 127.0.0.1:18789
}
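A variant of the same Caddyfile, for later: if you give the VM a DNS name on your LAN, the hostname below is purely hypothetical, and `tls internal` tells Caddy to issue the certificate from its own local CA instead of trying a public one:

```
openclaw.lab.example.com {
    tls internal
    reverse_proxy 127.0.0.1:18789
}
```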

Now, all that’s left is to restart the service so the proxy takes over:

sudo systemctl restart caddy

Network Security: Locking Down the OpenClaw Instance

Having a functional instance is good; securing it is essential. Even on your local network (LAN), you should never leave open access to your control interface. We are going to apply a strict configuration via the OpenClaw CLI.

We start by restricting the gateway to listening on the local loopback, to prevent any direct access:

openclaw config set gateway.bind loopback

We then force the operating mode to local, and activate token authentication (the bare minimum):

openclaw config set gateway.mode local
openclaw config set gateway.auth.mode token

Finally, since we are going through Caddy, we must authorize Cross-Origin requests (CORS) coming from our IP address, otherwise the browser will block the page (don’t forget to adapt the IP):

openclaw config set gateway.controlUi.allowedOrigins '["https://192.168.84.134"]'

We restart the service to apply our lockdown:

openclaw gateway restart

💡 The security pattern applied here is akin to local “Zero Trust”. By forcing OpenClaw on the loopback (127.0.0.1), we ensure that absolutely all traffic is forced to go through our Caddy proxy. Coupled with CORS filtering and authentication, we provide a baseline protection for our instance against potential scans or malicious scripts on the network.

First Contact and Configuration Validation

Retrieving the Access Token

Now that the doors are locked, we need the key. The authentication token was automatically generated during installation. We’re going to go fish it directly out of the JSON configuration file:

grep -i token ~/.openclaw/openclaw.json
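If you prefer a clean value rather than a whole grep line, jq can do the extraction. The snippet below simulates the file with a canned payload (the JSON layout is my assumption, not the documented schema; adapt the filter and point it at ~/.openclaw/openclaw.json on the real VM):

```shell
# Canned config so the filter can be tested offline; the real file lives at
# ~/.openclaw/openclaw.json (its exact layout may differ).
cfg="$(mktemp)"
printf '{"gateway":{"auth":{"mode":"token","token":"abc123"}}}' > "$cfg"

# Recursive descent: grab any "token" key wherever it sits in the document.
token="$(jq -r '.. | .token? // empty' "$cfg")"
echo "$token"
```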

Carefully copy this string of characters. Then open your browser and access your Web interface (e.g., https://192.168.84.134).

Enter the token in the “Gateway Token” box.

Device Approval

Once connected, you will notice that something is missing: the system is waiting for us to approve the “device” (the PC or tablet from which we wish to use OpenClaw) to grant it the right to process requests.

Return to your terminal to list the pending devices:

openclaw devices list

Locate your device ID in the list (a UUID-type string) and approve it:

openclaw devices approve b7beb7fa-fa4e-46e9-aec1-282bcce881f6

💡 Device approval (devices approve) is much more than a simple interface formality. It's a sort of cryptographic handshake. This mechanism guarantees that no unsolicited machine can attach itself to your OpenClaw instance without your knowledge!

Interaction Tests

The OpenClaw instance is now 100% operational! To validate our entire stack, there’s nothing like a full-scale test. You can send a first prompt on the web interface’s integrated chat, or configure a bridge to send a message on the Discord side.

Conclusion

We went from a simple Ubuntu VM to a true secured inference server, powered by Node.js and accelerated by a dedicated NVIDIA Tesla P4 GPU via Nutanix AHV. The architecture is clean, secured behind a Caddy proxy, and ready to handle our requests.

But this is only the beginning. In upcoming articles, we will go even further: I will show you how to configure OpenClaw via the startup wizard, deploy local models via Ollama, create an interactive Discord bot, and even inject Google API keys to equip our AI with search capabilities. Stay tuned!

Read More

If you follow my ramblings on the blog, you know I love tinkering with my clusters and testing somewhat off-the-beaten-path stuff (see my Steam Deck articles, for example). Recently, I had a thought: Gemini or Claude in the public cloud is great for coding a Python script or writing emails. But ask it to interact with our local infrastructure, and that's where it gets stuck.

So I wondered how I could connect artificial intelligence closer to my VMs. With this in mind, I got my hands on OpenClaw. Honestly, it was a bit of an obstacle course at the start. No more simple conversational gadgets, here we are talking about deploying a true Private AI on a Nutanix AHV cluster capable of acting on our infrastructure. Let me present the tech stack I chose for this experiment.

What is OpenClaw?

For those who have been living in a cave these past few months, OpenClaw is a GitHub project that exceeded 300k stars in just a few months. Imagine an ultra-intelligent thought translator coupled with a butler. Instead of clicking through dozens of menus in a complex interface, you simply ask your infrastructure to work for you in natural language (via a universal web interface or even messaging apps like WhatsApp and Telegram). It is even capable of working on its own while you sleep!

But where it gets exciting for us engineers is under the hood. OpenClaw is not just another “stateless” Large Language Model (LLM) that forgets everything with each new request. It is a true Agentic Gateway. Concretely, this means it orchestrates autonomous agents equipped with tools. These agents can be configured to tap directly into our cluster’s private APIs (like the REST APIs of Prism Element or Prism Central), code, browse the web, and synthesize certain information. In short, we don’t just ask the AI questions anymore, we delegate tasks to it.

Why Self-hosted?

In the field, the question of data governance comes up the second the word “AI” is uttered. Sending sensitive information to servers over which I have no control is out of the question!

Choosing the self-hosted route with OpenClaw means taking back absolute control. Data flows, execution logs, and API credentials stay safely locked down on my network, isolated from the internet if desired.

Architecture and Tech Stack

For this project, a quick “Next, Next, Finish” job was out of the question. Here is the robust technical architecture I ended up validating for my deployment.

The Foundation: Nutanix AHV & Ubuntu 24.04 LTS

To run this beast, you need solid foundations. I provisioned a virtual machine running Ubuntu 24.04 LTS hosted directly on my Nutanix AHV cluster.

On the sizing side, I went with 8 vCPUs, 32 GB of RAM, and 250 GB of dedicated storage. You might tell me: “32 GB for a gateway, isn’t that a bit too much?” The gateway will have to ingest substantial data streams, maintain the cache of the various active agents, and potentially handle heavy parallel API querying. And besides, I can allocate these resources in my lab, so why deprive myself?

The Application Engine: NodeJS 22

At the heart of OpenClaw, the magic happens thanks to NodeJS 22. It is the execution engine that runs the gateway and its AI agent integrations.

Why is Node 22 an excellent architectural choice here? For its asynchronous management (Event Loop). When you ask OpenClaw to do a status report on 50 VMs, the gateway will initiate multiple API calls to Prism Central while keeping your WebSocket stream open to reply in real-time in the chat interface. NodeJS excels in this non-blocking concurrency management.

Network Routing: Caddy

The usual operating mode for OpenClaw is to deploy it locally on the machine from which you will connect to it, or to set up a tunnel to access the remote instance. Let’s not lie to ourselves, I wanted to type the IP in my browser and be able to access my instance, whether I’m on my PC or my tablet.

To make this possible, I use a Caddy Reverse Proxy. Caddy manages traffic routing and HTTPS encryption fully automatically.

I can already hear you saying: “Yes, but if someone connects to your local network, they'll have access to your instance!” Well, no! OpenClaw natively integrates a device-whitelisting system. If your PC has never connected to the instance, you will have to provide the “Gateway Token”, and the new connection then has to be accepted on the OpenClaw instance side. In short, only previously authorized devices can use your local instance.

The Entry Point: Discord

The choice of entry point, that is, how you will interact with OpenClaw, is largely a matter of personal taste.

OpenClaw ships with a built-in chat system so you can talk to it. It's fine, it's native, but it's inaccessible when I'm not at home. The system also lets you configure external entry points like Telegram, WhatsApp, Discord, or even Teams and Slack. And that is clearly a big plus, because it opens up almost unlimited possibilities!

What’s Next?

The goal of this article was to present the architecture envisioned for my OpenClaw assistant, to understand what we are deploying and why. We therefore have a coherent technical stack, performant thanks to Nutanix AHV, and hosted locally.

In a future article, I will explain how to install OpenClaw step by step until you have a functional instance.

Read More
Nutanix AHV API

I’ll be honest: a while ago, development and APIs weren’t exactly my cup of tea. My playground was the console, SSH, commands typed on the fly. But with the future (and inevitable) blocking of SSH access on Prism Element and Prism Central, I had no choice: I had to get serious about it. And if I’m going to dive into the world of Nutanix APIs from my Windows PC, I might as well do it with the right tools to avoid tearing my hair out. In this article, I’ll show you how I equipped myself with the perfect tools to query Nutanix APIs.

Why Optimize Your Windows Environment for APIs?

For years, my reflex as a sysadmin facing a complex or repetitive task on Nutanix was the same: open PuTTY, connect via SSH to a Controller VM (CVM), and run ncli or acli commands on the fly. It was fast, it was efficient.

But I'll be direct: that era is over. Nutanix is making a major security shift. SSH access to clusters will be disabled in an upcoming release, relegated to a mere emergency channel for support. The only sustainable, supported, and scalable way to interact with your infrastructure is the API. Whether it's the v2 API to drive a local cluster via Prism Element, or the v3 APIs on Prism Central, API automation has become the norm.

The problem? Windows hasn't historically been great at handling complex web requests from the command line. PowerShell has made huge progress with Invoke-RestMethod, but when it comes to testing, debugging, and formatting nested JSON, nothing beats a solid Linux foundation coupled with a graphical API client.

That’s where our two best allies come in: WSL (Windows Subsystem for Linux) for the power of the native command line, and Postman for visual exploration of Nutanix APIs. Let’s see how to put all this together.

Solution 1: The System Foundation with WSL (Windows Subsystem for Linux)

How do you avoid tearing your hair out over the quoting of a curl command in the Windows command prompt (cmd), or fighting character escaping in PowerShell? The most elegant and robust solution today is the Windows Subsystem for Linux (WSL).

Deploying WSL and Ubuntu in Minutes

The installation is ultra-simple on recent versions of Windows 10 and 11. Open a PowerShell console as an administrator and simply type this magic command:

wsl --install ubuntu

Then restart your PC. You now have a functional Ubuntu distribution, fully integrated into your Windows, without the heaviness of a classic virtual machine. It’s the perfect environment to run your future Bash or Python scripts targeting the Nutanix infrastructure.

Look for the “Ubuntu” icon in the Windows start menu to launch the command prompt on the subsystem.

The Essential Packages: curl and jq

Once in your new Ubuntu terminal, you are missing two vital tools for talking to REST APIs: curl (the standard for forging web requests) and jq (the absolute Swiss Army knife for manipulating, filtering, and formatting JSON responses). Install them with these commands:

sudo apt update && sudo apt upgrade
sudo apt install curl jq -y

Why is jq so critical in our line of work? Let me share a concrete field situation with you. JSON responses returned by Prism Element or Prism Central are often extremely verbose. If I simply want to retrieve the unique identifier (UUID) of my cluster via the v2 API to use it in a script, without drowning in hundreds of lines of configuration, here is the exact command I use:

curl -k -u admin:MyPassword -X GET https://<YOUR_PRISM_ELEMENT_IP>:9440/api/nutanix/v2.0/cluster | jq '.cluster_uuid'

The -k parameter is crucial here: it ignores the SSL certificate warning (the certificate is self-signed by default on Nutanix), and | jq '.cluster_uuid' instantly filters the raw response down to the targeted information (in this example: "00064a67-579d-c757-5883-002590b8ef5a"). It's clean, neat, and drops straight into a variable to automate a deployment workflow, for example.
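To go from one-liner to script, here is how I capture that value in a variable. Below, a canned payload stands in for the API response so the parsing can be tested offline; swap the hard-coded string for the real curl call on your cluster:

```shell
# Canned sample mimicking the relevant part of the v2.0 /cluster response.
# In a real script, replace the assignment with something like:
#   response=$(curl -ks -u "admin:${PASSWORD}" "https://${PRISM_IP}:9440/api/nutanix/v2.0/cluster")
response='{"cluster_uuid":"00064a67-579d-c757-5883-002590b8ef5a","name":"LAB01"}'

# -r strips the JSON quotes so the value is usable as a plain shell string.
cluster_uuid="$(echo "$response" | jq -r '.cluster_uuid')"
echo "Cluster UUID: ${cluster_uuid}"
```

From there, ${cluster_uuid} can be injected into any subsequent API call of your workflow.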

Solution 2: The Must-Have Graphical API Client: Postman

The command line is great for running production scripts. But when it comes to exploring a new API, testing the parameters of a complex request, or analyzing the structure of a 500-line JSON payload, I prefer a graphical interface. And in this area, Postman is perfectly suited. You can download and install it in seconds from their official website.

Configuring Your First Workspace

The first mistake I made when starting with the Nutanix API (and with Postman in general), is hardcoding my IP addresses, usernames, and passwords in every request. Never do that! Not only is it tedious if you switch clusters, but it’s especially a major risk of information leakage if you share your screen or your collections.

Postman offers a vital feature: Environments. Create a new environment (e.g., “Prod Cluster”) and define three variables in it:

  • cluster_ip: The IP address of your Prism Element or Prism Central.
  • username: Your service account (avoid using the default admin account if possible).
  • password: The associated password (to be configured as “Secret” type to hide it).

From now on, in your requests, you will no longer use the raw URL, but the call to variables between double curly braces: https://{{cluster_ip}}:9440/api/nutanix/v2.0/...

The Expert’s Trick: Importing the Prism Central Swagger

Here is my real time-saver. The Nutanix APIs, particularly the v3 APIs on Prism Central, are extremely vast. Rather than creating your GET, POST, or PUT requests one by one while laboriously reading the documentation on the Nutanix.dev portal, did you know you can pull the entire API definition directly from your own cluster?

The Prism Central API exposes its OpenAPI specification (Swagger). In Postman, click on the “File > Import” button in the top left menu, choose “Link”, and simply paste this URL (replacing the IP with your Prism Central’s IP): https://<PRISM_CENTRAL_IP>:9440/static/v3/swagger.json

Let the magic happen: Postman will query your cluster and automatically generate a complete Collection containing absolutely all possible v3 API requests, preformatted with the right headers and sample payloads. It’s a massive time saver for exploration!

Test: The First API Call from Postman

Now that the tooling is ready and my variables are configured, it’s time to make the first graphical request to the cluster to retrieve its global information.

Managing Authentication and Bypassing the SSL Trap

Create a new request in Postman (+ button or New > HTTP Request). Select the GET method and enter the following URL using our variable: https://{{cluster_ip}}:9440/api/nutanix/v2.0/cluster

Before clicking “Send”, we have two settings left to make:

  1. Authentication: Go to the Authorization tab, choose the Basic Auth type. In the Username and Password fields, type {{username}} and {{password}} respectively. Postman will replace these values on the fly.
  2. The SSL Certificate: By default, Nutanix uses self-signed certificates. If you run the request now, Postman will block the call with a security error. Go to File > Settings (or the gear icon), General tab, and disable the “SSL certificate verification” option. This is the graphical equivalent of our -k parameter in curl.

Click Send. If everything is green (Status 200 OK), you should see a beautiful formatted JSON appear at the bottom, containing your cluster’s UUID, its name, its AOS version, and its virtual addresses. Congratulations, your workstation is communicating with Nutanix!

Basic Auth vs JSESSIONID

If you are just starting out, the Basic Auth method (which sends your credentials with every request) is perfectly fine. But beware: this method has a performance cost.

Why? Because every time you make an API call in Basic Auth, the CVM’s Acropolis service must validate your credentials with the authentication module (and often, these credentials will be linked to an Active Directory via LDAP). If you run a script that makes 500 requests in a row to inventory VMs, you will trigger 500 identity validations. This unnecessarily saturates the CVMs and your domain controllers.

The best practice if you script massively: Authenticate only once, and use session Cookies! When you make an initial authentication request or query the API, Nutanix sends you back a cookie named JSESSIONID. Postman stores it automatically and uses it for subsequent requests in your collection. In your future Bash/Python scripts, always remember to retrieve this cookie during the first call, and pass it in the Headers of your subsequent calls. You will drastically relieve the management plane of your cluster!
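As a sketch of that pattern in Bash (the /cluster and /vms routes are standard v2.0 endpoints, but the wrapper functions, variable names, and jar path are my own illustration):

```shell
# Sketch of the session-cookie pattern; adapt PRISM_IP and credentials to your setup.
PRISM_IP="${PRISM_IP:-192.168.84.10}"
COOKIE_JAR="$(mktemp)"

nutanix_login() {
  # One Basic Auth call; -c stores the returned JSESSIONID cookie in the jar.
  curl -ks -c "$COOKIE_JAR" -u "admin:${PRISM_PASSWORD}" \
    "https://${PRISM_IP}:9440/api/nutanix/v2.0/cluster" > /dev/null
}

nutanix_get() {
  # Subsequent calls replay the cookie (-b): no credential re-validation,
  # so no extra load on the CVMs or your domain controllers.
  curl -ks -b "$COOKIE_JAR" "https://${PRISM_IP}:9440$1"
}

# Usage against a real cluster:
#   nutanix_login
#   nutanix_get /api/nutanix/v2.0/vms
```

One authentication, five hundred requests: the management plane will thank you.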

Conclusion: Security Reminder and Advanced Usage

All the tools are now in place to free myself as much as possible from SSH during my next troubleshooting sessions.

I must give a fundamental security reminder. Postman allows you to export your collections to share them with your colleagues or back them up. It’s great for teamwork. But beware: if you haven’t used environment variables as we saw earlier, and you typed your passwords “hardcoded” directly in the Authorization tab of your requests, they will be exported in clear text in the collection’s JSON file.

Always ensure your “Secrets” remain in your local Environment configuration (which, by default, does not export current values with the collection). I’ve seen too many admin passwords lying around on the network because of this!

Now, all I have to do is look into the scripting and automation part to be able to develop applications that will help me drive, audit, and configure my Nutanix clusters.

But that will be the subject of a future article! Until then, happy querying to all.

Read More
GPU on Nutanix AHV with Linux

Integrating graphics processing power within virtualized environments has become a must. Whether it’s to run Artificial Intelligence models, Machine Learning, or simply for intensive video processing, our virtual machines increasingly need muscle.

When I talk with clients, I often get questions about this: how do you assign a physical graphics card to a VM in a simple and performant way?

Today, I suggest we look together at how to deploy an NVIDIA Tesla P4 GPU on an Ubuntu Server 24.04 VM hosted on Nutanix AHV, using “Passthrough” mode.

1. Prerequisites

Before getting our hands dirty, let’s take a moment to check our equipment. Good preparation is half the work done! I myself have lost hours in the past due to a simple forgotten prerequisite.

To follow this tutorial, you will need:

  • A Nutanix node (physical cluster) equipped with at least one NVIDIA Tesla P4 card.
  • A virtual machine running Ubuntu Server 24.04.
  • Functional SSH access to this VM with sudo privileges.

Although Nutanix AHV handles this transparently for you, keep in mind that Passthrough mode relies on specific hardware instructions. It requires that I/O virtualization extensions (VT-d for Intel or AMD-Vi for AMD) are properly enabled in the BIOS of your physical node. If you ever build a “home-lab” cluster, this is the first thing to check!

2. Nutanix Configuration: Passthrough Mode

Now that our foundation is solid, let’s move on to the administration interface. This is where the magic happens. Whether you use Prism Element or Prism Central, the logic remains the same. First, make sure your virtual machine is powered off.

Go to your VM’s settings, select “Update”, and scroll down to the “GPUs” section. Click on “Add GPU”.

In the window that opens, the choice is crucial: in the “GPU Type” drop-down menu, select Passthrough mode, then choose your Tesla P4 from the list.

Passthrough mode is a special feature: it allows you to “hand over the keys” of the physical graphics card directly to the virtual machine. The guest OS has the illusion (and the benefits) of physically owning the card.

You might be wondering why we prefer Passthrough over vGPU? It’s a matter of use case, but also architecture.

vGPU allows you to virtually slice a card to share it among several VMs, which is great for VDI, but it requires the installation and maintenance of an NVIDIA license server (vGPU Software).

Passthrough, on the other hand, dedicates 100% of the Tesla P4’s power to our Ubuntu VM, without any additional license server. For a raw single-VM performance need, it’s clearly the best option.

3. Preparation and Ubuntu 24.04 Update

Once the GPU is attached, save the configuration, power on your VM, and connect via SSH.

Before we rush into installing the NVIDIA drivers, there’s one step I absolutely never skip: updating the system.

Simply type this command:

sudo apt update && sudo apt upgrade -y

Proprietary NVIDIA drivers rely on the DKMS (Dynamic Kernel Module Support) system to compile kernel modules on the fly during installation.

If your kernel headers are not perfectly synchronized with your current Linux kernel version, the installation will fail silently. A freshly updated system is your best guarantee for a clean, hitch-free compilation!
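A quick way to check that precondition before installing (the paths follow standard Ubuntu conventions; the build symlink only exists when the headers for the running kernel are installed):

```shell
# DKMS needs the headers of the *running* kernel to compile the NVIDIA module.
kernel="$(uname -r)"
if [ -e "/lib/modules/${kernel}/build" ]; then
  echo "Kernel headers present for ${kernel}"
else
  echo "Missing headers. Try: sudo apt install linux-headers-${kernel}"
fi
```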

4. NVIDIA Drivers Installation (Server Branch)

Now that our system is clean and updated, let’s move on to the main course. On Ubuntu, we often have the reflex to use the ubuntu-drivers autoinstall command or install the latest trendy “desktop” version. Let me stop you right there!

For a server, especially in production, stability is key. That’s why we are going to install the “Server” branch of the driver. Type the following command:

sudo apt install nvidia-driver-535-server -y

💡 The Expert’s Tip: Why specifically the “server” package? NVIDIA maintains specific driver branches for data centers. The “Server” branch (or Tesla driver) is designed for long lifecycles (LTS) and minimizes the risk of regression. Installing a “Desktop” driver on a hypervisor or an AI server means risking a minor update breaking your production environment on a Friday at 5 PM!

Let the installation finish (this may take a few minutes while DKMS compiles the module for your kernel). Once completed, a reboot is mandatory to properly load the driver:

sudo reboot

5. Validation

After the reboot, reconnect via SSH to your virtual machine. To verify that the OS, the driver, and the hardware are communicating perfectly, NVIDIA provides us with an essential command-line tool:

nvidia-smi

If all went well, you should see a beautiful ASCII dashboard appear.

💡 Explanation: At the top right, note the CUDA Version (here 12.2): this is the maximum version of the CUDA API supported by this driver, crucial info for AI developers. Also look at the Perf column: it indicates “P0”, which corresponds to the maximum performance state (the “P-states” range from P0, full performance, down to P12, maximum power saving). If your card stays stuck in a low-performance state (a high P number) while under load, there’s a hardware or thermal issue! Finally, this output confirms that the OS is ready to host the NVIDIA Container Toolkit (for Docker with GPU).
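If you want to script this validation (for monitoring, say), nvidia-smi has a machine-readable query mode (`--query-gpu` with `--format=csv`). Here is a small sketch that flags a card stuck in a low P-state while busy; the 80% utilization threshold and the warning wording are my own choices, not anything official:

```shell
# Flag GPUs that are heavily loaded but not in the P0 performance state.
# Input format: "name, pstate, utilization.gpu [%]" as produced by:
#   nvidia-smi --query-gpu=name,pstate,utilization.gpu --format=csv,noheader
check_pstate() {
    awk -F', ' '$3+0 > 80 && $2 != "P0" { print "WARN: " $1 " stuck at " $2 " under load" }'
}

# Example on captured sample output (values are hypothetical):
printf 'Tesla P4, P8, 95 %%\n' | check_pstate
# → WARN: Tesla P4 stuck at P8 under load
```

Piping the real nvidia-smi query output through `check_pstate` gives you an empty result when everything is healthy, which makes it trivial to hook into a cron job or a monitoring probe.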

Conclusion

And there you have it! In a few simple steps, we successfully physically presented our NVIDIA Tesla P4 GPU to our Ubuntu 24.04 VM under Nutanix AHV. Passthrough mode allowed us to achieve a high-performance, zero-latency configuration without the overhead of an external license server.

Our Tesla P4 is now properly installed! The next logical step? Deploying LLMs (Large Language Models) or compute-heavy applications in isolated containers. But that will be for a future article on the blog!

Read More

I will never forget the day the reality of hyperconvergence hit me. We were in the middle of an infrastructure migration. On one side, we had two full 42U racks from the 3-tier era, packed with servers and storage arrays. On the other side, to replace them, we only needed… 6U.

Two 2U Nutanix blocks (with 4 nodes in each block) and two Top of Rack switches. That was it. 84 rack units reduced to 6. The contrast was so violent it almost felt suspicious. How could such a small physical footprint replace our historic cabinets?

But make no mistake. Beneath this apparent simplicity lay a major technological rupture. We had moved from a “Hardware-Defined” era, where intelligence resided in expensive proprietary ASICs, to a “Software-Defined” era.

This void in the racks wasn’t just aesthetic. It told another story: one of exploding density, radically changing the economic equation of the datacenter. Less cooling, less floor space, less power consumption for tenfold computing power. The storage array hadn’t disappeared: it had been absorbed and virtualized by software.

The Legacy of Web Giants

To understand where this magic comes from, we have to go back to the early 2000s, far from air-conditioned enterprise server rooms, into the labs of Google and Amazon.

At that time, these giants were hitting a wall: the 3-tier model didn’t scale. To index the entire web, using traditional storage arrays like EMC or NetApp would have cost an astronomical amount. They had to find another way.

Their stroke of genius was to flip the table. Instead of buying “Premium” hardware designed never to fail (and sold at a gold price), they decided to use “Commodity Hardware”. Standard x86 servers, cheap, almost disposable.

The philosophy changed completely: hardware will fail. It is a statistical certainty. Rather than fighting this reality with redundant components, they decided to manage failure at the software level.

For purists and tech historians, the founding moment is captured in a PDF document published in October 2003: The Google File System (SOSP’03). This research paper is the bible of modern infrastructure. It describes a system where thousands of unreliable hard drives are aggregated by intelligent software that ensures resilience. If a drive dies? The system doesn’t care. No need to rush to replace the disk at 3 AM. The software has already replicated the data elsewhere.

Hyperconvergence is simply the arrival of this “Web Scale” technology, packaged and democratized for our enterprises.

Anatomy of an HCI Node: How Does It Work?

Concretely, what changes at the hardware level? In a hyperconverged infrastructure, we no longer separate Compute and Storage. Everything is combined in the same chassis, called a “Node”.

Each node contains its own processors, RAM, and its own disks (SSD, NVMe, HDD). But unlike a classic server, these disks aren’t just for installing the local OS. They are aggregated with the disks of other nodes in the cluster to form a global storage pool.

This is where the real revolution comes in: the CVM (Controller VM).

Imagine taking the physical controllers of your old SAN array (the compute part) and turning them into software. On each physical server in the cluster, a special virtual machine (the CVM) runs permanently. It is the conductor.

For the technical expert, the feat lies in hardware management. The hypervisor (ESXi or AHV) does not manage the storage disks. Thanks to a technology called PCI Passthrough (or I/O Passthrough), the CVM bypasses the hypervisor to speak directly to the disks. Result: raw performance without the classic virtualization overhead.

The Strengths of Hyperconvergence

Beyond the hype, three technical arguments have hit the mark in enterprises.

1. Scale-Out (The LEGO Approach)

Gone is the headache of 5-year sizing. With 3-Tier, when the array was full, it was panic mode (Scale-Up). With HCI, if you need more resources, you buy a new node and plug it in. The cluster automatically absorbs the new CPU power and storage capacity. It is linear and predictable growth.

2. Data Locality

This is the Holy Grail of performance. In a classic architecture, data had to cross the SAN network to reach the processor. With HCI, software intelligence ensures that data used by a VM is (whenever possible) stored on the disks of the physical server where it is running. The path is near-instantaneous. The network is no longer a bottleneck.

3. Distributed Rebuild (Many-to-Many)

This is often the argument that finally convinces administrators traumatized by RAID rebuilds. On a classic array (RAID 5 or 6), if a 4TB drive breaks, a single “hot spare” drive has to rewrite everything. This can take days, during which performance collapses. In HCI, data is replicated in chunks all over the cluster. If a drive dies, all other disks in all other nodes participate simultaneously in reconstructing the missing data. We move from a “1 to 1” problem to a “Many to Many” solution. Result: resilience is restored in minutes.

The Weaknesses: What Marketing Forgets to Mention

If hyperconvergence seems magical, it is not without flaws. As an expert, it is crucial to understand the trade-offs of this architecture.

The first is the “CVM Tax”. Intelligence isn’t free. Since the storage controller is now software, it consumes CPU and RAM resources that are no longer available for your applications. On very small clusters, reserving 20GB or 24GB of RAM per node just to “run the shop” can seem heavy, even if it is the price of peace of mind.

The second technical limitation is the critical dependence on “East-West” network traffic. In a 3-Tier array, replication traffic remained confined within the array. In HCI, to secure data (RF2 or RF3), the CVM must write it locally but also immediately send it over the network to another node. If your 10/25 GbE network is unstable or poorly configured, the entire performance and stability of the cluster collapses. The network is no longer a simple commodity; it is the nervous system of your cluster. I repeat it to every client: an HCI cluster is 80% network. If your network has a problem, your HCI cluster has a problem.

Nutanix, The Pioneer

Hyperconvergence marked the end of an era. It proved that software could supplant specialized hardware, transforming our rigid datacenters into agile private clouds.

But an idea, however brilliant (like the Google File System), is useless if it remains confined to a research lab. Someone had to take these complex concepts and make them accessible to any system administrator in less than an hour.

That is where Nutanix comes in.

Founded by former Google employees who had worked on GFS, this company created NDFS (Nutanix Distributed File System). They succeeded in the crazy bet of running a “Google-type” infrastructure on standard Dell, HP, or Lenovo servers.

How did Nutanix manage to become the undisputed leader of this market, surviving even the assault of VMware with vSAN? That is what we will dissect in the next article of this series.

Read More

Let’s be honest: shutting down a complete Nutanix cluster is always a bit stressful. Even after 15 years in the business. Why? Because even with the best HCI technology on the market, cutting the power on an IT infrastructure is never trivial.

I’ve seen too many “cowboys” pull the plug or perform a brutal “Shutdown” via IPMI, thinking data resiliency would handle the rest. Spoiler alert: this often ends with Level 3 Nutanix support on the line to recover corrupt Cassandra metadata or with the loss of one or more disks.

This guide is my lifeline to ensure my cluster restarts without issues. No GUI, no Prism Element for the critical steps. We open the terminal, connect via SSH, and do it properly.

Phase 1: Health Checks

Before even thinking about stopping a single VM, you must ensure the cluster is capable of stopping (and more importantly, restarting). If your cluster is already suffering, shutting it down is not always a good option.

1.1 SSH Connection to the CVM

Open your favorite terminal (PuTTY works just fine) and connect via SSH to the cluster’s virtual IP address (Cluster VIP) with the user nutanix.

1.2 Nutanix Cluster Checks (NCC)

To ensure the cluster is healthy, run a full NCC check:

ncc health_checks run_all

My advice: Don’t just skim through the report. If you have a “FAIL” on Cassandra, Zookeeper, or Metadata, STOP. Fix it before shutting down. A warning about a full disk or an old NTP alert is acceptable. But data integrity is non-negotiable.
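To avoid skimming, I like to save the report and extract the blocking lines mechanically. A minimal sketch, assuming you redirected the NCC output to a file and that the report uses the standard PASS/WARN/FAIL wording:

```shell
# After: ncc health_checks run_all | tee ncc_report.txt
# Surface only the blocking results:
if grep -E 'FAIL|ERR' ncc_report.txt; then
    echo ">>> Fix these before shutting anything down."
else
    echo "No FAIL entries - proceed to the resiliency check."
fi
```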

1.3 Resiliency Verification

The Prism dashboard is pretty; it tells you “Data Resiliency Status: OK”. That’s good, but it’s not precise enough for a total shutdown. I want to know if my data is truly synchronized, right now.

Type this command and read the output carefully:

ncli cluster get-domain-fault-tolerance-status type=node

What you need to see: A line indicating Current Fault Tolerance: 2 (or 1 depending on your RF configuration).

If you see a state indicating a rebuild in progress, do not shut down the cluster and wait for the rebuild to finish.
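If you drive your maintenance windows with scripts, you can extract that value and refuse to continue automatically. A sketch: the parsing keys off the “Current Fault Tolerance” label shown above, and the minimum of 1 is my own guardrail, not a Nutanix rule:

```shell
# Extract the fault-tolerance value from the ncli output.
parse_ft() {
    awk -F':' '/Current Fault Tolerance/ { gsub(/ /, "", $2); print $2; exit }'
}

# Example on captured sample output (values are hypothetical):
FT="$(printf 'Domain Type : NODE\nCurrent Fault Tolerance : 1\n' | parse_ft)"
if [ "${FT:-0}" -ge 1 ]; then
    echo "FT=${FT} - safe to shut down"
else
    echo "FT=${FT:-unknown} - a rebuild is likely in progress, ABORT"
fi
```

In real life you would feed `parse_ft` with the live `ncli cluster get-domain-fault-tolerance-status type=node` output instead of the printf.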

Phase 2: Shutting Down Workloads

Once the cluster is validated as healthy, we move on to the virtual machines. The classic mistake is rushing to shut down the nodes: the operation will be refused while virtual machines are still running on the cluster.

2.1 The Battle Order

Start by shutting down your test/dev environments, then application servers, and finally databases. It’s common sense, but it’s always good to be reminded.

Once all production machines are off, you can now shut down the remaining “tooling” VMs of your infrastructure: AD, DNS, firewalls…

2.2 Managing Prism Central

Connect to Prism Central via SSH with the nutanix account, then run the stop command:

cluster stop

Wait for the PCVM services to stop and verify that the cluster is indeed stopped:

cluster status

If all services are stopped and the cluster status is “stop”, we can now proceed to shut down the PCVM:

sudo shutdown -h now

Phase 3: Stopping Nutanix Services (“Cluster Stop”)

Your VMs and Prism Central are off. Your hosts are running nothing but CVMs (Controller VMs). This is the critical moment. We never perform an OS shutdown of the CVMs without first stopping the cluster services properly.

Why? Because a brutal shutdown of CVMs can lead to data corruption or metadata inconsistencies that might require support intervention.

3.1 Stopping the Cluster

Reconnect to your Nutanix cluster VIP and simply type:

cluster stop

The system will ask for confirmation before launching operations. Type Y.

This command orders each CVM to stop its services in a precise order. The Stargate service (which handles storage I/O) ensures everything is “flushed” to disk before shutting down.

You will see lines scrolling by indicating the stop of Zeus, Scavenger, Cassandra, etc. Be patient. Depending on the cluster size, this can take 2 to 5 minutes.

3.2 Verification

Once the operation is complete, check the actual state of services:

cluster status

What you need to see: A list of services for each CVM. They must all be in the DOWN state, with the potential exception of the Genesis service which may remain UP; this is normal.

If you see other services still UP, wait a minute and run the check again. Do not proceed until the cluster is logically fully stopped.
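That last check can also be scripted. A sketch that works on a captured `cluster status` output and tolerates Genesis, as explained above (the uppercase UP/DOWN wording is assumed from the standard output):

```shell
# After: cluster status > status.txt
# List services still UP, ignoring Genesis (which may legitimately stay up):
lingering="$(grep 'UP' status.txt | grep -vi 'genesis' || true)"
if [ -n "$lingering" ]; then
    echo "Still UP:"; echo "$lingering"
else
    echo "Cluster logically stopped - safe to shut down the CVMs."
fi
```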

Phase 4: Shutting Down CVMs and Physical Nodes

We are at the end of the tunnel. The cluster is logically stopped. Only empty shells remain: the CVMs (which are Linux VMs, let’s not forget) and the hypervisors.

4.1 Stopping CVMs

You must now connect to each CVM individually (via its IP, no longer via the VIP) and run the shutdown command.

The official command:

cvm_shutdown -P now

The cvm_shutdown command contains specific hooks to notify the hypervisor. Repeat the operation on each node of the cluster.

4.2 Stopping Hypervisors

Once the CVMs are off, connect to your hosts (via SSH or IPMI) and on each of them type the following command:

shutdown -h now

The Expert Nugget: The Automation Script ⚡

Do you have a 16-node cluster and don’t feel like connecting 32 times (16 CVM + 16 Hosts)? I get it.

Here is a script to run from any CVM in the cluster that will shut down all CVMs, then all AHV hosts.

⚠️ WARNING: This script asks no questions. Ensure you have validated Phase 3 (cluster stop) before launching this, otherwise, a crash is guaranteed.

The “Kill Switch” Script (For AHV)

From a CVM, this script retrieves the IPs of other CVMs and hosts, then sends the shutdown order in sequence.

for svmip in `svmips`; do ssh -q nutanix@$svmip "sudo /usr/sbin/shutdown +1 ; hostname"; done
for hostip in `hostips`; do ssh -q root@$hostip "/usr/sbin/shutdown +3 ; hostname"; done
  • The first command orders the shutdown of CVMs after a one-minute delay.
  • The second command orders the shutdown of nodes after a 3-minute delay.

Once you have launched the commands, you will lose connection after one minute. You can then monitor the shutdown of your nodes from their respective IPMI interfaces.

Phase 5: Powering Back Up (Cold Boot)

The maintenance period is over. What do we do? Press ON and pray? No, we follow the reverse order.

  1. Physical Network: Turn on your Top-of-Rack switches first. If the network isn’t there, the nodes won’t see each other upon booting.
  2. IPMI / Physical: Turn on the physical nodes.
  3. Patience: AHV will boot, then automatically start the CVM.
    • Tip: Don’t touch anything for 10 minutes. Let the CVMs form the cluster.
  4. Starting the Cluster: Connect via SSH to a CVM. Verify that all CVMs are up (svmips should list them all). Then run: cluster start
  5. Verify that the cluster has started properly with the command: cluster status
  6. Starting Workloads: Once the cluster is UP, power on the PCVM first, then your VMs (Infra first, Appli second).

Conclusion

Shutting down a Nutanix cluster is a simple procedure but requires good sequencing. It’s not complicated, but it doesn’t forgive impatience. If you follow these steps, you’ll sleep soundly during the power outage.

Read More

We’ve all been there. That moment when your monitoring dashboard shows a beautiful green circle for your Nutanix cluster, while in reality, one of the nodes is struggling. That’s exactly what happened to me recently.

When I integrated Nutanix into my infrastructure, my first instinct was to pull out Centreon. Why? Because it’s my Swiss Army knife for monitoring. But I quickly realized that the “standard” method of adding a cluster locks us into an illusion of security. We see the “whole,” but we miss the “detail.”

In this feedback report, I’ll share my experience with Nutanix Centreon monitoring and explain why you should stop monitoring your cluster solely through its Virtual IP (VIP) and switch to a granular node-by-node strategy.

Why the “default” configuration left me wanting more

When installing the Nutanix Plugin Pack on Centreon, the documentation naturally guides you toward adding a single host representing the cluster.

How the standard Nutanix Plugin Pack works

The classic method involves querying the cluster’s Virtual IP (VIP) or the IP of one of the CVMs (Controller VM). It’s simple and fast: you enter the SNMP community, apply the template, and the services appear. You then monitor global CPU usage, average storage latency, and the general status reported by Prism.

The “Black Box” problem

This is where the trouble starts. By querying only the VIP, you are actually querying an SNMP agent that aggregates data. If you have a 3-node cluster, the monitoring will tell you that the cluster-wide memory is “OK.” But what about the memory load on node #3?

This is what I call the “black box” effect. Nutanix’s Shared Nothing architecture is a strength for resilience, but it can become a blind spot for monitoring if you don’t drill down to the physical layer. For an expert, knowing the cluster is “Up” is not enough; we need to know which specific physical component requires intervention before redundancy is compromised.

Decoupling monitoring for granular visibility

To break out of this deadlock, I changed my approach: treating each node as its own entity in Centreon. Here’s how I did it.

Step 1: Setting the stage on Prism Element

Before touching Centreon, you must ensure Nutanix is ready to talk. Head to Prism Element, in the SNMP settings. Here, I configured SNMP v2c access (or v3 if you want to max out security).

Check out my dedicated articles if you need details on how to configure SNMP v2c or SNMP v3 on your Nutanix cluster.
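Before declaring anything in Centreon, I like to confirm from a shell that the cluster actually answers over SNMP. A hedged sketch: the community string and IP below are placeholders, and 41263 is Nutanix’s registered enterprise OID:

```shell
# First walk the Nutanix enterprise subtree and save the result, e.g.:
#   snmpwalk -v2c -c MyCommunity 192.168.1.50 .1.3.6.1.4.1.41263 > walk.txt
# Then confirm the Nutanix MIB actually answered:
if grep -Eq '41263|NUTANIX-MIB' walk.txt; then
    echo "SNMP OK - Centreon will have data to filter on"
else
    echo "No Nutanix OIDs - re-check the Prism Element SNMP settings"
fi
```

If the walk comes back empty, fix the Prism side first; no amount of Centreon configuration will conjure up data the agent never sends.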

Step 2: The “Node by Node” addition strategy in Centreon

This is where the magic happens. Instead of creating a single “Cluster-Nutanix” host, I created as many hosts as I have physical nodes (e.g., cluster-2170_n1, cluster-2170_n2, etc.).

Host Configuration: Each host points to the cluster’s VIP or to the specific node’s CVM IP. By default, this will pull the same global information, but stay tuned.

Applying Templates: I apply the Virt-Nutanix-Hypervisor-Snmp-Custom template.

Surgical Filtering: This is the key secret. In the “Host check options,” I apply the custom macro FILTERNAME. This allows me to specify the exact name of the host to monitor. The plugin then filters the SNMP data sent by the VIP to return only what concerns my specific node.

Step 3: The trick to maintaining Cluster consistency

To keep an overview, I use Host Groups in Centreon. I created a group named HG-Cluster-Nutanix-Prod containing my 3 nodes. This allows me to create aggregated dashboards while keeping the “drill-down” capability (clicking to see details) for each physical machine.

Immediate benefits: Dashboarding and Peace of Mind

Since I switched to this configuration, my daily life as a sysadmin has radically changed:

Granular performance analysis: I can now identify a node consuming abnormally more RAM or CPU than its neighbors. It’s the perfect tool for detecting a “hot spot” or a VM distribution issue.

Increased responsiveness: When something goes wrong, Centreon sends me an alert with the specific node name (n1, n2, etc.). No more guessing games in Prism Element to find out where to focus my search.

Clean history: I have metric graphs per physical server, which greatly facilitates Capacity Planning and troubleshooting.

Conclusion

If you manage Nutanix, don’t settle for the superficial view offered by the VIP IP alone. By taking 10 minutes to declare your hosts individually in Centreon with the FILTERNAME macro, you move from “passive” monitoring to a true control tower.

My verdict is clear: node-level monitoring is the only way to guarantee true high availability and sleep soundly at night.

Read More

I still remember my first time entering a “serious” server room back in the mid-2000s. What struck me wasn’t so much the deafening roar of the air conditioning, but the physical density of the infrastructure.

Back then, to run a few hundred virtual machines, you didn’t just need “a cluster.” You needed entire rows. Power-hungry Blade Centers, monstrous Fibre Channel switches with their characteristic orange cables, and above all, sitting in the center of the room like a sacred totem: the Storage Array. Entire cabinets filled with 10k RPM mechanical disks, weighing as much as a small car and devouring rack units (‘U’) by the dozen.

This is what we call the 3-Tier architecture. While Hyperconvergence (HCI) and Public Cloud seem to be the norm today, it is crucial to understand that 3-Tier was the backbone of enterprise IT for nearly 20 years. To understand this architecture is to understand where we come from, and why we sought to change it.

In this article, the first in a series tracing the evolution from 3-tier virtualization infrastructures to Nutanix hyperconverged infrastructures, we will dissect this standard factually: how it works, why it dominated the market, and the technical limits that eventually rendered it obsolete for modern workloads.

Genesis: Why Did We Build It This Way?

To understand 3-Tier, you have to go back to the pre-virtualization era. A physical server hosted a single application (Windows + SQL, for example). It was the “Silo” model. Inefficient, expensive, and a nightmare to manage.

Virtualization (led by VMware) arrived with a promise: consolidate multiple virtual servers onto a single physical server. But for this magic to happen, there was an absolute technical condition: mobility.

For a VM to move from physical server A to physical server B without service interruption (the famous vMotion), both servers had to see exactly the same data, at the same moment.

This is where the architecture split into three distinct layers:

  1. We removed the disks from the servers (which now only do computing).
  2. We centralized all data in external shared storage (the Array).
  3. We connected everything via a dedicated ultra-fast network (the SAN).

It was a revolution: the server became “disposable,” or at least interchangeable, because it no longer held the data. But this centralization created a single point of complexity and performance: shared storage. It is the heart of the reactor, but also its Achilles’ heel.

The Anatomy of 3-Tier: Decoupling the Layers

If we were to draw this architecture, it would look like a three-layer cake, where each layer speaks a different language.

1. The Compute Layer

At the very top, we have the physical servers (Hosts). They run the hypervisor (ESXi, Hyper-V, KVM). Their role is purely mathematical: providing CPU and RAM to the virtual machines.

These servers are “Stateless”. They store nothing persistent. If a server burns out, it doesn’t matter: we restart the VMs on its neighbor (HA).

This logic was pushed to the extreme with “Boot from SAN”. We even ended up removing the small local disks (SD cards or SATA DOM) that contained the hypervisor OS so that the server was a total empty shell, loading its own operating system from the distant storage array. A technical feat, but a nightmare in case of SAN connectivity loss.

2. The Network Layer (SAN)

In the middle sits the Storage Area Network. It is the highway that transports data between the servers and the array. Historically, this didn’t go through classic Ethernet (too unstable at the time), but through a dedicated protocol: Fibre Channel (FC).

It is a deterministic and lossless network. Unlike Ethernet, which does “best effort,” FC guarantees that frames arrive without loss and in order.

If you have ever administered a SAN, you know the pain of Zoning. You had to manually configure on the switches which port (WWN) was allowed to talk to which other port. A single digit error in a 16-character hexadecimal address, and your production cluster would stop dead. It was a task so complex that it often required a dedicated team (“The SAN Team”).

3. The Storage Layer

At the very bottom, the Storage Array. It is a giant computer specialized in writing and reading blocks of data. It contains controllers (the brains) and disk shelves (the capacity).

The array aggregates dozens or even hundreds of physical disks to create large virtual volumes (LUNs) that it presents to the servers. It ensures data protection via hardware RAID.

All the intelligence resides in two controllers (often in Active/Passive or Asymmetric Active/Active mode). This is an architectural bottleneck: no matter if you have 500 ultra-fast SSDs behind them, if your two controllers saturate in CPU or cache, the entire infrastructure slows down. This is called the “Front-end bottleneck”.

The Strengths: Why This Model Ruled the World

It’s easy to criticize 3-Tier with our 2024 eyes, but we must recognize that it brought incredible stability.

  1. Robustness and Maturity: This is hardware designed never to fail. Storage arrays have redundant components everywhere (power supplies, fans, controllers, access paths). We talk about “Five Nines” (99.999% availability).
  2. Fault Isolation: If a server crashes, the storage lives on. If a disk breaks, hardware RAID rebuilds it without the server even noticing (or almost).
  3. Scale-Up Independence: This was the king argument. Running out of space but your CPUs are idling? You just buy an extra disk shelf. Running out of power but have plenty of space? You add a server. You could size each tier independently.

The Weaknesses: The Other Side of the Coin

Despite its robustness, the 3-Tier model began to show serious signs of fatigue in the face of modern virtualization. For us admins, this translated into shortened nights and a few premature gray hairs.

Operational Complexity

The greatest enemy of 3-tier is not failure, it’s the update. Imagine having to update your hypervisor version (ESXi). You can’t just click “Update.” You have to consult the HCL (Hardware Compatibility List). Is my new HBA card driver compatible with my Fibre Channel switch firmware, which itself must be compatible with my storage array OS version? It’s a house of cards. I’ve seen entire infrastructures become unstable simply because a network card firmware was 3 months behind the one recommended by the array manufacturer.

The Bottleneck (The “I/O Blender Effect”)

This is a fascinating and destructive phenomenon. Imagine 50 VMs on a host.

  • VM 1 writes a large sequential file.
  • VM 2 reads from a database.
  • VM 3 boots up.

At the VM level, operations are clean. But when all these operations arrive at the same time in the storage controller funnel, they get mixed up. What was a nice sequential write becomes a storm of random writes (Random I/O). Traditional array controllers, originally designed for single physical servers, often collapse under this type of load, creating latency perceptible to the end user.

The Hidden Cost

Finally, 3-Tier is expensive. Very expensive.

  • Licensing & Support: You pay for server support, SAN switch support, and array support (often indexed to data volume!).
  • Footprint: As mentioned in the introduction, this equipment consumes enormous amounts of space and electricity.
  • Human Expertise: It often requires a team for compute, a team for network, and a team for storage. Incident resolution times explode (“It’s not the network, it’s storage!” – “No, it’s the hypervisor!”).

Conclusion: A Necessary Foundation

The 3-Tier architecture is not dead. It remains relevant for very specific needs, like massive monolithic databases that require dedicated physical performance guarantees.

However, its management complexity and inability to scale linearly paved the way for a new approach. We started asking the forbidden question: “What if, instead of specializing hardware, we used standard servers and managed everything via software?”

It was this reflection that gave birth to Software-Defined Storage (SDS) and Hyperconvergence (HCI). But that is a topic for our next article.

Read More

It’s one of those mornings where the coffee tastes a little different. The taste of major announcements that are bound to change our habits as administrators. Nutanix has just released a trio of major updates into the wild: AOS 7.5, AHV 11.0, and Prism Central 7.5.

Let’s be clear from the start: I’ve combed through the Release Notes for you, and this isn’t just a simple “Patch Tuesday.” It is a structural overhaul. Nutanix is no longer content with just improving its HCI; the vendor is breaking its own dogmas (hello external storage and compute-only nodes) and drastically tightening security, even if it shakes up our old reflexes.

While on paper, the promises of performance (AES everywhere) and flexibility (Elastic Storage) are enticing, my field experience dictates a certain prudence. When you mess with the storage engine and SSH access at the same time, you don’t rush into production without reading the fine print carefully. That is exactly what I’m proposing here: an unfiltered technical analysis of what awaits you.

AOS 7.5: Performance & Architecture

Let’s start with the core of the reactor: AOS 7.5. If you thought the Nutanix storage architecture was set in stone, think again. This version marks a turning point in hot data and disk space management.

The Key Concept: AES Becomes the Absolute Standard

Until now, the Autonomous Extent Store (AES) was often reserved for high-performance All-Flash environments. With 7.5, that’s over: AES becomes the default architecture for all deployments, whether All-Flash or Hybrid.

Why is this important? Because AES improves metadata locality and reduces CPU consumption for I/O. But be careful, the critical novelty here is the automatic migration. If you upgrade an existing hybrid cluster to 7.5, AOS will launch a background conversion task to switch to AES.

Do not underestimate the I/O impact of this “transparent” conversion. Even if Nutanix handles it in the background, metadata restructuring is never trivial on a loaded cluster. Furthermore, Nutanix introduces a revamped Garbage Collection (GC) (“Accelerated Data Reclamation”). It is now capable of cleaning multiple “holes” in an Erasure Coding stripe in a single pass and merging inefficient stripes. It’s brilliant for efficiency, but it confirms that the engine is working much more “intelligently” under the hood.

The Unexpected Opening: Pure Storage and Dense Nodes

This might be the strongest sign of this release: Nutanix is officially opening up to third-party storage. AOS 7.5 supports connecting to Pure Storage FlashArray arrays via NVMeoF/TCP for capacity storage. Nutanix handles the compute, Pure handles the data. For HCI purists like me, this is a paradigm shift, but one that meets a real need for disaggregation.

Finally, for those managing storage monsters, note that existing All-Flash nodes can be upgraded to support up to 185 TB per node, while maintaining aggressive RPOs (NearSync/Sync).

AHV 11.0 & Flexibility: The Era of “Compute-Only” and Elastic Storage

If AOS 7.5 boosts the engine, AHV 11.0 changes the bodywork. For a long time, Nutanix preached the dogma of strict hyperconvergence: “You buy identical nodes, you expand storage and compute at the same time.” With this version, I feel like Nutanix is finally listening to those who, like me, found themselves with too much CPU and not enough disk (or vice versa).

The Key Concept: Official Disaggregation

It’s a small revolution: Nutanix now allows the deployment of “Compute-Only” nodes much more flexibly. We are seeing the arrival of a standalone AHV installer. Concretely, you can manually install AHV via an ISO on a server, without the heavy process of a full re-imaging via Foundation.

For labs or rapid compute power expansions, this is a phenomenal time-saver. But be careful, this requires increased rigor regarding hardware compatibility management, as Foundation will no longer be there to act as a safeguard during installation.

The Awaited Feature: Elastic VM Storage

This is undoubtedly the feature I was waiting for the most to break down silos. With Elastic VM Storage, available starting with AHV 11.0 and AOS 7.5, you can finally share a storage container from one AHV cluster to another AHV cluster within the same Prism Central.

Imagine: your Cluster A is bursting at the seams storage-wise, while your Cluster B sits half-empty. Before, you had to move VMs. Now, you can mount a container from Cluster B onto Cluster A and deploy your VMs directly on it.

It’s great, but be cautious: it’s not magic. You are introducing a critical network dependency between two clusters that were previously isolated. If your inter-cluster network fails, the VMs on Cluster A whose storage sits on Cluster B go down. Moreover, Nutanix clearly states that this allows “serving storage from a remote cluster,” which necessarily implies additional network latency compared to native data locality. Reserve this for workloads that are not sensitive to disk latency, or for temporary overflow.
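Before mounting a remote container, it is worth checking its real headroom first. A sketch, assuming the v2 storage_containers endpoint on Prism Element and the storage.usage_bytes / storage.capacity_bytes usage stats; treat both as assumptions to check against your API Explorer.

```python
# Hedged sketch: read a Prism Element v2 container listing and compute
# free headroom before pointing another cluster at it. The endpoint and
# the usage_stats keys are assumptions to verify on your AOS version.
import base64
import json
import ssl
import urllib.request


def get_containers(host: str, user: str, password: str) -> list[dict]:
    """GET /PrismGateway/services/rest/v2.0/storage_containers/."""
    url = f"https://{host}:9440/PrismGateway/services/rest/v2.0/storage_containers/"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    ctx = ssl._create_unverified_context()  # lab-only, self-signed certs
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.load(resp).get("entities", [])


def free_ratio(container: dict) -> float:
    """Fraction of the container still free, from its usage stats."""
    stats = container.get("usage_stats", {})
    used = int(stats.get("storage.usage_bytes", 0))
    cap = int(stats.get("storage.capacity_bytes", 0))
    return 1.0 - used / cap if cap else 0.0
```

If free_ratio comes back low on the “half-empty” cluster, the overflow plan is already dead on arrival; better to learn that from a GET than from a production incident.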

Finally, note the arrival of Dual Stack IPv6. AHV can now talk to your DNS, NTP, and Syslog servers in IPv6. A necessary update to align with modern network standards.

Security and Governance: Locking Everything Down (SSH, vTPM, Profiles)

Let’s move on to the part that will make command-line regulars (myself included) grind their teeth. Nutanix has decided to tighten the screws on security, and they aren’t kidding around.

The Key Concept: The Digital Fortress

The goal is clear: reduce the attack surface, especially against ransomware that often attempts to propagate via lateral movements on management interfaces. Nutanix is therefore introducing mechanisms to limit direct human access to infrastructure components (CVM and Hosts).

The Critical Change: CVM Secure Access (The End of SSH is coming)

This is the single biggest point of caution in this article. With AOS 7.5, you now have the option (and a strong incentive) to totally disable SSH access to CVMs and AHV hosts.

On paper, this is excellent for security: it is a textbook reduction of the attack surface. In operational reality, it is a brutal cultural change. No more quick ssh nutanix@cvm to check a log or run a quick diagnostic script. Everything must go through APIs or the console.
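To be fair, the “quick check” reflex does have an API-era equivalent. A minimal sketch of the documented v3 clusters/list call standing in for a glance at the cluster over SSH; the exact response field paths (status.name, the nested build version) are assumptions to double-check on your version.

```python
# Hedged sketch: the API-era replacement for "quick ssh to check the
# cluster". The endpoint is the v3 clusters/list; the response field
# paths used below are assumptions to verify on your version.
import base64
import json
import ssl
import urllib.request


def list_clusters(host: str, user: str, password: str) -> list[dict]:
    """POST /api/nutanix/v3/clusters/list on Prism Central."""
    url = f"https://{host}:9440/api/nutanix/v3/clusters/list"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        url,
        data=json.dumps({"kind": "cluster"}).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {token}"},
        method="POST",
    )
    ctx = ssl._create_unverified_context()  # lab-only
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.load(resp).get("entities", [])


def summarize(entity: dict) -> str:
    """'name (AOS x.y)' from one v3 cluster entity (assumed field paths)."""
    status = entity.get("status", {})
    name = status.get("name", "?")
    version = (status.get("resources", {})
                     .get("config", {})
                     .get("build", {})
                     .get("version", "?"))
    return f"{name} (AOS {version})"
```

Fifteen lines of boilerplate instead of one ssh command, yes, but every call is authenticated, auditable, and survives the lockdown.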

Danger Warning! Before checking that “Disable SSH” box, check your migration procedures. The Release Notes are explicit: disabling SSH breaks Cross-Cluster Live Migration (CCLM) workflows, whether in On-Demand mode (OD-CCLM) or Disaster Recovery (DR-CCLM). These operations still rely on SSH tunnels between source and destination hosts. If you cut SSH, your migrations will fail, and you will have to re-enable it to make them work. This is a major operational constraint to anticipate.

Governance: vTPM & Guest Profiles

For highly sensitive environments, AHV now supports storing vTPM encryption keys in an external KMS. This allows centralizing key management and aligning the vTPM security policy with the cluster’s “Data-at-Rest” encryption policy.

On the quality-of-life side, I welcome the arrival of reusable Guest Customization Profiles. No more tedious copy-pasting of Sysprep scripts with every VM clone. You create a profile (Windows guests require NGT 4.5 minimum), store it, and apply it on the fly to clones or templates. It’s simple, efficient, and avoids input errors.

Prism Central 7.5: The Interface That Makes Life Easier (NIM & Policies)

We finish this overview with Prism Central 7.5 (pc.7.5). If AOS is the engine and AHV the chassis, PC is the dashboard. And believe me, it is fleshing out considerably to spare us some thankless manual tasks.

The Key Concept: Intelligent Orchestration

The major addition is the arrival of VM Startup Policies. This is a feature I’ve been waiting for for years to replace my cobbled-together startup scripts. Concretely, you can now define the exact restart order of VMs during an HA event (node failure) or a cluster restart.

This allows managing application dependencies cleanly: “Start the Database, wait for it to be UP, then start the Application Server”. It’s native, integrated into the interface, and greatly secures recovery plans.
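Under the hood, a startup policy is just a dependency graph. If you want to reason about (or pre-validate) an ordering before encoding it in Prism Central, Python’s standard graphlib resolves it in a few lines; this is an illustration of the concept, not the feature itself, and the tier names are made up.

```python
# Illustration only: compute a boot order from "X depends on Y"
# relations with the standard library's topological sort. A reasoning
# aid for designing a Startup Policy, not the Prism Central feature.
from graphlib import TopologicalSorter


def boot_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return a start order honoring every dependency (deps first)."""
    return list(TopologicalSorter(depends_on).static_order())


# "app-server needs database; web-front needs app-server"
tiers = {"app-server": {"database"}, "web-front": {"app-server"}}
# boot_order(tiers) yields database first, then app-server, then web-front
```

A nice side effect: TopologicalSorter raises a CycleError if your “dependencies” loop back on themselves, which is exactly the kind of design mistake you want to catch before an HA event does it for you.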

For large-scale environments, note the appearance of NIM (Nutanix Infrastructure Manager). It is a new orchestrator designed to provision, configure, and manage your datacenters in a standardized way, aligned with the famous “Nutanix Validated Designs” (NVD). It is clearly aimed at very large deployments that want to avoid configuration drift.

Enhanced Resilience: PC Backup & Restore

Until now, restoring a crashed Prism Central could be an adventure, especially if the original cluster was itself down. Nutanix has lifted a major technical constraint: you can now recover a Prism Central instance from a backup located on any Prism Element cluster.

This is a detail that changes everything in case of a total site disaster. Previously, recovery from a Prism Element backup was restricted to the specific cluster where PC was registered. This new flexibility, coupled with the ability to back up to a generic S3 Object Store, makes the management architecture much more robust. We are no longer putting all our eggs in one basket.

Conclusion & Recommendations: Maturity Has a Price

After dissecting these three release notes, my feeling is clear: Nutanix is reaching an impressive level of maturity. The generalization of AES and the opening to external storage show that the platform is ready for the most demanding workloads and the most complex architectures.

However, as a “Prudent Ghost Writer,” I must raise a final red flag before you click “Upgrade.”

⚠️ Watch out for prerequisites: Do not rush headlong into the Prism Central update. Version pc.7.5 requires your Prism Element clusters to run at least AOS 7.0.1.9. If you are on an earlier version, deployment will be blocked. You will have to plan your migration path rigorously.
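A dumb numeric comparison is enough to pre-check the fleet, as long as you split the version string properly: a raw string comparison would happily claim “7.0.1.10” is older than “7.0.1.9”. A sketch; the 7.0.1.9 floor comes straight from the prerequisite above.

```python
# Sketch: check an AOS version string against the pc.7.5 prerequisite.
# Compare segment by segment as integers, never as raw strings.
def meets_minimum(version: str, minimum: str = "7.0.1.9") -> bool:
    def parts(v: str) -> list[int]:
        return [int(p) for p in v.split(".")]

    a, b = parts(version), parts(minimum)
    width = max(len(a), len(b))
    a += [0] * (width - len(a))  # pad so "7.5" compares as 7.5.0.0
    b += [0] * (width - len(b))
    return a >= b
```

Run this against the version reported by each registered Prism Element before touching the PC upgrade button, and you know immediately which clusters need an AOS hop first.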

This is an unavoidable update for the performance and security gains, but it is also a structural update. The AES conversion, the potential SSH deactivation, and the new network dependencies for elastic storage require validating these changes in a pre-production environment.

Take the time to test, check your compatibility matrices, and above all, do not cut SSH before verifying that you do not have any planned inter-cluster migration (CCLM)!

To your keyboards, and happy upgrading!
