Hyper-Converged Infrastructure

Phison HCI

Phison Hyper-Converged Infrastructure (HCI) software is the core architecture of the next-generation AI Data Platform. It integrates computing, storage, GPU resources, and an AI management platform to give enterprises a one-stop AI infrastructure solution.

90%↑ GPU Utilization

vGPU partitioning and intelligent scheduling lift average GPU utilization from 25% to over 90%, turning idle capacity into active inference and improving hardware ROI by 3–4×.
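The ROI figure follows from the utilization ratio. The sketch below shows that arithmetic under two simplifying assumptions that are not stated in the source: hardware cost is fixed, and useful output scales in proportion to utilization. The function name is illustrative.

```python
def roi_multiplier(util_before: float, util_after: float) -> float:
    """Return how many times more useful work the same hardware delivers.

    Assumes output scales linearly with utilization and cost is unchanged.
    """
    if not 0 < util_before <= 1 or not 0 < util_after <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return util_after / util_before


# 25% average utilization raised to 90%, the figures quoted above.
print(f"{roi_multiplier(0.25, 0.90):.1f}x")  # 3.6x
```

A ratio of 3.6 sits inside the quoted 3–4× range; real gains also depend on scheduling overhead and workload mix, which this model ignores.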

60%↓ Lower TTFT

Distributed KV-Cache sharing achieves an 80% hit rate across the cluster, eliminating redundant prefill computation and cutting time-to-first-token (TTFT) by up to 60%.
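One way to see how a cache hit rate translates into a TTFT reduction is a simple expected-value model. This is a rough sketch, not Phison's measurement method: it assumes TTFT is dominated by prefill time, and that a cache hit skips a fixed share of the prefill work. The 1000 ms prefill time and the 0.75 cached share are hypothetical values chosen for illustration.

```python
def expected_ttft_ms(prefill_ms: float, hit_rate: float,
                     cached_fraction: float, overhead_ms: float = 0.0) -> float:
    """Expected time-to-first-token when some prefill work is served from cache.

    hit_rate: share of requests that find reusable KV entries.
    cached_fraction: share of prefill work skipped on a hit (assumed constant).
    """
    for name, value in (("hit_rate", hit_rate), ("cached_fraction", cached_fraction)):
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must be in [0, 1]")
    saved_ms = prefill_ms * hit_rate * cached_fraction
    return overhead_ms + prefill_ms - saved_ms


baseline = expected_ttft_ms(1000, hit_rate=0.0, cached_fraction=0.75)
with_cache = expected_ttft_ms(1000, hit_rate=0.8, cached_fraction=0.75)
print(f"reduction: {1 - with_cache / baseline:.0%}")  # reduction: 60%
```

Under these assumptions an 80% hit rate yields a 60% reduction only when each hit covers three quarters of the prompt, which is consistent with the "up to" qualifier in the claim.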

Zero-Downtime Operations

Built-in health monitoring and automatic failover redirect traffic within seconds when a node goes offline, enabling rolling maintenance with zero service interruption.
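The failover behavior described above can be pictured as a router that only sends traffic to nodes with a recent heartbeat. The following is a minimal sketch of that general pattern, not Phison's implementation; the class name, the 3-second timeout, and the round-robin policy are all assumptions. Timestamps are passed in explicitly to keep the example deterministic.

```python
class FailoverRouter:
    """Round-robin router that skips nodes whose heartbeat has gone stale."""

    def __init__(self, nodes, timeout_s: float = 3.0, now: float = 0.0):
        self.timeout_s = timeout_s
        self.last_heartbeat = {node: now for node in nodes}
        self._counter = 0

    def heartbeat(self, node: str, now: float) -> None:
        """Record that a node reported healthy at time `now` (seconds)."""
        self.last_heartbeat[node] = now

    def healthy_nodes(self, now: float) -> list:
        return [n for n, t in self.last_heartbeat.items()
                if now - t <= self.timeout_s]

    def route(self, now: float) -> str:
        """Pick the next healthy node; offline nodes are never selected."""
        healthy = self.healthy_nodes(now)
        if not healthy:
            raise RuntimeError("no healthy nodes available")
        node = healthy[self._counter % len(healthy)]
        self._counter += 1
        return node


router = FailoverRouter(["gpu-node-01", "gpu-node-02"], timeout_s=3.0, now=0.0)
router.heartbeat("gpu-node-01", now=5.0)   # gpu-node-02 stays silent
print(router.route(now=6.0))               # gpu-node-01
```

Rolling maintenance maps onto the same mechanism: a drained node stops sending heartbeats and drops out of rotation until it reports in again.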

Enterprise AI Infrastructure Challenges

Enterprises deploying Private AI typically face the following fundamental challenges, which Phison HCI is designed to address.

Persistently Low GPU Utilization

Full-card or passthrough deployment often reaches only 20–30% GPU utilization, leaving compute underused and hardware ROI extremely low.

Limited LLM Context Length

GPU HBM often cannot hold the KV Cache that large models need, so long documents and long conversations degrade quickly or fail to complete.

Lengthy Deployment Cycles

From procurement and networking to the container platform and model launch, traditional workflows often take weeks or months, slowing AI innovation.

Fragmented Multi-System Management

Compute, storage, network, containers, and monitoring are managed separately, so IT teams operate across five or more systems, which raises labor cost and the risk of configuration drift.

What Core Technology Does Phison Own?

Phison HCI builds on three self-developed technologies: vGPU partitioning and time-sharing, aiDAPTIVCache tiering from HBM to NVMe, and multi-node tensor/pipeline-parallel scale-out. Together they eliminate GPU idle waste, extend effective memory across the cluster, and run large-model inference at production scale.

vGPU Partitioning + Time-Sharing

Split a single GPU into vGPU instances with on-demand compute and memory allocation. Multiple models or tenants time-share the same card with QoS isolation, and quotas adjust dynamically at peak load.
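As a rough illustration of memory quotas on a shared card, the sketch below tracks per-tenant reservations and refuses any request that would oversubscribe the GPU. It is a bookkeeping model only: the class, the method names, and the 80 GB card size are hypothetical, and time-sharing of compute and QoS enforcement are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class GpuCard:
    """Tracks vGPU memory reservations on one physical card."""
    total_mem_gb: int
    slices: dict = field(default_factory=dict)  # tenant name -> reserved GB

    def free_mem_gb(self) -> int:
        return self.total_mem_gb - sum(self.slices.values())

    def allocate(self, tenant: str, mem_gb: int) -> bool:
        """Reserve memory for a new tenant; refuse rather than oversubscribe."""
        if mem_gb <= 0 or tenant in self.slices or mem_gb > self.free_mem_gb():
            return False
        self.slices[tenant] = mem_gb
        return True

    def resize(self, tenant: str, mem_gb: int) -> bool:
        """Grow or shrink an existing quota, for example at peak load."""
        if tenant not in self.slices or mem_gb <= 0:
            return False
        if mem_gb - self.slices[tenant] > self.free_mem_gb():
            return False
        self.slices[tenant] = mem_gb
        return True

    def release(self, tenant: str) -> None:
        self.slices.pop(tenant, None)


card = GpuCard(total_mem_gb=80)
card.allocate("chat-model", 40)
card.allocate("embedding", 20)
print(card.allocate("batch-job", 30))  # False: only 20 GB remain
print(card.resize("chat-model", 60))   # True: uses the remaining 20 GB
```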

KV Cache Extension Across Multiple Nodes

aiDAPTIVCache tiers the cache from GPU HBM to NVMe SSD, expanding effective memory by more than 10×. The KV Cache is shared across nodes so that Prefill results can be reused by Decode workloads, supporting inference over 128K+ tokens without out-of-memory (OOM) errors.

Multi-Node Scale-Out Architecture

Tensor parallelism and pipeline parallelism split 70B–405B models across nodes. Cross-node KV Cache sharing reduces inter-node traffic, and throughput scales linearly with each node added.
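Pipeline parallelism assigns contiguous groups of transformer layers to different stages, typically one stage per node or GPU group. The sketch below shows only that layer-assignment step, spreading layers as evenly as possible; it is a generic illustration rather than Phison's scheduler, tensor parallelism (splitting individual weight matrices) is not shown, and the 80-layer, 4-stage figures are illustrative.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices into contiguous, near-equal pipeline stages."""
    if num_stages <= 0 or num_layers < num_stages:
        raise ValueError("need at least one layer per stage")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


for stage, layers in enumerate(partition_layers(80, 4)):
    print(f"stage {stage}: layers {layers.start}-{layers.stop - 1}")
```

Keeping stages contiguous means each stage only passes activations to its immediate successor, which is what keeps inter-node traffic bounded as stages are added.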

Phison HCI Architecture

Through software-hardware integration and modular design, enterprises can quickly deploy AI workstations, Private AI, AI agents, RAG, AI inference, and Edge AI applications, lowering adoption barriers and accelerating AI implementation.

User Surfaces

User Login

AI Workspace

Applications

Compute console

Storage console

Management console

Platform Services

AI Platform

On-prem model upload

OCI Artifacts support

Rapid model deployment

Performance monitoring

Service Deployment & Management

Scheduling & orchestration

Backend services

Compute Resources

CPU · RAM

GPU & vGPU

aiDAPTIVCache

Cluster
Container
VM
Multi-Tenancy
Access Control
Audit
Cost
Image
Monitoring

Phison HCI Software

Phison hyper-converged software unifies heterogeneous GPU, XPU, storage, and VM resources under a single control plane, pooling mixed hardware to maximize AI infrastructure ROI.

Software Deployment & Container Image Management
CPU / GPU / VM Resource Management
vGPU
Security & Access Control
Application Marketplace
Storage Management
Cost Management
Monitoring, Observability & Alerting

Unified Management Console

One control plane for Kubernetes, storage, VMs, AI inference, and monitoring: operate your entire AI infrastructure without switching tools.

Application Hub

Browse and install Helm charts

Chart	Version	App Version	Repository
vllm	0.6.2	v0.6.2	otterscale-charts
llm-inference	1.2.0	1.2.0	otterscale-charts
prometheus-stack	55.5.0	2.47.0	community-charts
rook-ceph	1.14.2	v1.14.2	rook-release
kubevirt	0.59.0	v1.0.0	kubevirt-charts

Model Status

Monitor LLM inference health, latency, and GPU allocation in real time.

production · Last 1h

Model: Llama-3-70B

KV Cache Usage: 91.2% (max across replicas)
Queue Depth: waiting requests
Success Rate: 99.7% (request success)
Time to First Token: 142 ms (P95 / P99 latency)
Per Output Token: 18 ms (P95 / P99 latency)
End-to-End Latency: 1.24 s (P95 / P99 latency)
Throughput: 2.4K tok/s (tokens per second)
Finish Reason: request completion breakdown by stop, length, abort
vGPU Memory: 4 / 4 allocated (gpu-node-01 92%, gpu-node-02 88%, gpu-node-03 76%)
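The latency cards in the Model Status panel report tail percentiles (P95 / P99) rather than averages. For readers unfamiliar with the metric, here is a small sketch of the nearest-rank percentile definition. The console's actual aggregation method is not documented here, so treat this as one common definition rather than a description of the product; the sample values are synthetic.

```python
import math


def percentile(samples, p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Synthetic TTFT samples in milliseconds: 1, 2, ..., 100.
ttft_ms = list(range(1, 101))
print(percentile(ttft_ms, 95), percentile(ttft_ms, 99))  # 95 99
```

Tail percentiles matter for inference services because a small share of slow requests, such as long prompts that miss the cache, can dominate user-perceived latency while barely moving the mean.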

Pods

Name	Namespace	Ready	Status	Restarts	Age
llm-inference-7f8b9c	ai-prod	2/2	Running	0	2d
vllm-worker-0	ai-prod	1/1	Running	1	2d
embedding-svc-4a2c	ai-prod	1/1	Running	0	5h
rag-indexer-batch	ai-prod	0/1	Pending	0	12m
prometheus-server-0	monitoring	2/2	Running	0	14d
grafana-7b4d89	monitoring	1/1	Running	0	14d

Deployments

Name	Namespace	Ready	Available	Age
llm-inference	ai-prod	2/2	2	2d
vllm-worker	ai-prod	1/1	1	2d
embedding-svc	ai-prod	1/1	1	5h
prometheus-server	monitoring	2/2	2	14d
grafana	monitoring	1/1	1	14d

Object Bucket Claims

Name	Namespace	Storage Class	Status	Age
model-artifacts	ai-prod	ceph-rbd	Bound	14d
training-data	ai-prod	ceph-rgw	Bound	7d
backup-snapshots	kube-system	ceph-rbd	Bound	30d
rag-documents	ai-prod	ceph-rgw	Pending	2h

Services

Name	Namespace	Type	Cluster IP	Age
llm-inference-svc	ai-prod	ClusterIP	10.96.1.42	2d
vllm-worker-svc	ai-prod	ClusterIP	10.96.2.18	2d
embedding-svc	ai-prod	ClusterIP	10.96.3.55	5h
prometheus-server	monitoring	ClusterIP	10.96.8.10	14d

Core Technology Benefits

Measurable performance and efficiency gains powered by Phison HCI's proprietary technologies.

Lower Inference Cost

Reduce idle resources and directly lower the per-token inference cost.

Linear Scale-Out

Add new nodes to increase throughput linearly without redeploying the model.

Higher Concurrency

Combined with vGPU partitioning, a single host can serve more concurrent requests.

vGPU Resource Partitioning Technology

A single GPU can be divided into multiple virtual GPU (vGPU) instances, allowing different workloads, such as training, inference, and batch processing, to share the same card. The Phison HCI platform dynamically allocates these instances to multiple AI tasks or users, which eliminates idle GPU waste and enables fine-grained resource scheduling.

Core value

Maximizes GPU utilization
Lowers AI adoption costs
Enables multiple workloads to run in parallel
Supports secure multi-tenant isolation

Applicable scenarios

Shared AI workstations
Multi-department AI development
AI inference service platforms
GPU resource pool management

Multi-Node KV Cache Expansion Technology

Phison's self-developed KV Cache expansion technology uses high-speed NVMe storage as an extension of GPU HBM. It addresses the context-length limits of large models and supports a shared cache across multiple nodes, significantly reducing GPU VRAM pressure and improving large-model inference efficiency.

Core technical features

GPU / DRAM / SSD / Remote SSD hierarchical caching architecture
Dynamic KV Cache expansion
Support for long-context inference
Shared cache resources across multiple nodes
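The hierarchical caching idea in the list above can be illustrated with a toy model in which entries evicted from a faster tier spill into the next slower one instead of being discarded. This is a conceptual sketch only: the class, the least-recently-used policy, and the per-tier entry counts are assumptions, and it does not reflect how aiDAPTIVCache is actually implemented.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy multi-tier cache: LRU entries spill from faster to slower tiers."""

    def __init__(self, capacities: dict):
        # capacities: tier name -> max entries, listed fastest tier first.
        self.capacities = dict(capacities)
        self.order = list(self.capacities)
        self.tiers = {name: OrderedDict() for name in self.order}

    def _insert(self, level: int, key, value) -> None:
        if level >= len(self.order):
            return  # fell off the slowest tier: the entry is dropped
        name = self.order[level]
        tier = self.tiers[name]
        tier[key] = value
        tier.move_to_end(key)
        if len(tier) > self.capacities[name]:
            old_key, old_value = tier.popitem(last=False)  # least recently used
            self._insert(level + 1, old_key, old_value)    # demote one tier down

    def put(self, key, value) -> None:
        for tier in self.tiers.values():
            tier.pop(key, None)  # avoid duplicate copies across tiers
        self._insert(0, key, value)

    def get(self, key):
        """Return (value, tier it was found in), promoting the entry on a hit."""
        for name in self.order:
            if key in self.tiers[name]:
                value = self.tiers[name].pop(key)
                self._insert(0, key, value)
                return value, name
        return None, None


cache = TieredKVCache({"HBM": 1, "DRAM": 1, "SSD": 1})
for session in ("a", "b", "c"):
    cache.put(session, f"kv-{session}")
print(cache.get("a"))  # ('kv-a', 'SSD'): served from the slowest tier, then promoted
```

The point of the extra tiers is that a miss in HBM becomes a slower read rather than a full prefill recomputation.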

Technical benefits

Improves model inference throughput
Reduces GPU memory bottlenecks
Reduces the need to purchase high-end GPUs
Improves overall GPU utilization