Yahoo! Japan’s owner consolidating 164 OpenStack clusters into one

Customizations are causing pain so new cloud will stick to upstream cuts of the open source stack

The Register

LY Corporation, the Japanese web giant that dominates messaging, e-commerce and payments in many Asian countries, has revealed it is replacing a heavily customized OpenStack cloud with a more conventional cut of the open source cloud stack – and making massive consolidations along the way.

Formed in 2023 when Yahoo! Japan merged with messaging giant LINE, LY Corp is trying to merge its infrastructure into a new unified cloud called “Flava” to power its services. That cloud needs to operate at significant scale, because services like the LINE messaging app and the Yahoo! Japan portal have around 300 million monthly users.

Late last week, the company revealed that LINE’s internal cloud, called “Verda”, comprised 130,000 VMs on 11,000 hosts, sprawled across four OpenStack clusters. Yahoo! Japan’s “YNW” cloud ran on 27,000 servers and hosted more than 160,000 VMs across over 160 OpenStack clusters.

The company’s plan for the new “Flava” cloud calls for 500 or more hosts, 9,000-plus VMs, and a single OpenStack cluster. The company also uses the open source Envoy proxy, Linux with the extended Berkeley Packet Filter (eBPF) and eXpress Data Path (XDP), FRRouting (FRR), and Ceph.

“In the legacy cloud, too many custom modifications to OpenStack made upgrades difficult,” according to Ryuutarou Inoue, the head of LY’s Cloud Infrastructure Unit. “Flava adopts an architecture that stays aligned with upstream OpenStack. We keep custom patches to a minimum, and when functional changes are needed, we proactively contribute them upstream so they can be merged into the main project.”

“By removing upgrade barriers, we enable a regular update cadence and keep both security and the latest features continuously available,” he added.

Inoue said LY also aims to “avoid over-investing in availability guarantees at the infrastructure layer alone” and instead assumes failure is always possible. He said Flava’s design tries to cover that with the following three “pillars”:

  • Pursuing statelessness - We define data stored on a virtual machine’s (VM) root disk (ephemeral disk) as temporary. We move persistent data to external storage to minimize service impact when an instance fails.
  • Application-driven availability - Rather than attempting to provide perfect availability through infrastructure alone, we ensure reliability by combining infrastructure with application-side architecture, reducing unnecessary infrastructure complexity.
  • Faster recovery - In an incident, the priority is not restoring the exact previous state. It’s keeping the service running. We recommend an operational approach that rebuilds environments quickly using Infrastructure as Code (IaC), rather than spending extended time on root-cause analysis first.
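The statelessness pillar above can be sketched in a few lines of Python. This is purely illustrative – the class and store names are invented for this example and do not reflect LY’s actual code – but it shows the shape of the idea: the local (ephemeral) disk is treated as disposable, and the only durable copy of data lives in external storage, so a replacement instance can pick up where a failed one left off.

```python
# Hypothetical sketch of the "pursuing statelessness" pillar. All names
# are illustrative; LY's actual implementation is not public.

class ExternalStore:
    """Stand-in for external storage (e.g. a Ceph-backed volume)."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class StatelessService:
    """A service instance whose only durable state is in the external store."""
    def __init__(self, store):
        self.store = store
        self.cache = {}              # ephemeral: lost when the instance fails

    def write(self, key, value):
        self.cache[key] = value      # fast local copy on the root disk
        self.store.put(key, value)   # durable copy survives instance loss


store = ExternalStore()
first = StatelessService(store)
first.write("session:42", "active")

# The instance fails; a replacement is rebuilt against the same external
# store. Its local cache is empty, but the persistent data is intact.
replacement = StatelessService(store)
assert replacement.cache == {}
assert replacement.store.get("session:42") == "active"
```

Because nothing irreplaceable lives on the instance itself, the “faster recovery” pillar follows naturally: a failed VM can simply be rebuilt from IaC definitions rather than painstakingly restored.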

The company is also very keen on observability. Inoue said his team uses Prometheus, Grafana, and internal dashboards “to continuously monitor overall cloud health and trends to catch early signs of anomalies.” If those tools show signs of trouble, “we drill into deep signals such as kernel-level traces and packet captures to pinpoint the cause.”
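The “catch early signs of anomalies” step can be sketched as a simple rolling-baseline check – the kind of trend test a monitoring rule might encode before an engineer drills into kernel traces. The threshold and function here are illustrative assumptions, not LY’s actual alerting logic.

```python
# Illustrative anomaly check: flag a sample that strays too far from the
# rolling baseline. A crude stand-in for a real alerting rule.
from statistics import mean, stdev


def anomalous(history, latest, threshold=3.0):
    """Return True if `latest` deviates from the baseline in `history`
    by more than `threshold` standard deviations."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Steady host load, then a sudden spike worth investigating further.
baseline = [0.8, 1.0, 0.9, 1.1, 1.0, 0.95]
assert not anomalous(baseline, 1.05)  # within normal variation
assert anomalous(baseline, 5.0)       # a deep-dive trigger
```

In LY’s described workflow, a positive result from a check like this is only the starting point: the deep signals (kernel-level traces, packet captures) are what actually pinpoint the cause.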

Inoue said LY experiences hardware failures “somewhere every day” and handling them all manually is impossible. “Today, we’ve automated most of the flow, from failure detection to requesting on-site data center work and reintegrating replaced hardware back into clusters,” he wrote. “That said, some tasks and irregular failure patterns still require hands-on engineering response. Going forward, we aim to use large language models for these decision-heavy workflows as well, further advancing automation.”
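The flow Inoue describes – automated handling for known failure patterns, human escalation for irregular ones – can be sketched as a small pipeline. Every name and failure category below is hypothetical; LY’s actual tooling is not public.

```python
# Hypothetical sketch of an automated hardware-failure flow: known
# patterns run straight through detection, on-site work, and
# reintegration; irregular patterns are escalated to an engineer.

KNOWN_PATTERNS = {"disk_failure", "dimm_error", "psu_fault"}  # illustrative


def handle_failure(host, pattern):
    """Return the list of steps taken for a detected hardware failure."""
    steps = [f"detected {pattern} on {host}"]
    if pattern not in KNOWN_PATTERNS:
        # Irregular failure: hands-on engineering response required.
        steps.append("escalated to on-call engineer")
        return steps
    steps.append("opened on-site data-center work request")
    steps.append("hardware replaced")
    steps.append(f"{host} reintegrated into cluster")
    return steps


assert handle_failure("node-17", "disk_failure")[-1] == \
    "node-17 reintegrated into cluster"
assert handle_failure("node-42", "weird_kernel_panic")[-1] == \
    "escalated to on-call engineer"
```

The LLM ambition Inoue mentions would, in effect, target the escalation branch: shrinking the set of “irregular” cases that still need a human decision.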

LY needs this to work because it has suffered significant infosec incidents that exposed users’ data, prompting Japan’s government to order it to overhaul its tech stack to improve security and privacy. ®