Apache HBase is a rapidly evolving, always-on system that can, in rare situations, end up with corrupted internal state, causing data loss or data unavailability. In this presentation we’ll discuss the common symptoms and causes of poor cluster health and the challenges administrators face in diagnosing issues and finding the root cause. We’ll pay careful attention to episodes of HBase instability and how they were diagnosed and resolved. We’ll start by pointing out early symptoms found in the various log files available to administrators. We’ll then discuss how HBase issues often appear first in ZooKeeper, and how administrators can get helpful information about HBase from ZooKeeper. We’ll conclude by detailing how to recover from the rare cases of corruption, covering some of the core internal HBase invariants, some causes of these corruptions, and the real-world techniques and recently introduced tools that can be used for repair.