The biggest possible issue with Hierarchical RCU put forward as of this writing is the fact that force_quiescent_state() involves a potential walk through all CPUs' rcu_data structures. On a machine with thousands of CPUs, this could potentially represent an excessive impact on scheduling latency, given that this scan is conducted with interrupts disabled.
Should this become a problem in real life, one fix is to maintain separate force_quiescent_state() sequencing on a per-leaf-rcu_node basis as well as the current per-rcu_state ->signaled state variable. This would allow incremental forcing of quiescent states on a per-leaf-rcu_node basis, greatly reducing the worst-case degradation of scheduling latency.
In the meantime, those caring deeply about scheduling latency can limit the number of CPUs in the system or use the preemptible RCU implementation.