# RECORD SIMULATION OF THE FULL-DENSITY SPIKING POTJANS-DIESMANN-MICROCIRCUIT MODEL ON THE IBM NEURAL SUPERCOMPUTER INC 3000

Arne Heittmann<sup>1</sup>, Georgia Psychou<sup>1</sup>, Guido Trensch<sup>2</sup>, Charles E. Cox<sup>3</sup>, Winfried W. Wilcke<sup>3</sup>, Markus Diesmann<sup>4</sup> and Tobias G. Noll<sup>1</sup>

March 30 2022

#### ACA – towards multi-scale natural-density Neuromorphic Computing

<sup>1</sup>JARA-Institute Green IT (PGI-10), Jülich Research Centre, D-52425 Jülich, Germany

<sup>2</sup>Simulation & Data Lab Neuroscience, Jülich Supercomputing Centre, Jülich Research Centre, D-52425 Jülich, Germany

<sup>3</sup>IBM Research Division, Almaden Research Center, San Jose, CA 9512

<sup>4</sup>Institute of Neuroscience and Medicine (INM-6)













# **NEUROMORPHIC COMPUTING**

• towards advanced general purpose computing architectures



#### Neuroscience

- dynamics of natural brains
- function of natural brains
- learning, plasticity, development

#### **Artificial General Intelligence (AGI)**

- deep learning using few samples
- ability for contextual adaptation
- ability to explain decisions in natural language





## **ACA – ADVANCED COMPUTING ARCHITECTURES**

#### www.fz-juelich.de/aca

towards multi-scale natural-density neuromorphic computing

- Pilot project preparing a long-term research initiative in the application area of *Neuroscience Simulation*
- Specification of a neuromorphic computing architecture for accelerating simulation experiments



Requirements, Validation & Benchmarking



System Definition, Integration & Operation

Network Connectivity & Communication



**Accelerated Numerics** 





#### **SCALES OF BRAIN ORGANIZATION**







# **Trends in Neuroscience Simulation Experiments**



#### THE CORTICAL MICROCIRCUIT

Cerebral Cortex March 2014;24:785–806 doi:10.1093/cercor/bhs358 Advance Access publication December 2, 2012

#### The Cell-Type Specific Cortical Microcircuit: Relating Structure and Activity in a Full-Scale Spiking Network Model

Tobias C. Potjans<sup>1,2,3</sup> and Markus Diesmann<sup>1,2,4,5</sup>





Source: Tony Mosconi, Victoria Graham: Neuroscience for Rehabilitation Copyright © McGraw-Hill Education. All rig

**JÜLICH** Forschungszentrum

#### THE IBM INC 3000 NEURAL SUPERCOMPUTER

- Originally developed for the IBM AGI (Artificial General Intelligence) Project (IBM Research, Almaden, CA)
- Platform for development and evaluation of prototypical circuit and architecture concepts for ACA



Communication Network



**Communication Network Topology:** 

12 x 12 x 3 3D-node mesh

#### **Cross-sectional bandwidth:**

B = 450 Gb/s

Worst-case packet path (between A and B): latency T = 24 µs

- 16 x INC board
- 432 SoC Nodes
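As a quick consistency check on the topology figures above, the following sketch computes the node count and the longest shortest path of a 12 x 12 x 3 mesh. It assumes a mesh without wrap-around links and dimension-ordered (Manhattan) routing, neither of which is stated on the slide; `mesh_stats` is a hypothetical helper name.

```python
def mesh_stats(dims=(12, 12, 3)):
    """Return (node count, worst-case hop count) of a 3D mesh.

    Assumption: no wrap-around links, so the longest shortest path runs
    corner to corner and needs (d - 1) hops along each dimension.
    """
    nodes = 1
    for d in dims:
        nodes *= d
    max_hops = sum(d - 1 for d in dims)
    return nodes, max_hops


nodes, max_hops = mesh_stats()
print(nodes, max_hops)
```

The node count matches the 432 SoC nodes listed above. If the worst-case latency of 24 µs refers to such a corner-to-corner path, it would correspond to roughly 1 µs per hop; this is an inference under the stated assumptions, not a figure from the slides.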



#### NODE SCHEMATICS, MODELS, AND DESIGN FLOW



- 256-Neuron-Fascicle per Node
- Various Point-Neuron Models
  - LIF, MAT2, Izhikevich, AdEx
- Several Synapse Models
  - CUBA- and COBA-based,
  - Exponential Decay-, Alpha-, Beta-Shaped

- Several ODE System Solvers
  - Exact Exponential
  - Runge-Kutta
  - Parker-Sochacki
- Network generation "on the fly" during simulation run time, based on highly efficient PRNGs ("Procedural Connectivity")
- Node synchronization by barrier messages to avoid spike loss
- Single-precision floating-point arithmetic
- High-Level-Synthesis Design Flow
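The barrier-based node synchronization can be illustrated with a small software analogy. The sketch below uses threads and `threading.Barrier` in place of hardware barrier messages on the mesh; the node count, step count, and the all-to-all spike pattern are illustrative assumptions, not properties of the INC 3000.

```python
import threading

N_NODES, N_STEPS = 4, 3  # illustrative sizes, not taken from the slides
barrier = threading.Barrier(N_NODES)
lock = threading.Lock()
# inbox[node][step] holds the spikes that become due at that step
inbox = [[[] for _ in range(N_STEPS + 1)] for _ in range(N_NODES)]
received = [0] * N_NODES


def node(rank):
    for step in range(N_STEPS):
        # Consume all spikes that were delivered for this time step.
        received[rank] += len(inbox[rank][step])
        # Emit one spike to every other node, due in the next step.
        for dst in range(N_NODES):
            if dst != rank:
                with lock:
                    inbox[dst][step + 1].append((rank, step))
        # No node enters step + 1 before every node has delivered its
        # spikes of the current step, so no spike can be missed.
        barrier.wait()


threads = [threading.Thread(target=node, args=(r,)) for r in range(N_NODES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(received)
```

Without the barrier, a fast node could read its inbox for the next step before slower nodes had written to it, which is the spike-loss scenario the barrier avoids.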



# **ODE SOLVER**



$$\tau_m \cdot \frac{dV_q}{dt} = -(V_q - E_L) + R_M \cdot I_{M,q}$$



- lumped CUBA-synapses
  - exponential decay

$$\tau_{s,x} \cdot \frac{dI_{M,q,x}}{dt} = -I_{M,q,x} + I_{S,q,x} \quad , \ x \in \{e,i\}$$

- State-Vector update
  - exact exponential for linear ODEs
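Because both equations above are linear, one time step can be computed in closed form. The following is a minimal sketch of such an exact-exponential update for the three state variables $(V_q, I_{M,q,e}, I_{M,q,i})$ between incoming spikes. The parameter values are illustrative assumptions, threshold and reset handling are omitted, and the formula requires $\tau_{s,x} \neq \tau_m$.

```python
import math


def exact_step(v, i_e, i_i, h,
               tau_m=10.0, tau_e=0.5, tau_i=0.5, r_m=0.04, e_l=-65.0):
    """Advance (V, I_e, I_i) by h with no incoming spikes (I_S = 0).

    All default parameter values are illustrative assumptions.
    Requires tau_e != tau_m and tau_i != tau_m.
    """
    dm = math.exp(-h / tau_m)   # membrane decay factor
    de = math.exp(-h / tau_e)   # excitatory synaptic current decay
    di = math.exp(-h / tau_i)   # inhibitory synaptic current decay
    # Coupling of each decaying synaptic current into the membrane voltage
    ce = r_m * tau_e * (de - dm) / (tau_e - tau_m)
    ci = r_m * tau_i * (di - dm) / (tau_i - tau_m)
    v_new = e_l + dm * (v - e_l) + ce * i_e + ci * i_i
    return v_new, de * i_e, di * i_i


one_big = exact_step(-60.0, 100.0, -50.0, 0.2)
two_small = exact_step(*exact_step(-60.0, 100.0, -50.0, 0.1), 0.1)
print(one_big, two_small)
```

Since the update is exact, two steps of size h give the same state as one step of size 2h up to rounding. The decay and coupling factors depend only on h and the time constants, so they can be precomputed once per neuron population.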



synthesized datapaths, pipelined





update equation, 3 state variables per neuron



### LUMPED SYNAPSES



synaptic equation







### **PROCEDURAL CONNECTIVITY**



Parameters of a synapse *C*: axonal + dendritic delay, weight, postsynaptic neuron



example network



 (naive) memory layout on the receiving node



#### **PROCEDURAL CONNECTIVITY**





- minimal time from "address valid" to "first data out"



### **NETWORK GENERATION**

**Microcircuit:** connections *C* are defined by pseudo-random numbers





- In NEST: all synapse parameters are drawn offline using pseudo-random-number generators (PRNGs)
- Alternative approach: implement these random number generators on chip
  - re-generate the synaptic parameters "on-the-fly", when needed
  - initial seeds define the deterministic sequence of random numbers
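The idea of regenerating synapse parameters from a seed can be sketched in a few lines. The example below uses Python's software `random.Random` as a stand-in for the on-chip PRNGs; the function name `synapses_of`, the seed derivation, the network size, and the weight and delay distributions are all illustrative assumptions.

```python
import random


def synapses_of(pre_id, base_seed=1234, n_neurons=1000, out_degree=5):
    """Regenerate the outgoing synapses of neuron `pre_id` on demand.

    Nothing is stored: the same (base_seed, pre_id) pair always yields
    the same (target, weight, delay) tuples. All values are illustrative.
    """
    rng = random.Random(base_seed * 1_000_003 + pre_id)  # per-neuron seed
    synapses = []
    for _ in range(out_degree):
        target = rng.randrange(n_neurons)        # postsynaptic neuron
        weight = rng.gauss(87.8, 8.78)           # synaptic weight
        delay = max(0.1, rng.gauss(1.5, 0.75))   # axonal + dendritic delay
        synapses.append((target, weight, delay))
    return synapses


# Calling twice reproduces the identical parameter list.
print(synapses_of(7) == synapses_of(7))
```

The trade-off is recomputation instead of memory access: each spike triggers a fresh draw of its synapse parameters, which removes the need to fetch them from a synapse table.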



#### **PROCEDURAL CONNECTIVITY**




seed-lookup tables


• RTR data path

RTR control loop



Walker, A. J., "An Efficient Method for Generating Discrete Random Variables with General Distributions". ACM Transactions on Mathematical Software. 3 (3): 253-256, 1977, doi:10.1145/355744.355749

- latency: 3 cycles @ 150 MHz
- fully pipelined

#### Gate and Memory Breakdown (LIF, CUBA)

Overall FPGA resources



### LATENCY-BREAKDOWN : THE MICRO-CIRCUIT



- procedural connectivity: 20 % speedup over external DRAM
  - speedup of 4.06× over biological real time (BRT)




### **TOWARDS REALISTIC NEURON MODELS**

| | Galvanic interconnect | The 'common case' | Computational Complexity |
|---|---|---|---|
| | point-neuron | <ul><li>Linear point-neuron dynamics</li><li>No dendritic tree (galvanic interconnect)</li></ul> | <ul><li>ODE integration by the exponential Euler method</li><li>High degree of parallelism</li></ul> |
| compartment | RCG-T-section | <ul><li>Nonlinear neuron dynamics (e.g. Izhikevich, AdEx, HH)</li><li>Compartmental dendritic tree, model of 'the dendritic computational toolkit' *)</li></ul>*) M. London, M. Häusser, Annu. Rev. Neurosci. 2005. 28:503-532 | <ul><li>ODE integration by advanced numeric methods</li><li>Limited degree of parallelism and significant critical paths in arithmetic</li><li>Particular strategies for stiff problems</li></ul> |




[Figure: compartment circuit; labels E<sub>ex</sub>, E<sub>inh</sub>, E<sub>link,i</sub>, g<sub>ex,i</sub>, S<sub>inh,i</sub>, S<sub>link,i+1</sub>]

#### **TOWARDS REALISTIC NEURON MODELS**





### **ODE SOLVERS**






# SPEED-UP FACTOR $G_{BRT}$ FOR VARIOUS MODELS



Exact Integration (applies for linear ODEs only)

 $V(t+h) = e^{Ah} \cdot V(t)$ 

S-step Runge-Kutta Method (RK-S)

$$k_{j} = f\left(t + h \cdot c_{j},\; V(t) + h \cdot \sum_{l=1}^{s} a_{jl} \cdot k_{l}\right), \quad j = 1,\dots,s$$
$$V(t+h) = V(t) + h \cdot \sum_{j=1}^{s} b_{j} \cdot k_{j}$$
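The Runge-Kutta update can be written once for any explicit coefficient set $(a_{jl}, b_j, c_j)$. The sketch below uses the classical four-stage tableau as an example and applies it to a leaky membrane equation. The tableau choice, the helper names, and the parameter values are assumptions for illustration. For an explicit method $a_{jl} = 0$ for $l \geq j$, so the inner sum only runs over stages that are already computed.

```python
import math

# Classical RK4 Butcher tableau (explicit: A[j][l] = 0 for l >= j)
A = [[0.0, 0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0, 0.0],
     [0.0, 0.5, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
B = [1 / 6, 1 / 3, 1 / 3, 1 / 6]
C = [0.0, 0.5, 0.5, 1.0]


def rk_step(f, t, v, h, a=A, b=B, c=C):
    """One explicit Runge-Kutta step for dv/dt = f(t, v)."""
    k = []
    for j in range(len(b)):
        # Stage value uses only the previously computed slopes k[0..j-1].
        v_j = v + h * sum(a[j][l] * k[l] for l in range(j))
        k.append(f(t + h * c[j], v_j))
    return v + h * sum(b_j * k_j for b_j, k_j in zip(b, k))


def leak(t, v, tau_m=10.0, e_l=-65.0):
    """Leaky membrane without input; illustrative parameter values."""
    return -(v - e_l) / tau_m


v1 = rk_step(leak, 0.0, -55.0, 0.1)
exact = -65.0 + 10.0 * math.exp(-0.1 / 10.0)
print(v1, exact)
```

The stages have to be evaluated one after another, which is one source of the critical paths in arithmetic that limit parallelism for nonlinear models.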

S-step Parker-Sochacki Method (PS-S)

$$V_{i}(t+h) = V_{i}(t) + \frac{V_{i}'(t)}{1!} \cdot h + \frac{V_{i}''(t)}{2!} \cdot h^{2} + \frac{V_{i}'''(t)}{3!} \cdot h^{3} + \dots + \frac{V_{i}^{(s)}(t)}{s!} \cdot h^{s}$$
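In the Parker-Sochacki method the Taylor coefficients $V_i^{(k)}(t)/k!$ are not obtained by symbolic differentiation. They follow from a recurrence in which products of state variables become Cauchy products of their coefficient series. The sketch below shows this for the toy equation $dy/dt = y^2$, chosen here as an assumption because it has a quadratic term like the $v^2$ term of the Izhikevich model and a known solution $y(t) = y_0 / (1 - y_0 t)$.

```python
def ps_step(y, h, order=10):
    """One Parker-Sochacki step of the given order for dy/dt = y**2.

    coeff[k] is the k-th Taylor coefficient y^(k)(t) / k!.
    """
    coeff = [y]
    for k in range(order):
        # The series of y*y is the Cauchy product of the series of y.
        cauchy = sum(coeff[j] * coeff[k - j] for j in range(k + 1))
        coeff.append(cauchy / (k + 1))
    # Evaluate the truncated power series at h with Horner's scheme.
    result = 0.0
    for a_k in reversed(coeff):
        result = result * h + a_k
    return result


approx = ps_step(1.0, 0.1)
exact = 1.0 / (1.0 - 0.1)
print(approx, exact)
```

The order can be raised by extending the same recurrence, so accuracy is adjustable without changing the structure of the computation.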




#### **Trends in State-of-the-Art Communication Standards**



Latency grows with the distance between compute nodes

- Avoid large latencies by
  - few hubs between nodes
  - dense (3D-stacked) circuit integration
- Conceptually, this calls for highly integrated components separated by short physical distances



## **Trends in State-of-the-Art DRAM Performance**



K. K. Chang et al., "Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization", doi:10.1145/2896377.2901453 (2016)

- No significant trend in access latency reduction
- Neuromorphic Computing:
  - **Plasticity** will incur significant traffic to the memory system
  - the latency issue could be solved, e.g., by using an embedded DRAM technology in a dedicated accelerator circuit

[Figure: CAS latency t<sub>CL</sub> [ns]; performance data from 54 different commercially available SDRAMs (2022)]



### CONCLUSION

- Components in conventional HPC systems are optimized for both large data packets and large communication & memory bandwidth
  - $\rightarrow$ results in **large latencies** for memory accesses and packet transmission

- Requirements for a future accelerated neuromorphic compute platform
  - communication of small data packets (spikes)
  - fully random memory access to **small data packets** (e.g. synapse parameters) with very restricted locality (caches won't work efficiently)
  - ultra-short latencies for communication and memory access

The requirements of a future NC platform and of HPC are complementary

- New memory architecture : near-memory computation with short memory latencies
- Hierarchical networks with ultra-short communication latency
- *High-package density (2.5D/3D stacked silicon on interposer)*






#### THE LONG-TERM GOAL: SUSTAINABLY STAY AHEAD



"Hybrid Neuromorphic-von-Neumann general purpose computing"