This directory contains the low-level SIMD backend used throughout Tensorium to accelerate vector, matrix, and tensor operations. It provides abstraction layers over the SSE, AVX, AVX2, and AVX512 instruction sets, with automatic runtime detection and alignment-aware memory allocation. There is no need to write separate code for each ISA: all operations are unified through `SimdTraits`, which falls back gracefully to the best supported instruction set.
## Purpose
The module enables:
- High-performance vectorized operations on aligned data
- Support for AVX, SSE, and AVX512 with fallback logic
- Architecture-specific runtime dispatching
- Cache hierarchy introspection and memory-aware optimization
## Structure

```
SIMD/
├── SIMD.hpp       // Abstractions for AVX, AVX512, and SSE intrinsics
├── Allocator.hpp  // Aligned memory allocation (posix_memalign / hbwmalloc)
├── CPU_id.hpp     // CPU feature detection (AVX, AVX2, AVX512F, SSE4.2, etc.)
└── CacheInfo.hpp  // L1/L2/L3 cache size detection and prefetch tuning
```
## Features

### SIMD.hpp
- `SimdTraits<T, ISA>` specializations for scalar types (`float`, `double`, `size_t`)
- Unified interface for:
  - vector loads/stores (`load`, `store`, `setzero`, `broadcast`)
  - arithmetic (`add`, `sub`, `fmadd`, `mul`)
  - reductions (`horizontal_add`, etc.)
- Complex numbers are supported
- Auto-selection of the best ISA at runtime or compile time (an illustrative specialization is sketched below)
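
To make the interface concrete, here is a minimal sketch of what a `float`/AVX specialization could look like. This is an illustration only: the `avx` ISA tag and the exact member set are assumptions inferred from the list above, not the definitions in `SIMD.hpp`.

```cpp
// Illustrative sketch only; the real traits live in SIMD.hpp.
#include <immintrin.h>
#include <cstddef>

namespace tensorium::simd {

struct avx {}; // hypothetical ISA tag for this sketch

template <typename T, typename ISA> struct SimdTraits;

template <> struct SimdTraits<float, avx> {
    using reg = __m256;
    static constexpr std::size_t width = 8; // 256 bits / 32-bit lanes

    static reg  load(const float *p)       { return _mm256_load_ps(p); }   // aligned load
    static void store(float *p, reg v)     { _mm256_store_ps(p, v); }      // aligned store
    static reg  setzero()                  { return _mm256_setzero_ps(); }
    static reg  broadcast(float x)         { return _mm256_set1_ps(x); }
    static reg  add(reg a, reg b)          { return _mm256_add_ps(a, b); }
    static reg  sub(reg a, reg b)          { return _mm256_sub_ps(a, b); }
    static reg  mul(reg a, reg b)          { return _mm256_mul_ps(a, b); }
    static reg  fmadd(reg a, reg b, reg c) { return _mm256_fmadd_ps(a, b, c); } // a*b + c, requires FMA

    static float horizontal_add(reg v) {
        __m128 lo = _mm256_castps256_ps128(v);
        __m128 hi = _mm256_extractf128_ps(v, 1);
        lo = _mm_add_ps(lo, hi);  // fold 8 lanes to 4
        lo = _mm_hadd_ps(lo, lo); // 4 to 2
        lo = _mm_hadd_ps(lo, lo); // 2 to 1
        return _mm_cvtss_f32(lo);
    }
};

} // namespace tensorium::simd
```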
### Allocator.hpp
- Provides `aligned_allocator<T>` for STL compatibility (usage sketch below)
- Uses `posix_memalign` or `hbwmalloc` (on Xeon Phi)
- Ensures 32-byte or 64-byte alignment depending on the ISA
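
A minimal usage sketch, assuming `aligned_allocator` lives in the `tensorium::simd` namespace like the rest of the backend (check `Allocator.hpp` for the exact signature):

```cpp
#include <vector>
#include "Allocator.hpp"

int main() {
    // A std::vector whose storage starts on a suitably aligned boundary,
    // so aligned SIMD loads/stores can be used on data.data() directly.
    std::vector<float, tensorium::simd::aligned_allocator<float>> data(1024, 1.0f);
    data[0] = 42.0f; // behaves like an ordinary std::vector otherwise
}
```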
### CPU_id.hpp
- Detects CPU SIMD capabilities via CPUID
- Flags: SSE4.2, AVX, AVX2, AVX512F, AVX512DQ, etc.
- Used for runtime dispatching in templated kernels (see the dispatch sketch below)
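
As a rough picture of the dispatch this enables, the sketch below selects a kernel tier from the detected feature set. To keep the example self-contained it uses GCC/Clang's `__builtin_cpu_supports` rather than the actual `CPU_id.hpp` API, whose function names may differ.

```cpp
#include <cstdio>

int main() {
    // Pick the widest kernel tier the running CPU supports.
    if (__builtin_cpu_supports("avx512f"))
        std::puts("dispatch: AVX512 kernels");
    else if (__builtin_cpu_supports("avx2"))
        std::puts("dispatch: AVX2 kernels");
    else if (__builtin_cpu_supports("sse4.2"))
        std::puts("dispatch: SSE4.2 kernels");
    else
        std::puts("dispatch: scalar fallback");
}
```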
### CacheInfo.hpp
- Detects L1, L2, and L3 cache sizes per core
- Used to tune matrix blocking and tiling strategies
- Helps choose loop-unrolling depth based on available cache (see the blocking sketch below)
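
A sketch of how a detected cache size can drive blocking: the heuristic below sizes a square tile so that three `float` tiles (the A, B, and C blocks of a matrix product) fit in L1 at once. The 32 KiB figure is only a stand-in for whatever `CacheInfo.hpp` reports at runtime.

```cpp
#include <cstddef>
#include <cmath>

std::size_t pick_block_size(std::size_t l1_bytes) {
    // Keep three float tiles resident: 3 * block^2 * sizeof(float) <= l1_bytes
    return static_cast<std::size_t>(
        std::sqrt(static_cast<double>(l1_bytes) / (3.0 * sizeof(float))));
}

int main() {
    const std::size_t l1 = 32 * 1024;              // e.g. 32 KiB L1d, normally queried
    const std::size_t block = pick_block_size(l1); // ~52 for a 32 KiB L1d
    (void)block;
}
```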
## Example Usage
```cpp
#include <cstddef>
#include <iostream>

#include "SIMD.hpp"

using namespace tensorium::simd;
using T = float;
using Simd = SimdTraits<T, DefaultISA>;
using reg = typename Simd::reg;

int main() {
    // Initializers assume Simd::width == 8 (e.g. AVX with float);
    // adjust if DefaultISA resolves to a different width on your machine.
    alignas(64) T a[Simd::width] = {1.0f, 2.0f, 3.0f, 4.0f,
                                    5.0f, 6.0f, 7.0f, 8.0f};
    alignas(64) T b[Simd::width] = {8.0f, 7.0f, 6.0f, 5.0f,
                                    4.0f, 3.0f, 2.0f, 1.0f};
    alignas(64) T result[Simd::width];

    reg va = Simd::load(a);     // aligned vector load
    reg vb = Simd::load(b);
    reg vr = Simd::add(va, vb); // lane-wise addition
    Simd::store(result, vr);    // aligned vector store

    for (std::size_t i = 0; i < Simd::width; ++i)
        std::cout << "result[" << i << "] = " << result[i] << "\n";
}
```
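
Because the three buffers are declared `alignas(64)`, the aligned `load`/`store` paths are valid on every supported ISA up to AVX512; heap-allocated data gets the same guarantee through the `aligned_allocator<T>` described above.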
## Status
Fully functional and integrated into all math kernels of Tensorium, including `Vector`, `Matrix`, `Tensor`, and the numerical methods. A fallback to SSE provides portability on older CPUs, and the AVX512 path is fully optimized.
## Supported Architectures
- SSE4.2 (128-bit)
- AVX / AVX2 (256-bit)
- AVX512F / AVX512DQ (512-bit)
- Optional hbw (High Bandwidth Memory) detection on Xeon Phi (KNL)
- aarch64 NEON support is incoming!