Tensorium — SIMD Module

This directory contains the low-level SIMD backend used throughout Tensorium to accelerate vector, matrix, and tensor operations. It provides abstraction layers over the AVX, AVX2, AVX512, and SSE instruction sets, with automatic runtime detection and alignment-aware memory allocation. There is no need to write separate code for each ISA: all operations go through SimdTraits, which selects the best supported instruction set and falls back gracefully on older CPUs.

Purpose

The module enables:

  • High-performance vectorized operations on aligned data
  • Support for AVX, SSE, and AVX512 with fallback logic
  • Architecture-specific runtime dispatching
  • Cache hierarchy introspection and memory-aware optimization

Structure

SIMD/
├── SIMD.hpp // Abstractions for AVX, AVX512, and SSE intrinsics
├── Allocator.hpp // Aligned memory allocation (posix_memalign / hbwmalloc)
├── CPU_id.hpp // CPU feature detection (AVX, AVX2, AVX512F, SSE4.2, etc.)
└── CacheInfo.hpp // L1/L2/L3 cache size detection and prefetch tuning

Features

SIMD.hpp

  • SimdTraits<T, ISA> specializations for scalar types (float, double, size_t)
  • Unified interface for:
    • vector loads/stores (load, store, setzero, broadcast)
    • arithmetic (add, sub, mul, fmadd)
    • reductions (horizontal_add, etc.)
  • Complex number support
  • Automatic selection of the best ISA at runtime or compile time
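
To make the traits pattern concrete, here is a minimal sketch of what an AVX specialization for float could look like. This is illustrative only: the AVX tag type and member names are assumptions based on the interface listed above, not the actual definitions in SIMD.hpp.

#include <immintrin.h>
#include <cstddef>

struct AVX {};  // hypothetical ISA tag; the real tag type lives in SIMD.hpp

template <typename T, typename ISA> struct SimdTraits;

// Sketch of an AVX (256-bit) specialization for float: 8 lanes per register.
template <> struct SimdTraits<float, AVX> {
    using reg = __m256;
    static constexpr std::size_t width = 8;

    static reg load(const float *p)       { return _mm256_load_ps(p); }   // aligned load
    static void store(float *p, reg v)    { _mm256_store_ps(p, v); }      // aligned store
    static reg setzero()                  { return _mm256_setzero_ps(); }
    static reg broadcast(float x)         { return _mm256_set1_ps(x); }
    static reg add(reg a, reg b)          { return _mm256_add_ps(a, b); }
    static reg sub(reg a, reg b)          { return _mm256_sub_ps(a, b); }
    static reg mul(reg a, reg b)          { return _mm256_mul_ps(a, b); }
    static reg fmadd(reg a, reg b, reg c) { return _mm256_fmadd_ps(a, b, c); }  // a*b + c, requires FMA
};

Kernels written against SimdTraits compile identically for every specialization, which is what removes the need for per-ISA code.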

Allocator.hpp

  • Provides aligned_allocator<T> for STL compatibility
  • Uses posix_memalign or hbwmalloc (on Xeon Phi)
  • Ensures 32-byte or 64-byte alignment depending on ISA
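
For illustration, a minimal posix_memalign-based sketch of such an allocator is shown below; the actual aligned_allocator<T> in Allocator.hpp may differ in detail (hbwmalloc support, ISA-dependent alignment).

#include <cstddef>
#include <cstdlib>   // posix_memalign, free (POSIX)
#include <new>       // std::bad_alloc
#include <vector>

// Hypothetical sketch of an STL-compatible aligned allocator.
template <typename T>
struct aligned_allocator {
    using value_type = T;
    static constexpr std::size_t alignment = 64;  // large enough for AVX512

    aligned_allocator() = default;
    template <typename U> aligned_allocator(const aligned_allocator<U> &) noexcept {}

    T *allocate(std::size_t n) {
        void *p = nullptr;
        if (posix_memalign(&p, alignment, n * sizeof(T)) != 0)
            throw std::bad_alloc();
        return static_cast<T *>(p);
    }
    void deallocate(T *p, std::size_t) noexcept { std::free(p); }
};

template <typename T, typename U>
bool operator==(const aligned_allocator<T> &, const aligned_allocator<U> &) { return true; }
template <typename T, typename U>
bool operator!=(const aligned_allocator<T> &, const aligned_allocator<U> &) { return false; }

// Usage: the vector's data is 64-byte aligned, safe for aligned SIMD loads.
std::vector<float, aligned_allocator<float>> buffer(1024);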

CPU_id.hpp

  • Detects CPU SIMD capabilities via CPUID
  • Flags: SSE4.2, AVX, AVX2, AVX512F, AVX512DQ, etc.
  • Used for runtime dispatching in templated kernels
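
CPU_id.hpp reads CPUID leaves directly; as a rough functional equivalent, the same dispatch decision can be sketched with the GCC/Clang CPU-detection builtins (illustrative only, not the Tensorium API):

#include <cstdio>

// Pick a kernel path from runtime CPU features, most capable first.
int main() {
    __builtin_cpu_init();  // populate the feature cache (GCC/Clang)
    if (__builtin_cpu_supports("avx512f"))
        std::puts("dispatch: AVX512 kernels");
    else if (__builtin_cpu_supports("avx2"))
        std::puts("dispatch: AVX2 kernels");
    else if (__builtin_cpu_supports("avx"))
        std::puts("dispatch: AVX kernels");
    else if (__builtin_cpu_supports("sse4.2"))
        std::puts("dispatch: SSE4.2 kernels");
    else
        std::puts("dispatch: scalar fallback");
    return 0;
}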

CacheInfo.hpp

  • Detects L1, L2, and L3 cache sizes per core
  • Used to tune matrix blocking and tiling strategies
  • Helps choose loop unrolling depth based on available cache
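
As an illustration of how a detected L1 size can drive blocking (a sketch only: CacheInfo.hpp queries CPUID, while this example uses the Linux/glibc sysconf interface):

#include <cstddef>
#include <unistd.h>  // sysconf (Linux/glibc)

// Choose a square tile dimension so that three float tiles (the A, B, C
// blocks of a matrix product) fit in the L1 data cache together.
std::size_t tile_dim_for_l1() {
    long l1 = sysconf(_SC_LEVEL1_DCACHE_SIZE);  // bytes; non-positive if unknown
    if (l1 <= 0)
        l1 = 32 * 1024;                         // conservative default
    std::size_t budget = static_cast<std::size_t>(l1) / (3 * sizeof(float));
    std::size_t dim = 8;                        // keep a multiple of the lane count
    while ((dim + 8) * (dim + 8) <= budget)
        dim += 8;
    return dim;                                 // e.g. 48 for a 32 KiB L1
}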

Example Usage

#include <cstddef>
#include <iostream>

using namespace tensorium::simd;

using T = float;
using Simd = SimdTraits<T, DefaultISA>;
using reg = typename Simd::reg;

// Simd::width is the number of lanes per register (e.g. 8 floats for
// AVX, 16 for AVX512), so these buffers work for any selected ISA.
alignas(64) T a[Simd::width];
alignas(64) T b[Simd::width];
alignas(64) T result[Simd::width];

for (std::size_t i = 0; i < Simd::width; ++i) {
    a[i] = static_cast<T>(i + 1);            // 1, 2, 3, ...
    b[i] = static_cast<T>(Simd::width - i);  // width, width-1, ...
}

reg va = Simd::load(a);      // aligned vector load
reg vb = Simd::load(b);
reg vr = Simd::add(va, vb);  // element-wise addition
Simd::store(result, vr);     // aligned vector store

for (std::size_t i = 0; i < Simd::width; ++i)
    std::cout << "result[" << i << "] = " << result[i] << "\n";
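
With this initialization a[i] + b[i] equals Simd::width + 1 in every lane, so all printed values match (9 on an 8-lane AVX build). Compile with the flag for the ISA actually selected, e.g. -mavx2.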

Status

Fully functional and integrated into all math kernels of Tensorium, including Vector, Matrix, Tensor, and the numerical methods. A portability fallback to SSE is provided for older CPUs, and the AVX512 path is fully optimized.

Supported Architectures

  • SSE4.2 (128-bit)
  • AVX / AVX2 (256-bit)
  • AVX512F / AVX512DQ (512-bit)
  • Optional hbw (High Bandwidth Memory) detection on Xeon Phi (KNL)
  • aarch64 NEON support is coming soon
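
Assuming a standard GCC or Clang toolchain, each path is typically enabled at compile time with the matching ISA flags (illustrative, not a Tensorium build requirement; kernel.cpp is a placeholder):

g++ -O3 -msse4.2 kernel.cpp              # SSE4.2 (128-bit) path
g++ -O3 -mavx2 -mfma kernel.cpp          # AVX / AVX2 (256-bit) path
g++ -O3 -mavx512f -mavx512dq kernel.cpp  # AVX512 (512-bit) path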