Environment:
Hardware: Mac M4
OS: macOS Sequoia 15.7.4
TensorFlow-macOS Version: 2.16.2
TensorFlow-metal Version: 1.2.0
Description:
When using the tensorflow-metal plug-in for GPU acceleration on M4, the ReLU activation function (both as a layer and as an activation argument) fails to correctly clip negative values to zero. The same code works correctly when forced to run on the CPU.
Reproduction Script:
import os
import numpy as np
import tensorflow as tf
# weights and biases = -1
weights = [np.ones((10, 5)) * -1, np.ones(5) * -1]
# input = 1
data = np.ones((1, 10))
# comment this line => GPU => get negative values
# uncomment this line => CPU => no negative values
# tf.config.set_visible_devices([], 'GPU')
# create model
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(5, activation='relu')
])
# set weights
model.layers[0].set_weights(weights)
# get output
output = model.predict(data)
# check if negative is present
print(f"min value: {output.min()}")
print(f"is negative present? {np.any(output < 0)}")
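For reference, the expected output of the script can be worked out by hand. The sketch below recomputes the layer in plain Python, assuming the Dense layer computes relu(x · W + b); the helper name dense_relu is illustrative and not part of any library.

```python
# Reference computation for the reproduction script, without TensorFlow.
# Assumption: Dense(5, activation='relu') computes relu(x @ W + b).

def dense_relu(x, weights, bias):
    """Apply one dense layer followed by ReLU to a single input row."""
    out = []
    for j in range(len(bias)):
        pre = sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        out.append(max(0.0, pre))  # ReLU clips negative values to zero
    return out

x = [1.0] * 10                       # input of ones, as in the script
W = [[-1.0] * 5 for _ in range(10)]  # 10 x 5 kernel, all -1
b = [-1.0] * 5                       # bias, all -1

expected = dense_relu(x, W, b)
print(expected)  # each pre-activation is -11, so ReLU yields 0.0
```

Any negative value in the model output therefore means the activation was not applied as specified.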
ML Compute
Accelerate training and validation of neural networks using the CPU and GPUs.
Posts under ML Compute tag
23 Posts
Subject: Technical Report: Float32 Precision Ceiling & Memory Fragmentation in JAX/Metal Workloads on M3
To: Metal Developer Relations
Hello,
I am reporting a repeatable numerical saturation point encountered during sustained recursive high-order differential workloads on the Apple M3 (16 GB unified memory) using the JAX Metal backend.
Workload Characteristics:
Large-scale vector projections across multi-dimensional industrial datasets
Repeated high-order finite-difference calculations
Heavy use of jax.grad and lax.cond inside long-running loops
Observation:
Under these conditions, the Metal/MPS backend consistently enters a terminal quantization lock where outputs saturate at a fixed scalar value (2.0000), followed by system-wide NaN propagation. This appears to be a precision-limited boundary in the JAX-Metal bridge when handling high-order operations with cubic time-scale denominators.
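To make the precision concern concrete, here is a rough sketch of how float32 round-off grows when a difference quotient divides by a power of the step size. It uses the standard rule of thumb that the round-off error of an n-th order finite difference scales like eps / h^n; the function name and step sizes are illustrative assumptions, not values from the workload described above.

```python
# Rule-of-thumb estimate of round-off amplification in finite differences.
# Assumption: error scales like eps / h**order for an order-th difference.

FLOAT32_EPS = 2.0 ** -23  # machine epsilon of IEEE 754 binary32

def roundoff_amplification(step, order):
    """Approximate round-off error, relative to |f|, of an order-th difference."""
    return FLOAT32_EPS / step ** order

for h in (1e-1, 1e-2, 1e-3):
    amp = roundoff_amplification(h, order=3)
    print(f"h = {h:g}: third-order round-off factor ~ {amp:.3g}")
```

With a cubic denominator, the estimate exceeds 1 well before h reaches 1e-3, so results at that scale carry no reliable digits in float32 regardless of backend.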
I have identified the specific threshold where recursive high-order tensor derivatives exceed the numerical resolution of 32-bit consumer architectures, necessitating a migration to a dedicated 64-bit industrial stack.
I have prepared a minimal synthetic test script (randomized vectors only, no proprietary logic) that reliably reproduces the allocator fragmentation and saturation behavior. Let me know if your team would like the telemetry for XLA/MPS optimization purposes.
Best regards,
Alex Severson
Architect, QuantumPulse AI
Hi everyone,
I am developing a benchmarking tool to measure memory latency (L1/L2/DRAM) on Apple Silicon. I am currently using Xcode Instruments (CPU Counters) to validate my results.
In my latest run for a 128 MB buffer with random access, Instruments shows:
Latency (cycles): ~259 cycles (derived from LDST_UNIT_OLD_L1D_CACHE_MISS / L1D_CACHE_MISS_LD).
Manual Timer Result: ~80 ns.
To correlate these two values, I need the exact CPU Frequency (GHz) at the time of the sample.
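As a sanity check, the two measurements already imply a frequency. The sketch below assumes the counter-derived 259 cycles and the timer's 80 ns describe the same average load latency; the function names are illustrative.

```python
# Convert between cycle counts and wall-clock latency.
# Assumption: both measurements describe the same average event.

def implied_frequency_ghz(cycles, nanoseconds):
    """Frequency (GHz) at which `cycles` would take `nanoseconds`."""
    return cycles / nanoseconds  # cycles per nanosecond equals GHz

def cycles_to_ns(cycles, freq_ghz):
    """Latency in nanoseconds for a cycle count at a given frequency."""
    return cycles / freq_ghz

freq = implied_frequency_ghz(259, 80)
print(f"implied frequency: {freq:.4f} GHz")
```

If the implied value is far from the frequency the core actually ran at, the two measurements are probably not timing the same thing (for example, the timer may include loop overhead).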
My Questions:
Is there a recommended way to programmatically fetch the current frequency of the Performance cores (p-cores) during a benchmark run?
Does Apple provide a "nominal" frequency value for M-series chips that we should use for cycle-to-nanosecond conversions?
In Instruments, is there a hidden counter or "Average Frequency" metric that I can enable to avoid manual math?
Hardware/Software Environment:
Tool: Instruments 26.3+ (CPU Counters Template).
Chip: A19, iPhone 17 Pro.
OS: 26.3.
Topic: Developer Tools & Services
SubTopic: Instruments
Tags: Developer Tools, ML Compute, Instruments, Kernel
After exporting a custom model with nms=True, the outputs show in Xcode as:
confidence: MultiArray (0 × 5)
coordinates: MultiArray (0 × 4)
I want to set fixed shapes (e.g., 100 × 5, 100 × 4), but Xcode does not allow editing—the shape fields are locked. The model graph shows both outputs come directly from a NonMaximumSuppression layer.
Is it possible to set fixed output dimensions for NMS outputs in CoreML?
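One possible fallback, if the NMS output shapes cannot be fixed in the model itself, is to pad the results after prediction. The sketch below is plain post-processing, not a Core ML API; pad_detections, max_boxes, and the zero fill value are assumptions for illustration.

```python
# Pad a variable number of NMS detections to a fixed row count.
# Assumption: padding happens in app code after the model returns.

def pad_detections(rows, width, max_boxes=100, fill=0.0):
    """Return exactly `max_boxes` rows of `width` values each."""
    if any(len(row) != width for row in rows):
        raise ValueError("every row must have `width` values")
    kept = [list(row) for row in rows[:max_boxes]]  # truncate extra boxes
    padding = [[fill] * width for _ in range(max_boxes - len(kept))]
    return kept + padding

confidence = pad_detections([[0.9, 0.05, 0.02, 0.02, 0.01]], width=5)
coordinates = pad_detections([[0.5, 0.5, 0.2, 0.3]], width=4)
print(len(confidence), len(coordinates))
```

Downstream code then needs a separate count of valid boxes, since padded rows are indistinguishable from all-zero detections.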
What is the difference between INST_ALL and Instructions (FIXED_INSTRUCTIONS)?
Also, what is the difference between CORE_ACTIVE_CYCLE and Cycles (FIXED_CYCLES)?
We’ve encountered what appears to be a CoreML regression between macOS 26.0.1 and macOS 26.1 Beta.
In macOS 26.0.1, CoreML models run and produce correct results. However, in macOS 26.1 Beta, the same models produce scrambled or corrupted outputs, suggesting that tensor memory is being read or written incorrectly. The behavior is consistent with a low-level stride or pointer arithmetic issue — for example, using 16-bit strides on 32-bit data or other mismatches in tensor layout handling.
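To illustrate the kind of mismatch hypothesized here, the sketch below reads a float32 buffer with a 2-byte stride instead of a 4-byte one. It is a generic demonstration in plain Python and makes no claim about what Core ML actually does internally; read_with_stride is an illustrative helper.

```python
# Reading 32-bit data with a 16-bit stride scrambles the values.
import struct

values = [1.0, 2.0, 3.0, 4.0]
buf = struct.pack("<4f", *values)  # 16 bytes of little-endian float32

def read_with_stride(buffer, count, stride_bytes):
    """Read `count` float32 values, advancing `stride_bytes` each time."""
    return [struct.unpack_from("<f", buffer, i * stride_bytes)[0]
            for i in range(count)]

correct = read_with_stride(buf, 4, stride_bytes=4)
scrambled = read_with_stride(buf, 4, stride_bytes=2)
print("stride 4:", correct)
print("stride 2:", scrambled)
```

The wrong stride yields a mix of real values and misaligned garbage, which would match output that looks scrambled rather than uniformly wrong.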
Reproduction
Install ON1 Photo RAW 2026 or ON1 Resize 2026 on macOS 26.0.1.
Use the newest Highest Quality resize model, which is Stable Diffusion–based and runs through CoreML.
Observe correct, high-quality results.
Upgrade to macOS 26.1 Beta and run the same operation again.
The output becomes visually scrambled or corrupted.
We are also seeing similar issues with another Stable Diffusion UNet model that previously worked correctly on macOS 26.0.1. This suggests the regression may affect multiple diffusion-style architectures, likely due to a change in CoreML’s tensor stride, layout computation, or memory alignment between these versions.
Notes
The affected models are exported using standard CoreML conversion pipelines.
No custom operators or third-party CoreML runtime layers are used.
The issue reproduces consistently across multiple machines.
It would be helpful to know if there were changes to CoreML’s tensor layout, precision handling, or MLCompute backend between macOS 26.0.1 and 26.1 Beta, or if this is a known regression in the current beta.
Hi everyone,
I believe I’ve encountered a potential bug or a hardware alignment limitation in the Core ML Framework / ANE Runtime specifically affecting the new Stateful API (introduced in iOS 18/macOS 15).
The Issue:
A Stateful mlprogram fails to run on the Apple Neural Engine (ANE) if the state tensor dimensions (specifically the width) are not a multiple of 32. The model works perfectly on CPU and GPU, but fails on ANE both during runtime and when generating a Performance Report in Xcode.
Error Message in Xcode UI:
"There was an error creating the performance report Unable to compute the prediction using ML Program. It can be an invalid input data or broken/unsupported model."
Observations:
Case A (Fails): State shape = (1, 3, 480, 270). Prediction fails on ANE.
Case B (Success): State shape = (1, 3, 480, 256). Prediction succeeds on ANE.
This suggests an internal memory alignment or tiling issue within the ANE driver when handling Stateful buffers that don't meet the 32-pixel/element alignment.
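If the 32-element alignment hypothesis holds, the padded width for a given state shape can be computed as below. This is a sketch of the arithmetic only; the multiple of 32 is inferred from the two cases above, not from Apple documentation.

```python
# Round a tensor width up to the next multiple of an alignment.
# Assumption: the ANE needs the state width to be a multiple of 32.

def round_up(value, multiple=32):
    """Smallest multiple of `multiple` that is >= `value`."""
    return -(-value // multiple) * multiple  # ceiling division

for width in (270, 255, 256):
    padded = round_up(width)
    print(f"width {width} -> padded {padded} (extra {padded - width})")
```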
Reproduction Code (PyTorch + coremltools):
import torch
import torch.nn as nn
import coremltools as ct
import numpy as np

class RNN_Stateful(nn.Module):
    def __init__(self, hidden_shape):
        super(RNN_Stateful, self).__init__()
        # Simple conv to update state
        self.conv1 = nn.Conv2d(3 + hidden_shape[1], hidden_shape[1], kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(hidden_shape[1], 3, kernel_size=3, padding=1)
        # Buffer that is exposed as a Core ML state via ct.StateType below
        self.register_buffer("hidden_state", torch.ones(hidden_shape, dtype=torch.float16))

    def forward(self, imgs):
        self.hidden_state = self.conv1(torch.cat((imgs, self.hidden_state), dim=1))
        return self.conv2(self.hidden_state)
# h=480, w=255 causes ANE failure. w=256 works.
b, ch, h, w = 1, 3, 480, 255
model = RNN_Stateful((b, ch, h, w)).eval()
traced_model = torch.jit.trace(model, torch.randn(b, 3, h, w))
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name="input_image", shape=(b, 3, h, w), dtype=np.float16)],
    outputs=[ct.TensorType(name="output", dtype=np.float16)],
    states=[ct.StateType(wrapped_type=ct.TensorType(shape=(b, ch, h, w), dtype=np.float16), name="hidden_state")],
    minimum_deployment_target=ct.target.iOS18,
    convert_to="mlprogram"
)
mlmodel.save("rnn_stateful.mlpackage")
Steps to see the error:
Open the generated .mlpackage in Xcode 16.0+.
Go to the Performance tab and run a test on a device with ANE (e.g., iPhone 15/16 or M-series Mac).
The report will fail to generate with the error mentioned above.
Environment:
OS: macOS 15.2
Xcode: 16.3
Hardware: M4
Has anyone else encountered this 32-pixel alignment requirement for StateType tensors on ANE? Is this a known hardware constraint or a bug in the Core ML runtime?
Any insights or workarounds (other than manual padding) would be appreciated.
I am running some experiments with WebGPU using the wgpu crate in Rust. I have some buffers already allocated on the GPU.
Is it possible to use those existing buffers directly as inputs to a predict call in CoreML? I want to avoid GPU-to-CPU download time as much as possible.
Or are there other ways to do something like this? Is this only possible using the new Tensor object that came out with Metal 4?
Deterministic RNG behaviour across Mac M1 CPU and Metal GPU – BigCrush pass & structural diagnostics
Hello,
I am currently working on a research project under ENINCA Consulting, focused on advanced diagnostic tools for pseudorandom number generators (structural metrics, multi-seed stability, cross-architecture reproducibility, and complementary indicators to TestU01).
To validate this diagnostic framework, I prototyped a small non-linear 64-bit PRNG (not as a goal in itself, but simply as a vehicle to test the methodology).
During these evaluations, I observed something interesting on Apple Silicon (Mac M1):
• bit-exact reproducibility between M1 ARM CPU and M1 Metal GPU,
• full BigCrush pass on both CPU and Metal backends,
• excellent p-values,
• stable behaviour across multiple seeds and runs.
This was not the intended objective; the goal was mainly to validate the diagnostic concepts. Still, these results raised some questions about deterministic compute behaviour in Metal.
My question: Is there any official guidance on achieving (or expecting) deterministic RNG or compute behaviour across CPU ↔ Metal GPU on Apple Silicon? More specifically:
• Are deterministic compute kernels expected or guaranteed on Metal for scientific workloads?
• Are there recommended patterns or best practices to ensure reproducibility across GPU generations (M1 → M2 → M3 → M4)?
• Are there known Metal features that can introduce non-determinism?
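As background for these questions, a generator built only from 64-bit integer adds, shifts, XORs, and multiplies has no rounding step, so it should give identical bits on any backend that implements those operations exactly. The sketch below uses the public SplitMix64 recurrence as a stand-in; it is not the proprietary generator described above.

```python
# SplitMix64: an integer-only PRNG, used here as a stand-in example.
# Integer arithmetic modulo 2**64 leaves no room for rounding differences.

MASK64 = (1 << 64) - 1

class SplitMix64:
    def __init__(self, seed):
        self.state = seed & MASK64

    def next(self):
        """Advance the state and return the next 64-bit output."""
        self.state = (self.state + 0x9E3779B97F4A7C15) & MASK64
        z = self.state
        z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
        z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
        return z ^ (z >> 31)

first = SplitMix64(seed=42)
second = SplitMix64(seed=42)
run_a = [first.next() for _ in range(5)]
run_b = [second.next() for _ in range(5)]
print(run_a == run_b)  # same seed, same integer ops, same bits
```

Non-determinism usually enters through floating-point reductions whose evaluation order varies, which this kind of recurrence avoids entirely.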
I am not sharing the internal recurrence (this work is proprietary), but I can discuss the high-level diagnostic observations if helpful.
Thank you for any insight. I am very interested in how the Metal engineering team views deterministic compute patterns on Apple Silicon.
Pascal ENINCA Consulting
Topic: Graphics & Games
SubTopic: Metal
Tags: ML Compute, Metal, Metal Performance Shaders, Apple Silicon
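Guogan Tenglong Enterprise Co., Ltd. (WeChat: 184-7933-278) was founded on August 12, 2006. Here is a short introduction to the company's location and main lines of business. Tenglong is located in a small city on the border next to Lincang, Yunnan Province, called Guogan Laojie (果敢老街); its current location is near the Twin Peaks Tower in the city center. The company mainly operates in tourism, hotel services, construction, technology and gaming, catering, and so on. Although this well-known small city is not large, it has a reputation as a lavish "Little Macau". That is all for today's introduction. If you feel like traveling, you are welcome to visit this small city and experience its local ethnic culture.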
“iOS 26 + BGContinuedProcessingTask: Why does a CPU/ML-intensive job run 4-5× slower in background?”
Hello All,
I’m a mobile-app developer working with iOS 26+ and I’m using BGContinuedProcessingTask to perform background work. My app’s workflow includes the following business logic:
Loading images via PHImageRequest.
Using a CLIP model to extract image embeddings.
Using an .mlmodel-based model to further process those embeddings.
For both model inferences I set computeUnits = .cpuAndNeuralEngine.
When the app is moved to the background, I observe that the same workload (all three steps) becomes on average 4-5× slower than when the app is in the foreground.
In an attempt to diagnose the slowdown, I tried to profile with Xcode Instruments, but since a debugger was attached, the performance in background appeared nearly identical to foreground. Even when I detached the debugger, the measured system resource metrics (process CPU usage, system CPU usage, memory, QoS class, thermal state) showed no meaningful difference.
Below are some of the metrics I captured:
Process CPU: 177% (foreground) → 153% (background), a change of about -24.1 points; still more than 1.5 cores of work.
System CPU: 56.1% → 38.4%, a change of about -17.7 points
Process Memory: 244.8 MB → 218.1 MB
QoS Class: userInitiated in both cases
Thermal State: nominal in both cases
Given these results, I’m finding it hard to pinpoint why the overall latency is so much worse when the app is backgrounded, even though the obvious metrics show little variation.
I suspect the cause may involve P-core vs E-core scheduling, or internal hardware throttling/limit of Neural Engine usage, but I cannot find clear documentation or logging to confirm this.
My question is:
Does anyone know why a CPU (and Neural Engine)-intensive job like this would slow down so dramatically when using BGContinuedProcessingTask in the background on iOS 26+, despite apparent similar resource-usage metrics?
Are there internal iOS scheduling/hardware-allocation behaviors (e.g., falling back to lower-performing cores when backgrounded) that might explain this?
Any pointers to Apple technical notes, system logs, or instrumentation I might use to detect which cores or compute units are being used would be greatly appreciated.
Thank you for your time and any guidance you can provide.
Best regards,
Hello, is it allowed to use the Foundation Models framework in an app submitted for WWDC26? The catch is that Apple Intelligence needs to be enabled in Settings. Does that mean the jury won't be able to fully use the app's AI functionality?
Hi everyone
I'm currently developing an object detection model that should identify up to seven classes in an image. While I usually do development with basic Python and the ultralytics library, I thought I would give Create ML a shot. The experience is actually very nice, except that the model does not seem to use the ANE or GPU (MPS) for accelerated training.
On https://developer.apple.com/machine-learning/create-ml/ it states: "On-device training Train models blazingly fast right on your Mac while taking advantage of CPU and GPU."
Am I doing something wrong?
I'm running the training on:
Apple M1 Pro 16GB
macOS 26.1 (Tahoe)
Xcode 26.1 (Build version 17B55)
It would be super nice to get some feedback or instructions.
Thank you in advance!
Using TensorFlow on Apple Silicon gives inaccurate results compared to a Google Colab GPU (9-15% differences). Here are my install versions for four Anaconda environments. I understand that floating-point precision, batch size, and activation functions can all be factors, but how can this issue, which has persisted for the past 3 years, be rectified?
1.) TF: 2.12.0, Python: 3.10.13, tensorflow-deps: 2.9.0, tensorflow-metal: 1.2.0, h5py: 3.6.0, keras: 2.12.0
2.) TF: 2.19.0, Python: 3.11.0, tensorflow-metal: 1.2.0, h5py: 3.13.0, keras: 3.9.2, jax: 0.6.0, jax-metal: 0.1.1, jaxlib: 0.6.0, ml_dtypes: 0.5.1
3.) TF: 2.19.0, Python: 3.10.13, tensorflow-metal: 1.2.0, h5py: 3.13.0, keras: 3.9.2, ml_dtypes: 0.5.1
4.) TF: 2.16.2, Python: 3.10.16, tensorflow-deps: 2.9.0, tensorflow-macos: 2.16.2, tensorflow-metal: 1.2.0, h5py: 3.13.0, keras: 3.9.2, ml_dtypes: 0.3.2
Install of each env, with a common example:
Create env: conda create --name TF_Env_V2 --no-default-packages
Start env: source TF_Env_Name
ENV_1.) conda install -c apple tensorflow-deps, conda install tensorflow, pip install tensorflow-metal, conda install ipykernel
ENV_2.) conda install pip python==3.11, pip install tensorflow, pip install tensorflow-metal, conda install ipykernel
ENV_3.) conda install pip python==3.10.13, pip install tensorflow, pip install tensorflow-metal, conda install ipykernel
ENV_4.) conda install -c apple tensorflow-deps, pip install tensorflow-macos, pip install tensorflow-metal, conda install ipykernel
Example used on all 4 env:
import tensorflow as tf
cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()
model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100,
)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=64)
Context
I’m deploying large language models on iPhone using llama.cpp. A new iPhone Air (12 GB RAM) reports a Metal MTLDevice.recommendedMaxWorkingSetSize of 8,192 MB, and my attempt to load Llama-2-13B Q4_K (~7.32 GB weights) fails during model initialization.
Environment
Device: iPhone Air (12 GB RAM)
iOS: 26
Xcode: 26.0.1
Build: Metal backend enabled llama.cpp
App runs on device (not Simulator)
What I’m seeing
MTLCreateSystemDefaultDevice().recommendedMaxWorkingSetSize == 8192 MiB
Loading Llama-2-13B Q4_K (7.32 GB) fails to complete. Logs indicate memory pressure / allocation issues consistent with the 8 GB working-set guidance.
Smaller models (e.g., 7B/8B with similar quantization) load and run (8B Q4_K provides around 9 tokens/second decoding speed).
Questions
Is 8,192 MB an expected recommendedMaxWorkingSetSize on a 12 GB iPhone?
What values should I expect on other 2025 devices, including iPhone 17 (8 GB RAM) and iPhone 17 Pro (12 GB RAM)?
Is it strictly enforced by Metal allocations (heaps/buffers), or advisory for best performance/eviction behavior?
Can a process practically exceed this for long-lived buffers without immediate Jetsam risk?
Any guidance for LLM scenarios near the limit?
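A back-of-the-envelope budget shows why this model is tight even though the weights alone fit. The sketch below assumes the 7.32 GB figure is in GiB, an fp16 KV cache, and a 2048-token context, and uses Llama-2-13B's published shape (40 layers, hidden size 5120, no grouped-query attention). It ignores compute scratch buffers, so actual usage would be higher.

```python
# Rough memory budget: weights plus KV cache versus the working-set size.
# Assumptions: weights in GiB, fp16 KV cache, 2048-token context.

MIB = 1024 ** 2
GIB = 1024 ** 3

def kv_cache_bytes(n_layers, hidden_size, context_tokens, bytes_per_value=2):
    """K and V each store one hidden_size vector per layer per token."""
    return 2 * n_layers * hidden_size * context_tokens * bytes_per_value

weights_bytes = 7.32 * GIB
budget_bytes = 8192 * MIB
kv_bytes = kv_cache_bytes(n_layers=40, hidden_size=5120, context_tokens=2048)
total_bytes = weights_bytes + kv_bytes

print(f"weights:  {weights_bytes / MIB:8.1f} MiB")
print(f"KV cache: {kv_bytes / MIB:8.1f} MiB")
print(f"total:    {total_bytes / MIB:8.1f} MiB of {budget_bytes / MIB:.0f} MiB")
```

Under these assumptions the total exceeds 8192 MiB, so a shorter context or a quantized KV cache would be needed to stay within the reported value.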
Does anyone know if ExecuTorch is officially supported or has been successfully used on visionOS? If so, are there any specific build instructions, example projects, or potential issues (like sandboxing or memory limitations) to be aware of when integrating it into an Xcode project for the Vision Pro?
While ExecuTorch has support for iOS, I can't find any official documentation or community examples specifically mentioning visionOS.
Thanks.
WWDC25: Combine Metal 4 machine learning and graphics
The session demonstrated a way to integrate a neural network into the graphics pipeline directly through shaders, using texture compression as the example. However, it does not mention which ML technique is used to compress the texture.
Can anyone point me to some well-known models for the use case shown at WWDC25?
From tensorflow-metal example:
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: )
I know that Apple silicon uses UMA, and that memory copies are typical of CUDA, but wouldn't the GPU memory still be faster overall?
I have an iMac Pro with a Radeon Pro Vega 64 16 GB GPU and an Intel iMac with a Radeon Pro 5700 8 GB GPU.
But using tensorflow-metal is still WAY faster than using the CPUs. Thanks for that. I am surprised the 5700 is twice as fast as the Vega though.
I'm developing a tennis ball tracking feature using Vision Framework in Swift, specifically utilizing VNDetectedObjectObservation and VNTrackObjectRequest.
Occasionally (but not always), I receive the following runtime error:
Failed to perform SequenceRequest: Error Domain=com.apple.Vision Code=9 "Internal error: unexpected tracked object bounding box size" UserInfo={NSLocalizedDescription=Internal error: unexpected tracked object bounding box size}
From my investigation, I suspect the issue arises when the bounding box from the initial observation (VNDetectedObjectObservation) is too small. However, Apple's documentation doesn't clearly define the minimum bounding box size that's considered valid by VNTrackObjectRequest.
Could someone clarify:
What is the minimum acceptable bounding box width and height (normalized) that Vision Framework's VNTrackObjectRequest expects?
Is there any recommended practice or official guidance for bounding box size validation before creating a tracking request?
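Until the real minimum is documented, a defensive check before creating the tracking request might look like the sketch below. It is written in Python only to show the logic; the 0.02 threshold is a hypothetical placeholder, not a value Vision is known to use, and is_box_large_enough is an illustrative name.

```python
# Validate a normalized bounding box before starting a tracker.
# Assumption: min_side=0.02 is a placeholder to be tuned empirically.

def is_box_large_enough(box, min_side=0.02):
    """box is (x, y, width, height) in normalized [0, 1] coordinates."""
    x, y, width, height = box
    inside = x >= 0.0 and y >= 0.0 and x + width <= 1.0 and y + height <= 1.0
    return inside and width >= min_side and height >= min_side

print(is_box_large_enough((0.4, 0.4, 0.1, 0.1)))    # True
print(is_box_large_enough((0.4, 0.4, 0.005, 0.1)))  # False: too narrow
```

Logging the box size whenever the error occurs would help find the actual threshold for a given OS version.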
This information would be extremely helpful to reliably avoid this internal error.
Thank you!
Topic: Media Technologies
SubTopic: Photos & Camera
Tags: ML Compute, Machine Learning, Camera, AVFoundation
I followed the URL below to convert the Llama-3.1-8B-Instruct model, but the conversion always fails, even though I have 64 GB of free space after downloading the model from Hugging Face.
https://machinelearning.apple.com/research/core-ml-on-device-llama
I also tried other models, Llama-3.1-1B-Instruct and Llama-3.1-3B-Instruct. Those convert, but the performance test in Xcode fails for all compute units.
Is there any source code for running Llama models in an iOS app?