MLX Commands
MLX is Apple's machine learning framework designed for Apple Silicon. It features unified CPU/GPU memory, lazy evaluation, composable function transformations, and a NumPy-like API. The mlx-lm package provides high-level tools for LLM inference, fine-tuning, and model conversion.
Installation
# Install MLX core
pip install mlx
# Install MLX LM tools (inference, fine-tuning, conversion)
pip install mlx-lm
# Install MLX for vision and audio
pip install mlx-vlm
pip install mlx-whisper
# Verify installation
python -c "import mlx.core as mx; print(mx.default_device())"
LLM Text Generation
# Generate text from a HuggingFace model (auto-downloads)
mlx_lm.generate \
--model mlx-community/Llama-3.1-8B-Instruct-4bit \
--prompt "Explain transformers in ML:" \
--max-tokens 256
# With sampling parameters
mlx_lm.generate \
--model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
--prompt "Write a haiku about coding:" \
--max-tokens 100 \
--temp 0.7 \
--top-p 0.9
# Chat mode
mlx_lm.chat \
--model mlx-community/Llama-3.1-8B-Instruct-4bit
Python API for Generation
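from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler
# Load model and tokenizer
model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")
# Generate text
prompt = "Explain gradient descent:"
# Recent mlx-lm releases take sampling settings through a sampler object;
# older releases accepted temp=0.7 directly as a keyword argument.
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=256,
    sampler=make_sampler(temp=0.7),
)
print(response)
# Streaming generation
from mlx_lm import stream_generate
# Recent releases yield response objects with a .text field;
# older releases yielded plain strings.
for chunk in stream_generate(model, tokenizer, prompt=prompt, max_tokens=256):
    print(chunk.text, end="", flush=True)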
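Instruct-tuned models such as the one loaded above generally expect their prompt to be wrapped in the model's chat template. The following is a minimal sketch of that step, not the library's prescribed workflow: `build_messages` and `chat_once` are hypothetical helper names, and `apply_chat_template` is assumed to be available on the tokenizer returned by `load` (it comes from the underlying Hugging Face tokenizer).

```python
def build_messages(user_prompt, system_prompt=None):
    """Return a message list in the role/content format used by chat templates."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


def chat_once(model_id, user_prompt, system_prompt=None, max_tokens=256):
    """Load a model and answer one prompt using its chat template.

    mlx_lm is imported lazily so build_messages stays usable without it.
    """
    from mlx_lm import load, generate

    model, tokenizer = load(model_id)
    # Render the messages into the prompt string the model was trained on.
    prompt = tokenizer.apply_chat_template(
        build_messages(user_prompt, system_prompt),
        add_generation_prompt=True,
        tokenize=False,
    )
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)
```

On a machine with mlx-lm installed, `chat_once("mlx-community/Llama-3.1-8B-Instruct-4bit", "Explain gradient descent.")` would return the reply as a string.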
Model Conversion
# Convert HuggingFace model to MLX format
mlx_lm.convert \
--hf-path meta-llama/Llama-3.1-8B-Instruct \
--mlx-path ./mlx-llama-8b \
--dtype float16
# Convert with 4-bit quantization
mlx_lm.convert \
--hf-path meta-llama/Llama-3.1-8B-Instruct \
--mlx-path ./mlx-llama-8b-4bit \
--quantize \
--q-bits 4 \
--q-group-size 64
# Convert with 8-bit quantization
mlx_lm.convert \
--hf-path mistralai/Mistral-7B-Instruct-v0.3 \
--mlx-path ./mlx-mistral-8bit \
--quantize \
--q-bits 8
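To get a feel for what the `--q-bits` and `--q-group-size` settings above mean for disk and memory use, the arithmetic can be sketched as follows. This is a rough estimate under stated assumptions, not a measurement: it assumes each group of weights stores one scale and one bias at 16-bit precision, that every parameter is quantized, and that the model has a round 8 billion parameters. The function names are hypothetical.

```python
def dense_size_gb(num_params, bits=16):
    """Storage in GB (10**9 bytes) for unquantized weights."""
    return num_params * bits / 8 / 1e9


def quantized_size_gb(num_params, q_bits=4, group_size=64, scale_bits=16):
    """Estimated storage in GB for group-wise quantized weights.

    Assumes every group of `group_size` weights carries one scale and
    one bias, each stored with `scale_bits` bits.
    """
    bits_per_weight = q_bits + 2 * scale_bits / group_size
    return num_params * bits_per_weight / 8 / 1e9


num_params = 8e9  # assumed round figure for an "8B" model
print(f"float16:       {dense_size_gb(num_params):.1f} GB")
print(f"4-bit, g=64:   {quantized_size_gb(num_params, 4, 64):.1f} GB")
print(f"8-bit, g=64:   {quantized_size_gb(num_params, 8, 64):.1f} GB")
```

Under these assumptions a 4-bit model with group size 64 costs about 4.5 bits per weight rather than 4, because of the per-group scale and bias. Real checkpoints will differ somewhat, since some layers may be left unquantized.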
LoRA Fine-Tuning
# Fine-tune with LoRA (newer mlx-lm releases rename --lora-layers to --num-layers)
mlx_lm.lora \
--model mlx-community/Llama-3.1-8B-Instruct-4bit \
--data ./training_data \
--train \
--iters 1000 \
--batch-size 4 \
--lora-layers 16 \
--learning-rate 1e-5
# Resume training from checkpoint
mlx_lm.lora \
--model mlx-community/Llama-3.1-8B-Instruct-4bit \
--data ./training_data \
--train \
--resume-adapter-file ./adapters/adapters.safetensors \
--iters 500
# Evaluate after training
mlx_lm.lora \
--model mlx-community/Llama-3.1-8B-Instruct-4bit \
--data ./training_data \
--adapter-path ./adapters \
--test
Fusing LoRA Adapters
# Merge LoRA adapters into base model
mlx_lm.fuse \
--model mlx-community/Llama-3.1-8B-Instruct-4bit \
--adapter-path ./adapters \
--save-path ./fused-model
# Fuse and de-quantize (--de-quantize writes full-precision weights, not a 4-bit model)
mlx_lm.fuse \
--model mlx-community/Llama-3.1-8B-Instruct-4bit \
--adapter-path ./adapters \
--save-path ./fused-model-dequantized \
--de-quantize
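As a reminder of what merging adapters into the base model means, the standard LoRA formulation can be written as below. This is the textbook form and an assumption here; the exact scaling convention and matrix layout used inside mlx-lm may differ. Each adapted weight matrix $W$ is replaced by the base weight plus the scaled low-rank product of the adapter matrices $A$ and $B$ of rank $r$:

```latex
W_{\text{fused}} = W + s \, B A,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},
\quad A \in \mathbb{R}^{r \times d_{\text{in}}}
```

Once the update is folded into $W$, the adapter files are no longer needed at inference time.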
Training Data Format
{"text": "Below is an instruction.\n\n### Instruction:\nExplain gravity.\n\n### Response:\nGravity is a fundamental force..."}
{"text": "Below is an instruction.\n\n### Instruction:\nWhat is DNA?\n\n### Response:\nDNA is a molecule..."}
Place the data files in your data directory as train.jsonl, valid.jsonl, and test.jsonl.
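The layout described above can be produced with a few lines of standard-library Python. The sketch below is one possible way to do it, not a tool shipped with mlx-lm: `write_splits`, `PROMPT_TEMPLATE`, and the 80/10/10 split are illustrative assumptions, with the template copied from the sample records above.

```python
import json
import random
from pathlib import Path

# Same prompt layout as the sample records above.
PROMPT_TEMPLATE = (
    "Below is an instruction.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}"
)


def write_splits(pairs, data_dir, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle (instruction, response) pairs and write train/valid/test JSONL files."""
    records = [
        {"text": PROMPT_TEMPLATE.format(instruction=i, response=r)} for i, r in pairs
    ]
    random.Random(seed).shuffle(records)
    # Keep at least one record in each held-out split.
    n_valid = max(1, int(len(records) * valid_frac))
    n_test = max(1, int(len(records) * test_frac))
    splits = {
        "valid": records[:n_valid],
        "test": records[n_valid:n_valid + n_test],
        "train": records[n_valid + n_test:],
    }
    out = Path(data_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(out / f"{name}.jsonl", "w", encoding="utf-8") as f:
            for row in rows:
                # One JSON object per line.
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return {name: len(rows) for name, rows in splits.items()}
```

Calling `write_splits(pairs, "./training_data")` with a list of `(instruction, response)` tuples writes the three files into the directory passed to `--data`.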
MLX Core API
import mlx.core as mx
# Array creation (like NumPy)
a = mx.array([1.0, 2.0, 3.0])
b = mx.zeros((3, 4))
c = mx.ones((2, 3), dtype=mx.float16)
d = mx.random.normal((4, 4))
# Device placement (unified memory - no explicit transfers)
x = mx.array([1.0, 2.0, 3.0]) # Available on both CPU and GPU
# Basic operations
result = mx.matmul(a.reshape(1, -1), mx.ones((3, 4)))
y = mx.exp(x) + mx.sin(x)
# Lazy evaluation - computations only run when needed
z = mx.add(x, x)
z = mx.multiply(z, 2.0)
mx.eval(z) # Triggers computation
# Automatic differentiation
def loss_fn(x):
    return mx.sum(x ** 2)
grad_fn = mx.grad(loss_fn)
grads = grad_fn(mx.array([1.0, 2.0, 3.0]))
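The result of the automatic differentiation example can be checked by hand. With the loss defined as the sum of squares, each partial derivative is twice the corresponding input:

```latex
L(x) = \sum_i x_i^2,
\qquad \frac{\partial L}{\partial x_i} = 2 x_i,
\qquad \nabla L(1, 2, 3) = (2, 4, 6)
```

So `grads` should evaluate to an array holding 2, 4, and 6.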
MLX Neural Network Module
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim
# Define a model
class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.layers = [
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        ]
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
model = MLP(784, 256, 10)
# Optimizer
optimizer = optim.Adam(learning_rate=1e-3)
# Training step with value_and_grad
def loss_fn(model, x, y):
    logits = model(x)
    return mx.mean(nn.losses.cross_entropy(logits, y))
loss_and_grad_fn = nn.value_and_grad(model, loss_fn)
# Training loop (dataloader is a placeholder for any iterable of (inputs, labels) batches)
for batch_x, batch_y in dataloader:
    loss, grads = loss_and_grad_fn(model, batch_x, batch_y)
    optimizer.update(model, grads)
    mx.eval(model.parameters(), optimizer.state)
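The training loop iterates over a `dataloader` that is never defined. One simple way to supply it is a plain Python generator over in-memory data. The sketch below is an assumption about how such a loader might look, not part of MLX: `iterate_batches` is a hypothetical name, and it yields ordinary lists so that it runs without MLX installed.

```python
import random


def iterate_batches(inputs, labels, batch_size, shuffle=True, seed=None):
    """Yield (batch_inputs, batch_labels) lists covering the dataset once."""
    if len(inputs) != len(labels):
        raise ValueError("inputs and labels must have the same length")
    order = list(range(len(inputs)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        # The final batch may be smaller than batch_size.
        yield [inputs[i] for i in idx], [labels[i] for i in idx]
```

To feed the loop above, each batch would be wrapped in arrays first, for example `dataloader = ((mx.array(bx), mx.array(by)) for bx, by in iterate_batches(train_x, train_y, 32))`, where `train_x` and `train_y` stand for your own data.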
Server Mode
# Start OpenAI-compatible server
mlx_lm.server \
--model mlx-community/Llama-3.1-8B-Instruct-4bit \
--port 8080
# Query the server
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
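The same request can be sent from Python using only the standard library. This is a minimal sketch mirroring the curl call above: `build_chat_request` and `ask` are hypothetical helper names, and the response is assumed to follow the OpenAI chat completions shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(content, base_url="http://localhost:8080", max_tokens=100):
    """Build a POST request equivalent to the curl example."""
    payload = {
        "model": "default",
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(content, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(content, **kwargs)) as resp:
        body = json.load(resp)
    # Assumed response layout: OpenAI-style chat completion.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("Hello!"))` would print the model's reply.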
Performance Tips
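import mlx.core as mx
from mlx_lm import load
# Use mx.compile for repeated operations
@mx.compile
def fast_forward(model, x):
    return model(x)
# Use float16 for faster inference. mlx_lm.load takes no dtype argument,
# so cast after loading (or convert ahead of time with --dtype float16).
model, tokenizer = load("model-path")
model.set_dtype(mx.float16)
# Batch processing (batch_of_inputs is a placeholder for your own data)
inputs = mx.array(batch_of_inputs)
outputs = model(inputs)  # Processes entire batch at once
mx.eval(outputs)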
MLX vs PyTorch Comparison
| Feature | MLX | PyTorch |
|---|---|---|
| Memory model | Unified (shared CPU/GPU) | Explicit transfers |
| Evaluation | Lazy (deferred) | Eager (immediate) |
| Platform | Apple Silicon first (Linux and CUDA builds also exist) | Cross-platform |
| Array API | NumPy-like | NumPy-like |
| Auto-diff | mx.grad | torch.autograd |
| Compilation | mx.compile | torch.compile |
Common Commands
| Task | Command |
|---|---|
| Generate text | mlx_lm.generate --model MODEL --prompt TEXT |
| Interactive chat | mlx_lm.chat --model MODEL |
| Convert model | mlx_lm.convert --hf-path HF_MODEL --mlx-path OUTPUT |
| Quantize model | mlx_lm.convert --hf-path MODEL --quantize --q-bits 4 |
| LoRA fine-tune | mlx_lm.lora --model MODEL --data DIR --train |
| Fuse adapters | mlx_lm.fuse --model MODEL --adapter-path DIR |
| Start server | mlx_lm.server --model MODEL --port 8080 |