Hugging Face is the central platform for the ML community: it hosts 500k+ models and 100k+ datasets and maintains the libraries that power modern NLP, vision, audio, and multimodal AI. The ecosystem spans training, fine-tuning, inference, and deployment.
Hub: https://huggingface.co
Docs: https://huggingface.co/docs
GitHub: https://github.com/huggingface
Installation
Core Libraries
# Core Transformers (PyTorch backend)
pip install transformers torch
# With TensorFlow backend
pip install transformers tensorflow
# Full ecosystem (recommended for development)
pip install transformers datasets tokenizers accelerate peft
# Hub CLI tool
pip install huggingface_hub
# Python client for a Text Generation Inference (TGI) server (the server itself is typically run from its Docker image)
pip install text-generation
# For quantization support
pip install bitsandbytes
pip install auto-gptq
pip install autoawq
Hub CLI Authentication
# Login with token from https://huggingface.co/settings/tokens
huggingface-cli login
# Login non-interactively (CI/CD)
huggingface-cli login --token $HF_TOKEN
# Check current user
huggingface-cli whoami
# Set token via environment variable
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxx"
Configuration
Environment Variables
# Authentication
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxx"
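# Cache directory (default: ~/.cache/huggingface/)
export HF_HOME="/data/hf_cache"
# Hub cache for downloaded repos (default: $HF_HOME/hub); replaces the deprecated TRANSFORMERS_CACHE
export HF_HUB_CACHE="/data/hf_cache/hub"
# Offline mode (use only cached files, make no network calls)
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
export HF_DATASETS_OFFLINE=1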
# Disable progress bars in production
export HF_HUB_DISABLE_PROGRESS_BARS=1
# Mirror for regions with restricted access
export HF_ENDPOINT="https://hf-mirror.com"
# Disable telemetry
export HF_HUB_DISABLE_TELEMETRY=1
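The cache variables above interact, so it can help to see the lookup order spelled out. The following is a minimal sketch, assuming the precedence `HF_HUB_CACHE`, then `$HF_HOME/hub`, then `~/.cache/huggingface/hub`; `resolve_hub_cache` is a hypothetical helper written for illustration (it is not part of `huggingface_hub`), and it ignores `XDG_CACHE_HOME`, which the real library also consults.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_hub_cache(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the directory where Hub downloads are expected to be cached.

    Hypothetical helper: checks HF_HUB_CACHE first, then falls back to
    $HF_HOME/hub, then to ~/.cache/huggingface/hub.
    """
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


# With only HF_HOME set, the hub cache lives in its "hub" subdirectory.
print(resolve_hub_cache({"HF_HOME": "/data/hf_cache"}))
```

Passing a plain dict instead of relying on `os.environ` keeps the helper easy to test; calling it with no argument reads the real environment.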
Token Configuration File
import os
from huggingface_hub import HfApi, login
# Programmatic login; read the token from the environment rather than hardcoding it
login(token=os.environ["HF_TOKEN"], add_to_git_credential=True)
# Configure API instance
api = HfApi(token=os.environ["HF_TOKEN"])
Core API / CLI Commands
Hub CLI
| Command | Description |
|---|---|
| `huggingface-cli login` | Authenticate with HF token |
| `huggingface-cli logout` | Remove stored credentials |
| `huggingface-cli whoami` | Show current authenticated user |
| `huggingface-cli download <repo>` | Download model/dataset repo |
| `huggingface-cli upload <repo> <path>` | Upload file(s) to Hub |
| `huggingface-cli repo create <name>` | Create new repository |
| `huggingface-cli repo delete <name>` | Delete repository |
| `huggingface-cli lfs-enable-largefiles .` | Enable Git LFS for large files |
| `huggingface-cli scan-cache` | Inspect local cache |
| `huggingface-cli delete-cache` | Delete unused cache entries |
| `huggingface-cli env` | Print environment info |
Python Hub API
| Method | Description |
|---|---|
| `hf_hub_download(repo_id, filename)` | Download single file from Hub |
| `snapshot_download(repo_id)` | Download entire repository |
| `api.upload_file(path, repo_id, path_in_repo)` | Upload single file |
| `api.upload_folder(folder_path, repo_id)` | Upload directory |
| `api.create_repo(repo_id)` | Create new model/dataset repo |
| `api.delete_repo(repo_id)` | Delete repository |
| `api.list_models(author="username")` | List models by author |
| `api.model_info(repo_id)` | Get model metadata |
| `api.dataset_info(repo_id)` | Get dataset metadata |
| `api.list_repo_files(repo_id)` | List files in repo |
Pipeline Tasks
| Task | Pipeline Type |
|---|---|
| Text generation | "text-generation" |
| Text classification | "text-classification" |
| Named entity recognition | "ner" |
| Question answering | "question-answering" |
| Summarization | "summarization" |
| Translation | "translation" |
| Image classification | "image-classification" |
| Object detection | "object-detection" |
| Audio classification | "audio-classification" |
| Automatic speech recognition | "automatic-speech-recognition" |
Advanced Usage
Loading and Running Models
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
# Quick pipeline (auto-downloads model)
generator = pipeline("text-generation", model="meta-llama/Llama-3.2-1B")
result = generator("Once upon a time", max_new_tokens=100)
print(result[0]["generated_text"])
# Full control with AutoClasses
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # Automatically distributes across available GPUs
    # 4-bit quantization via bitsandbytes; passing load_in_4bit=True directly is deprecated
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)
# Tokenize and generate
inputs = tokenizer("Explain transformers:", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
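The `temperature` and `top_p` arguments passed to `generate` above reshape the next-token distribution before sampling. The sketch below illustrates the idea in plain Python on a toy vocabulary; it is a simplified illustration, not the Transformers implementation. The function names and the toy logits are made up for the example, and the exact tie-breaking and threshold handling in the library may differ.

```python
import math
import random
from typing import Dict


def top_p_distribution(
    logits: Dict[str, float], temperature: float, top_p: float
) -> Dict[str, float]:
    """Apply temperature scaling, then nucleus (top-p) filtering.

    Keeps the smallest set of highest-probability tokens whose cumulative
    probability reaches top_p, and renormalizes over that set.
    """
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}


def sample(dist: Dict[str, float], rng: random.Random) -> str:
    """Draw one token from the filtered distribution."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


# Toy logits for four candidate tokens (illustrative values only).
toy_logits = {"the": 3.0, "a": 2.0, "cat": 0.5, "zebra": -2.0}
dist = top_p_distribution(toy_logits, temperature=0.7, top_p=0.9)
print(dist)  # only "the" and "a" survive the top-p cut
print(sample(dist, random.Random(0)))
```

Lower temperatures sharpen the distribution, so fewer tokens are needed to reach `top_p`; higher temperatures flatten it and let more candidates through.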
Datasets Library
from datasets import load_dataset, Dataset, DatasetDict
import pandas as pd
# Load from Hub
ds = load_dataset("squad") # Full dataset
ds = load_dataset("squad", split="train") # Single split
ds = load_dataset("squad", split="train[:1000]") # First 1000 rows
# Load from local files
ds = load_dataset("csv", data_files="data.csv")
ds = load_dataset("json", data_files="data.jsonl")
ds = load_dataset("parquet", data_files="data.parquet")
# Create from Python objects
df = pd.DataFrame({"text": ["hello", "world"], "label": [0, 1]})
ds = Dataset.from_pandas(df)
# Preprocessing
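ds = ds.map(lambda x: {"text_len": len(x["text"])})  # Add a derived column
ds = ds.filter(lambda x: x["text_len"] > 3)  # Keep rows that pass the predicate
ds = ds.rename_column("label", "labels")
ds = ds.remove_columns(["text_len"])  # Drop columns that are no longer needed
ds = ds.train_test_split(test_size=0.2)  # Returns a DatasetDict with "train" and "test"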
# Push to Hub
ds.push_to_hub("username/my-dataset", private=True)
PEFT / LoRA Fine-Tuning
from peft import LoraConfig, get_peft_model, TaskType, PeftModel
from transformers import TrainingArguments, Trainer
# Configure LoRA
lora_config = LoraConfig(
    r=16,  # LoRA rank
    lora_alpha=32,  # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
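# Wrap model with LoRA (base_model is a model already loaded with from_pretrained)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Example output (exact counts depend on the base model and on r):
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06%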
# Training
training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    save_strategy="epoch",
    push_to_hub=True,
    hub_model_id="username/my-lora-model",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)
trainer.train()
# Load fine-tuned PEFT model
model = PeftModel.from_pretrained(base_model, "username/my-lora-model")
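To see where the trainable-parameter count reported by `print_trainable_parameters()` comes from, the arithmetic can be sketched directly. Each adapted linear layer gains two low-rank matrices, A (r x d_in) and B (d_out x r), so it adds r * (d_in + d_out) parameters. The numbers below assume a hypothetical Llama-style 7B model with 32 layers, a hidden size of 4096, and square `q_proj` and `v_proj` projections (that is, no grouped-query attention); real models may differ, and `lora_trainable_params` is an illustrative helper, not a PEFT function.

```python
def lora_trainable_params(
    r: int, d_in: int, d_out: int, modules_per_layer: int, num_layers: int
) -> int:
    """Count the parameters LoRA adds across all adapted linear layers."""
    per_module = r * (d_in + d_out)  # A is r x d_in, B is d_out x r
    return per_module * modules_per_layer * num_layers


# Assumed architecture (hypothetical 7B Llama-style model).
hidden_size = 4096
num_layers = 32
adapted_modules = 2  # q_proj and v_proj

for rank in (8, 16):
    count = lora_trainable_params(
        rank, hidden_size, hidden_size, adapted_modules, num_layers
    )
    print(f"r={rank}: {count:,} trainable parameters")
# r=8: 4,194,304 trainable parameters
# r=16: 8,388,608 trainable parameters

# The adapter update is scaled by lora_alpha / r before being added to the base weights.
lora_alpha, r = 32, 16
print(f"scaling = {lora_alpha / r}")  # scaling = 2.0
```

Doubling the rank doubles the adapter size, which stays a small fraction of a percent of a 7B-parameter base model either way.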
Accelerate (Multi-GPU / Mixed Precision)
from accelerate import Accelerator
from accelerate.utils import set_seed
# Initialize accelerator
accelerator = Accelerator(
    mixed_precision="bf16",
    gradient_accumulation_steps=4,
    log_with="wandb",
)
# Prepare all components
model, optimizer, train_loader, scheduler = accelerator.prepare(
    model, optimizer, train_loader, scheduler
)
# Training loop
for batch in train_loader:
    with accelerator.accumulate(model):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
# Save
accelerator.wait_for_everyone()
unwrapped = accelerator.unwrap_model(model)
unwrapped.save_pretrained("./output", save_function=accelerator.save)
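The training loop above calls `optimizer.step()` on every micro-batch, yet with `gradient_accumulation_steps=4` the weights are only meant to change once per accumulation window. The following plain-Python sketch models that bookkeeping under two assumptions: an update happens every N micro-batches, and a final partial window is flushed at the end of the dataloader. Both helper functions are illustrative and not part of Accelerate.

```python
def effective_batch_size(
    per_device: int, accumulation_steps: int, num_devices: int = 1
) -> int:
    """Number of samples that contribute to each real optimizer update."""
    return per_device * accumulation_steps * num_devices


def count_optimizer_updates(num_micro_batches: int, accumulation_steps: int) -> int:
    """Count real weight updates when gradients are accumulated.

    Gradients are summed over micro-batches; an update is applied on every
    accumulation boundary and once more at the end if a partial window remains.
    """
    updates = 0
    for index in range(1, num_micro_batches + 1):
        at_boundary = index % accumulation_steps == 0
        at_end = index == num_micro_batches
        if at_boundary or at_end:
            updates += 1
    return updates


print(effective_batch_size(per_device=4, accumulation_steps=4, num_devices=2))  # 32
print(count_optimizer_updates(num_micro_batches=10, accumulation_steps=4))      # 3
```

When comparing learning rates across setups, the effective batch size is usually the quantity to hold constant, not the per-device batch size.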
Hub File Operations
from huggingface_hub import HfApi, hf_hub_download, snapshot_download
api = HfApi()
# Download single file
path = hf_hub_download(
    repo_id="meta-llama/Llama-3.2-1B",
    filename="config.json",
    cache_dir="/data/models",
)
# Download entire repo
local_dir = snapshot_download(
    repo_id="mistralai/Mistral-7B-v0.1",
    ignore_patterns=["*.bin", "*.h5"],  # Skip PyTorch .bin and TensorFlow .h5 weights (keeps .safetensors)
    local_dir="/data/models/mistral",
)
# Upload folder
api.upload_folder(
    folder_path="./my-model-output",
    repo_id="username/my-model",
    repo_type="model",
    commit_message="Add fine-tuned weights",
)
# Load the existing model card and update its metadata
from huggingface_hub import ModelCard
card = ModelCard.load("username/my-model")
card.data.license = "apache-2.0"
card.data.language = ["en"]
card.push_to_hub("username/my-model")
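The `ignore_patterns` argument passed to `snapshot_download` above selects files with shell-style wildcards. As a rough illustration, assuming fnmatch-style matching, the filtering can be sketched with the standard library. `filter_files` and the file list are hypothetical and exist only for this example; the library's own filtering may handle edge cases differently.

```python
from fnmatch import fnmatch
from typing import List, Optional, Sequence


def filter_files(
    files: Sequence[str],
    allow_patterns: Optional[Sequence[str]] = None,
    ignore_patterns: Optional[Sequence[str]] = None,
) -> List[str]:
    """Keep files that match an allow pattern (if any) and no ignore pattern."""
    kept = []
    for name in files:
        if allow_patterns is not None and not any(
            fnmatch(name, pattern) for pattern in allow_patterns
        ):
            continue
        if ignore_patterns is not None and any(
            fnmatch(name, pattern) for pattern in ignore_patterns
        ):
            continue
        kept.append(name)
    return kept


# Hypothetical repository listing.
repo_files = [
    "config.json",
    "model.safetensors",
    "pytorch_model.bin",
    "tf_model.h5",
    "tokenizer.json",
]
print(filter_files(repo_files, ignore_patterns=["*.bin", "*.h5"]))
# ['config.json', 'model.safetensors', 'tokenizer.json']
```

Running a filter like this over `api.list_repo_files(repo_id)` is one way to preview what a download would fetch before starting it.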
Common Workflows
Download a Model for Offline Use
# CLI download
huggingface-cli download meta-llama/Llama-3.2-1B --local-dir ./models/llama
# Download specific files only
huggingface-cli download mistralai/Mistral-7B-v0.1 \
    --include "*.safetensors" "config.json" "tokenizer*" \
    --local-dir ./models/mistral
# Then use offline
export TRANSFORMERS_OFFLINE=1
python my_script.py
Push a Fine-Tuned Model to Hub
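from huggingface_hub import HfApi
# model and tokenizer are the fine-tuned objects from your training run
model.push_to_hub("username/my-finetuned-model", private=True)
tokenizer.push_to_hub("username/my-finetuned-model")
# Or via save_pretrained + upload
model.save_pretrained("./local-model")
tokenizer.save_pretrained("./local-model")
api = HfApi()
api.create_repo("username/my-finetuned-model", private=True, exist_ok=True)
api.upload_folder(
    folder_path="./local-model",
    repo_id="username/my-finetuned-model",
)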
Deploy to Hugging Face Spaces
# Create a Space via CLI
huggingface-cli repo create my-space --type space --space_sdk gradio
# Clone and develop
git clone https://huggingface.co/spaces/username/my-space
cd my-space
# Add app.py (Gradio example)
cat > app.py << 'EOF'
import gradio as gr
from transformers import pipeline
pipe = pipeline("text-generation", model="gpt2")
def generate(prompt):
    return pipe(prompt, max_new_tokens=100)[0]["generated_text"]
gr.Interface(fn=generate, inputs="text", outputs="text").launch()
EOF
# Push to deploy
git add . && git commit -m "Add app" && git push
Tokenizer Usage
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
# Basic encode/decode
tokens = tokenizer("Hello world!", return_tensors="pt")
text = tokenizer.decode(tokens["input_ids"][0])
# Chat template (instruction models)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Batch tokenization with padding/truncation
# Some causal LM tokenizers define no pad token; reuse EOS so padding=True works
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
batch = tokenizer(
    ["Hello", "A longer sentence here"],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
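What `padding=True` and `truncation=True` do to a batch can be sketched without a tokenizer. The example below works on made-up token ID lists and assumes right-side padding with a pad ID of 0; real tokenizers choose their own pad ID and padding side, and `pad_batch` is an illustrative helper, not a Transformers function.

```python
from typing import Dict, List, Optional


def pad_batch(
    sequences: List[List[int]], pad_id: int = 0, max_length: Optional[int] = None
) -> Dict[str, List[List[int]]]:
    """Truncate to max_length, then right-pad every sequence to the longest one.

    The attention mask marks real tokens with 1 and padding with 0.
    """
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs standing in for two tokenized sentences.
batch = pad_batch([[101, 7], [101, 8, 9, 10, 11]], pad_id=0, max_length=4)
print(batch["input_ids"])       # [[101, 7, 0, 0], [101, 8, 9, 10]]
print(batch["attention_mask"])  # [[1, 1, 0, 0], [1, 1, 1, 1]]
```

The attention mask is what lets the model ignore padded positions, which is why it should be passed along with `input_ids`.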
Tips and Best Practices
| Topic | Recommendation |
|---|---|
| Model format | Prefer .safetensors over .bin for security and speed |
| Quantization | Use 4-bit (bitsandbytes) for 7B+ models on consumer GPUs |
| Cache management | Set HF_HOME to a disk with ample space; run huggingface-cli scan-cache regularly |
| Offline use | After downloading, set HF_HUB_OFFLINE=1 (or TRANSFORMERS_OFFLINE=1) so loads use only the local cache and make no network calls |
| Device mapping | Use device_map="auto" for multi-GPU; device_map="cuda:0" for single GPU |
| Batch inference | Pass a dataset or generator to the pipeline and set its batch_size param for throughput |
| Model cards | Always fill out model card metadata for discoverability and reproducibility |
| Gated models | Accept terms on Hub web UI first; then login() with token that has read access |
| PEFT merging | Call model.merge_and_unload() before export to bake LoRA weights in |
| Reproducibility | Pin library versions: transformers==4.x.x datasets==2.x.x in requirements |
| Memory | Use torch.cuda.empty_cache() and gc.collect() between large model loads |
| Streaming | Use TextIteratorStreamer from transformers for token-by-token streaming |
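The quantization tip in the table above follows from simple arithmetic on weight storage. The sketch below estimates the memory needed for model weights alone at different precisions; it deliberately ignores activations, the KV cache, and quantization overhead such as scaling constants, so real usage will be higher. `weight_memory_gib` is an illustrative helper, and the 7-billion parameter count is a round number chosen for the example.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


num_params = 7_000_000_000  # round figure for a "7B" model

for label, bits in (("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)):
    print(f"{label:>9}: {weight_memory_gib(num_params, bits):.1f} GiB")
```

Under these assumptions a 7B model drops from roughly 13 GiB of weights at 16-bit precision to a little over 3 GiB at 4-bit, which is what brings it within reach of consumer GPUs.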