Hugging Face

Hugging Face is the central platform for the ML community: it hosts more than 500k models and 100k datasets and maintains the libraries behind much of modern NLP, vision, audio, and multimodal AI. The ecosystem spans training, fine-tuning, inference, and deployment.

Hub: https://huggingface.co
Docs: https://huggingface.co/docs
GitHub: https://github.com/huggingface

Installation

Core Libraries

# Core Transformers (PyTorch backend)
pip install transformers torch

# With TensorFlow backend
pip install transformers tensorflow

# Full ecosystem (recommended for development)
pip install transformers datasets tokenizers accelerate peft

# Hub CLI tool
pip install huggingface_hub

# Python client for Text Generation Inference (the TGI server itself runs separately, typically via Docker)
pip install text-generation

# For quantization support
pip install bitsandbytes
pip install auto-gptq
pip install autoawq

Hub CLI Authentication

# Login with token from https://huggingface.co/settings/tokens
huggingface-cli login

# Login non-interactively (CI/CD); HF_TOKEN must already be set in the environment
huggingface-cli login --token "$HF_TOKEN"

# Check current user
huggingface-cli whoami

# Or skip login entirely: the Hub libraries read the token from HF_TOKEN
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxx"

Configuration

Environment Variables

# Authentication
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxx"

# Cache directory (HF_HOME default: ~/.cache/huggingface/)
export HF_HOME="/data/hf_cache"
# Hub download cache (default: $HF_HOME/hub); replaces the deprecated TRANSFORMERS_CACHE
export HF_HUB_CACHE="/data/hf_cache/hub"

# Offline mode (use only cached files, make no network requests)
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
export HF_DATASETS_OFFLINE=1

# Disable progress bars in production
export HF_HUB_DISABLE_PROGRESS_BARS=1

# Mirror for regions with restricted access
export HF_ENDPOINT="https://hf-mirror.com"

# Disable telemetry
export HF_HUB_DISABLE_TELEMETRY=1
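
The cache variables above are resolved in a fixed order. As a rough illustration, the sketch below mimics that lookup with the standard library only. `resolve_hub_cache` is a hypothetical helper written for this example, not part of `huggingface_hub`, and it assumes the documented defaults: `HF_HUB_CACHE` falls back to `$HF_HOME/hub`, and `HF_HOME` falls back to `$XDG_CACHE_HOME/huggingface` (usually `~/.cache/huggingface`).

```python
import os
from pathlib import Path


def resolve_hub_cache(env: dict) -> Path:
    """Hypothetical helper: approximate where Hub downloads are cached."""
    # An explicit HF_HUB_CACHE wins over everything else.
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    # Otherwise the Hub cache is the "hub" subfolder of HF_HOME.
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    # Assumed default: $XDG_CACHE_HOME/huggingface/hub, else ~/.cache/huggingface/hub.
    xdg = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(xdg) / "huggingface" / "hub"


if __name__ == "__main__":
    # Show where downloads would land for the current shell environment.
    print(resolve_hub_cache(dict(os.environ)))
```

Running the script after changing `HF_HOME` is a quick way to check that large downloads will land on the intended disk.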

Programmatic Login

import os
from huggingface_hub import HfApi, login

# Read the token from the environment instead of hard-coding it
token = os.environ["HF_TOKEN"]

# Programmatic login (also stores the token in the git credential helper)
login(token=token, add_to_git_credential=True)

# Configure API instance
api = HfApi(token=token)
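
Tokens are secrets, so scripts usually read them from the environment and never print them in full. The following is a minimal sketch of that pattern using only the standard library; `get_hf_token` and `mask_token` are hypothetical helpers invented for this example.

```python
import os


def get_hf_token(env=None):
    """Hypothetical helper: return the Hub token from the environment, or None."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def mask_token(token):
    """Keep the first 3 and last 4 characters so logs never leak the secret."""
    if len(token) <= 8:
        return "*" * len(token)
    return token[:3] + "*" * (len(token) - 7) + token[-4:]


if __name__ == "__main__":
    token = get_hf_token()
    print("HF_TOKEN:", mask_token(token) if token else "not set")
```

The returned value can be passed to `login(token=...)` or `HfApi(token=...)`. When `HF_TOKEN` is set, the libraries generally pick it up without an explicit login.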

Core API / CLI Commands

Hub CLI

Command	Description
`huggingface-cli login`	Authenticate with HF token
`huggingface-cli logout`	Remove stored credentials
`huggingface-cli whoami`	Show current authenticated user
`huggingface-cli download <repo>`	Download model/dataset repo
`huggingface-cli upload <repo> <path>`	Upload file(s) to Hub
`huggingface-cli repo create <name>`	Create new repository
`huggingface-cli repo delete <name>`	Delete repository (not available in every `huggingface_hub` release; `api.delete_repo` is the Python equivalent)
`huggingface-cli lfs-enable-largefiles .`	Configure the local git repo to upload files larger than 5 GB
`huggingface-cli scan-cache`	Inspect local cache
`huggingface-cli delete-cache`	Interactively select cached revisions to delete
`huggingface-cli env`	Print environment info

Python Hub API

Method	Description
`hf_hub_download(repo_id, filename)`	Download single file from Hub
`snapshot_download(repo_id)`	Download entire repository
`api.upload_file(path_or_fileobj=..., path_in_repo=..., repo_id=...)`	Upload single file (keyword arguments only)
`api.upload_folder(folder_path=..., repo_id=...)`	Upload directory (keyword arguments only)
`api.create_repo(repo_id)`	Create new model/dataset repo
`api.delete_repo(repo_id)`	Delete repository
`api.list_models(author="username")`	List models by author
`api.model_info(repo_id)`	Get model metadata
`api.dataset_info(repo_id)`	Get dataset metadata
`api.list_repo_files(repo_id)`	List files in repo

Transformers Pipeline API

Task	Pipeline Type
Text generation	`"text-generation"`
Text classification	`"text-classification"`
Named entity recognition	`"ner"`
Question answering	`"question-answering"`
Summarization	`"summarization"`
Translation	`"translation"`
Image classification	`"image-classification"`
Object detection	`"object-detection"`
Audio classification	`"audio-classification"`
Automatic speech recognition	`"automatic-speech-recognition"`

Advanced Usage

Loading and Running Models

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
import torch

# Quick pipeline (auto-downloads model)
generator = pipeline("text-generation", model="meta-llama/Llama-3.2-1B")
result = generator("Once upon a time", max_new_tokens=100)
print(result[0]["generated_text"])

# Full control with AutoClasses
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit quantization via bitsandbytes, configured explicitly
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in bf16, store weights in 4-bit
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",          # Automatically distributes across GPUs
    quantization_config=bnb_config,
)

# Tokenize and generate
inputs = tokenizer("Explain transformers:", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
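
The `temperature` and `top_p` arguments above reshape the next-token distribution before sampling. The sketch below shows the idea in plain Python on a toy list of logits. It is a simplified illustration, not the library's implementation, and `top_p_filter` is a name made up for this example.

```python
import math


def top_p_filter(logits, temperature=0.7, top_p=0.9):
    """Return {token_index: probability} after temperature scaling and nucleus filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.1, -3.0]  # made-up scores for a 4-token vocabulary
print(top_p_filter(toy_logits))
```

A token would then be drawn from the surviving entries, for example with `random.choices(list(d), weights=list(d.values()))`. Lower `top_p` or `temperature` values leave fewer candidates and make output more deterministic.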

Datasets Library

from datasets import load_dataset, Dataset, DatasetDict
import pandas as pd

# Load from Hub
ds = load_dataset("squad")                  # Full dataset
ds = load_dataset("squad", split="train")   # Single split
ds = load_dataset("squad", split="train[:1000]")  # First 1000 rows

# Load from local files
ds = load_dataset("csv", data_files="data.csv")
ds = load_dataset("json", data_files="data.jsonl")
ds = load_dataset("parquet", data_files="data.parquet")

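# Create from Python objects
df = pd.DataFrame({
    "text": ["hello world, how are you?", "a second, longer example", "short"],
    "label": [0, 1, 0],
    "unused_col": ["a", "b", "c"],
})
ds = Dataset.from_pandas(df)

# Preprocessing
ds = ds.map(lambda x: {"text_len": len(x["text"])})  # Add a derived column
ds = ds.filter(lambda x: x["text_len"] > 10)          # Drops the "short" row
ds = ds.rename_column("label", "labels")
ds = ds.remove_columns(["unused_col"])                # Column must exist, else ValueError
ds = ds.train_test_split(test_size=0.2)               # Returns a DatasetDict (train/test)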

# Push to Hub
ds.push_to_hub("username/my-dataset", private=True)

PEFT / LoRA Fine-Tuning

from peft import LoraConfig, get_peft_model, TaskType, PeftModel
from transformers import TrainingArguments, Trainer

# Configure LoRA
lora_config = LoraConfig(
    r=16,                           # LoRA rank
    lora_alpha=32,                  # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

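# Wrap model with LoRA (base_model is a causal LM loaded via AutoModelForCausalLM.from_pretrained)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; for a 7B Llama-style model
# (32 layers, hidden size 4096) this config trains about 8.4M parameters (~0.12%)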

# Training
training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    save_strategy="epoch",
    push_to_hub=True,
    hub_model_id="username/my-lora-model",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)
trainer.train()

# Load fine-tuned PEFT model (attach the saved adapter to a freshly loaded base model)
model = PeftModel.from_pretrained(base_model, "username/my-lora-model")
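
The trainable-parameter count reported by `print_trainable_parameters()` can be checked by hand: each adapted linear layer gains two low-rank matrices, A (r x in) and B (out x r). The sketch below works through that arithmetic under assumed shapes (hidden size 4096, 32 decoder layers, square `q_proj` and `v_proj`, and an assumed base size of 6,738,415,616 parameters). None of these figures come from a specific checkpoint named in this document.

```python
def lora_trainable_params(r, in_features, out_features, num_modules):
    """Each adapted linear layer adds A (r x in) and B (out x r)."""
    per_module = r * in_features + out_features * r
    return per_module * num_modules


# Assumed shapes: hidden size 4096, 32 layers, q_proj and v_proj both 4096 -> 4096.
hidden, layers, targets_per_layer = 4096, 32, 2
trainable = lora_trainable_params(16, hidden, hidden, layers * targets_per_layer)

base_params = 6_738_415_616  # assumed base model size, for illustration only
share = 100 * trainable / (base_params + trainable)
print(f"trainable params: {trainable:,} ({share:.4f}% of all parameters)")
```

Halving the rank to `r=8` halves the adapter size to 4,194,304 parameters under the same assumptions, which is why rank is the main lever for adapter size.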

Accelerate (Multi-GPU / Mixed Precision)

from accelerate import Accelerator
from accelerate.utils import set_seed

set_seed(42)  # Seed Python, NumPy, and PyTorch for reproducibility

# Initialize accelerator
accelerator = Accelerator(
    mixed_precision="bf16",
    gradient_accumulation_steps=4,
    log_with="wandb",
)

# Prepare all components
model, optimizer, train_loader, scheduler = accelerator.prepare(
    model, optimizer, train_loader, scheduler
)

# Training loop
for batch in train_loader:
    with accelerator.accumulate(model):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

# Save
accelerator.wait_for_everyone()
unwrapped = accelerator.unwrap_model(model)
unwrapped.save_pretrained("./output", save_function=accelerator.save)
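
With gradient accumulation and several processes, the number of samples behind each optimizer update is larger than the per-device batch size. A minimal sketch of the arithmetic, where `effective_batch_size` is a hypothetical helper and the GPU count is an assumption for illustration:

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_processes=1):
    """Samples contributing to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_processes


# Per-device batch of 4 with 4 accumulation steps, as in the configs above.
print(effective_batch_size(4, 4))                   # single GPU
print(effective_batch_size(4, 4, num_processes=8))  # assumed 8-GPU node
```

The learning rate is usually tuned against this effective value, so it is worth recomputing whenever the GPU count or accumulation steps change.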

Hub File Operations

from huggingface_hub import HfApi, hf_hub_download, snapshot_download

api = HfApi()

# Download single file
path = hf_hub_download(
    repo_id="meta-llama/Llama-3.2-1B",
    filename="config.json",
    cache_dir="/data/models",
)

# Download entire repo
local_dir = snapshot_download(
    repo_id="mistralai/Mistral-7B-v0.1",
    ignore_patterns=["*.bin", "*.h5"],  # Skip heavy weights
    local_dir="/data/models/mistral",
)

# Upload folder
api.upload_folder(
    folder_path="./my-model-output",
    repo_id="username/my-model",
    repo_type="model",
    commit_message="Add fine-tuned weights",
)

# Update model card metadata (loads the existing README.md from the Hub)
from huggingface_hub import ModelCard
card = ModelCard.load("username/my-model")
card.data.license = "apache-2.0"
card.data.language = ["en"]
card.push_to_hub("username/my-model")

Common Workflows

Download a Model for Offline Use

# CLI download
huggingface-cli download meta-llama/Llama-3.2-1B --local-dir ./models/llama

# Download specific files only
huggingface-cli download mistralai/Mistral-7B-v0.1 \
  --include "*.safetensors" "config.json" "tokenizer*" \
  --local-dir ./models/mistral

# Then use offline
export TRANSFORMERS_OFFLINE=1
python my_script.py
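
Include and exclude patterns are glob-style filters over the file names in a repository. The sketch below approximates that filtering with the standard library's `fnmatch`, which helps when previewing which files a pattern set would select. The file listing is hypothetical and `select_files` is a helper invented for this example, not a Hub API.

```python
from fnmatch import fnmatch


def select_files(filenames, include=None, exclude=None):
    """Keep names matching any include pattern and no exclude pattern."""
    selected = []
    for name in filenames:
        if include and not any(fnmatch(name, p) for p in include):
            continue
        if exclude and any(fnmatch(name, p) for p in exclude):
            continue
        selected.append(name)
    return selected


# Hypothetical repository listing, for illustration only.
repo_files = [
    "config.json",
    "model-00001-of-00002.safetensors",
    "pytorch_model-00001-of-00002.bin",
    "tokenizer.json",
    "tokenizer_config.json",
    "README.md",
]
print(select_files(repo_files, include=["*.safetensors", "config.json", "tokenizer*"]))
```

Passing only `exclude=["*.bin", "*.h5"]` keeps everything except the legacy weight formats, which mirrors the `ignore_patterns` style of filtering.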

Push a Fine-Tuned Model to Hub

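from huggingface_hub import HfApi
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned checkpoint (path is a placeholder for your training output)
model = AutoModelForCausalLM.from_pretrained("./my-model-output")
tokenizer = AutoTokenizer.from_pretrained("./my-model-output")

model.push_to_hub("username/my-finetuned-model", private=True)
tokenizer.push_to_hub("username/my-finetuned-model")

# Or via save_pretrained + upload
model.save_pretrained("./local-model")
tokenizer.save_pretrained("./local-model")
api = HfApi()
api.upload_folder(
    folder_path="./local-model",
    repo_id="username/my-finetuned-model",
)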

Deploy to Hugging Face Spaces

# Create a Space via CLI (recent huggingface_hub releases spell these flags --repo-type and --space-sdk)
huggingface-cli repo create my-space --type space --space_sdk gradio

# Clone and develop
git clone https://huggingface.co/spaces/username/my-space
cd my-space

# Add app.py (Gradio example)
cat > app.py << 'EOF'
import gradio as gr
from transformers import pipeline

pipe = pipeline("text-generation", model="gpt2")

def generate(prompt):
    return pipe(prompt, max_new_tokens=100)[0]["generated_text"]

gr.Interface(fn=generate, inputs="text", outputs="text").launch()
EOF

# Declare the extra dependencies the app imports
printf 'transformers\ntorch\n' > requirements.txt

# Push to deploy
git add . && git commit -m "Add app" && git push

Tokenizer Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")  # Instruct variant, so a chat template is defined

# Basic encode/decode
tokens = tokenizer("Hello world!", return_tensors="pt")
text = tokenizer.decode(tokens["input_ids"][0])

# Chat template (instruction models)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

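# Batch tokenization with padding/truncation
# Some tokenizers (e.g. Llama) define no pad token, and padding fails without one
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
batch = tokenizer(
    ["Hello", "A longer sentence here"],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)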
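
Padding and truncation turn variable-length token lists into a rectangular batch plus an attention mask. The sketch below shows the idea on made-up token IDs in plain Python. `pad_batch` is a hypothetical helper, and real tokenizers may pad on the left instead, depending on `padding_side`.

```python
def pad_batch(sequences, pad_id=0, max_length=512):
    """Truncate to max_length, then right-pad to the longest sequence in the batch."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs standing in for "Hello" and "A longer sentence here".
example = pad_batch([[101, 7592], [101, 1037, 2936, 6251, 2182]])
print(example["input_ids"])
print(example["attention_mask"])
```

The attention mask is what lets the model skip padded positions, which is why it should be passed along with `input_ids` during generation.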

Tips and Best Practices

Topic	Recommendation
Model format	Prefer `.safetensors` over `.bin` for security and speed
Quantization	Use 4-bit (bitsandbytes) for 7B+ models on consumer GPUs
Cache management	Set `HF_HOME` to a disk with ample space; run `huggingface-cli scan-cache` regularly
Offline use	Set `HF_HUB_OFFLINE=1` (or `TRANSFORMERS_OFFLINE=1`) after download so only cached files are used and no network requests are made
Device mapping	Use `device_map="auto"` for multi-GPU; `device_map="cuda:0"` for single GPU
Batch inference	Pass a dataset or generator to the pipeline and set its `batch_size` parameter for throughput
Model cards	Always fill out model card metadata for discoverability and reproducibility
Gated models	Accept terms on Hub web UI first; then `login()` with token that has read access
PEFT merging	Call `model = model.merge_and_unload()` before export to bake LoRA weights into the base model (the merged model is the return value)
Reproducibility	Pin library versions: `transformers==4.x.x datasets==2.x.x` in requirements
Memory	Delete model references, then call `gc.collect()` and `torch.cuda.empty_cache()` between large model loads
Streaming	Use `TextIteratorStreamer` from `transformers` for token-by-token streaming
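
The quantization tip above follows from simple arithmetic on weight storage. The following back-of-the-envelope sketch counts weights only and ignores activations, the KV cache, and quantization overhead, so real usage will be higher. `weight_memory_gib` is a hypothetical helper written for this example.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# A 7B-parameter model at common precisions.
for label, bits in [("fp32", 32), ("bf16/fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gib(7e9, bits):5.1f} GiB")
```

At 16 bits a 7B model needs roughly 13 GiB for weights alone, while 4-bit storage brings that under 4 GiB, which is what makes such models fit on consumer GPUs.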