# HuggingFace Ecosystem

> HuggingFace is the central hub for open-source AI in 2026: 800k+ models, `transformers`, `datasets`, `diffusers`, `peft`, and Spaces for deployment, all under one roof.

---

## 1. Ecosystem Overview

```
HuggingFace Hub
├── Models   : 800k+ pretrained models (BERT, Llama, Mistral, Flux, Whisper...)
├── Datasets : 150k+ datasets (GLUE, WikiText, HumanEval...)
├── Spaces   : Live app hosting (Gradio / Streamlit, free GPU tier)
└── Papers   : arXiv papers linked to model cards

Key Libraries:
  transformers    : load + run any model (NLP, vision, audio, multimodal)
  datasets        : fast dataset loading & processing (Arrow-backed)
  tokenizers      : fast Rust-backed tokenization
  peft            : parameter-efficient fine-tuning (LoRA, QLoRA, prompt tuning)
  accelerate      : distributed training (multi-GPU, TPU)
  diffusers       : image generation (Stable Diffusion, FLUX)
  evaluate        : metrics (BLEU, ROUGE, accuracy, F1)
  huggingface_hub : programmatic Hub access (upload, download, search)
```

---

## 2. `pipeline()` : Quickest Way to Run a Model

```python
from transformers import pipeline

# Text generation
gen = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct")
out = gen("Write a haiku about Python:", max_new_tokens=50)
print(out[0]['generated_text'])

# Sentiment analysis (default model = distilbert)
classifier = pipeline("sentiment-analysis")
result = classifier("FastAPI makes me happy!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Named Entity Recognition
ner = pipeline("ner", grouped_entities=True)
ner("Sundar Pichai works at Google in Mountain View.")
# [{'entity_group': 'PER', 'word': 'Sundar Pichai', 'score': 0.99}...]

# Question Answering
qa = pipeline("question-answering")
qa(question="Who founded Apple?",
   context="Apple was founded by Steve Jobs in 1976 in California.")
# {'answer': 'Steve Jobs', 'score': 0.98}

# Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summarizer("Long article text...", max_length=130, min_length=30)

# Translation
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
translator("Hello, how are you?")

# Zero-shot classification (no training needed!)
clf = pipeline("zero-shot-classification")
clf("I love playing football on weekends",
    candidate_labels=["sports", "cooking", "technology"])
# {'labels': ['sports', 'technology', 'cooking'], 'scores': [0.97, 0.02, 0.01]}

# Image classification
img_clf = pipeline("image-classification", model="google/vit-base-patch16-224")
img_clf("cat.jpg")

# Speech-to-text (Whisper)
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
asr("audio.mp3")

# Specify device
pipe = pipeline("text-generation", model="...", device="cuda")      # GPU
pipe = pipeline("text-generation", model="...", device="cpu")       # CPU (default)
pipe = pipeline("text-generation", model="...", device_map="auto")  # best available
```

### Pipeline Task Names (Exam Reference)

| Task | `pipeline()` string |
| :--- | :--- |
| Text generation | `"text-generation"` |
| Text classification / sentiment | `"text-classification"` or `"sentiment-analysis"` |
| NER / token classification | `"ner"` or `"token-classification"` |
| Question answering | `"question-answering"` |
| Summarization | `"summarization"` |
| Translation | `"translation_en_to_fr"` (language pair in name) |
| Fill-mask (BERT-style) | `"fill-mask"` |
| Zero-shot classification | `"zero-shot-classification"` |
| Feature extraction (embeddings) | `"feature-extraction"` |
| Image classification | `"image-classification"` |
| Object detection | `"object-detection"` |
| Image-to-text / captioning | `"image-to-text"` |
| Speech recognition | `"automatic-speech-recognition"` |
| Text-to-speech | `"text-to-speech"` |
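Two entries in the table that aren't demonstrated above are `fill-mask` and `feature-extraction`. A minimal sketch; the model choices and the printed outputs are illustrative:

```python
from transformers import pipeline

# Fill-mask: predict the masked token; the mask string depends on the tokenizer ([MASK] for BERT)
fill = pipeline("fill-mask", model="bert-base-uncased")
fill("Paris is the [MASK] of France.")
# e.g. [{'token_str': 'capital', 'score': 0.97, ...}, ...]

# Feature extraction: raw last hidden states, usable as token-level embeddings
feat = pipeline("feature-extraction", model="bert-base-uncased")
vecs = feat("HuggingFace pipelines are convenient.")
# nested list of floats with shape (1, num_tokens, hidden_size)
```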
---

## 3. AutoModel + AutoTokenizer : Full Control

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize input
inputs = tokenizer(
    "I love this movie!",
    return_tensors="pt",   # "pt"=PyTorch, "tf"=TensorFlow, "np"=numpy
    truncation=True,
    max_length=512,
    padding=True
)
# inputs = {'input_ids': tensor([[...]]), 'attention_mask': tensor([[...]])}

# Run model
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits                     # raw scores
probs = torch.softmax(logits, dim=-1)       # convert to probabilities
pred_class = probs.argmax().item()          # 0=NEGATIVE, 1=POSITIVE
label = model.config.id2label[pred_class]   # "POSITIVE"
```

### AutoModel Class → Task Mapping

| Task | AutoModel class |
| :--- | :--- |
| Sequence classification | `AutoModelForSequenceClassification` |
| Token classification (NER) | `AutoModelForTokenClassification` |
| Question answering | `AutoModelForQuestionAnswering` |
| Causal LM (text generation) | `AutoModelForCausalLM` |
| Seq2Seq (summarization, translation) | `AutoModelForSeq2SeqLM` |
| Masked LM (BERT fill-mask) | `AutoModelForMaskedLM` |
| Feature extraction (no head) | `AutoModel` |
| Image classification | `AutoModelForImageClassification` |
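For `AutoModelForCausalLM`, a forward pass only returns next-token logits; text comes from `model.generate()`. A minimal sketch (the model name and sampling settings are just examples):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "microsoft/Phi-3-mini-4k-instruct"   # any causal LM repo works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Write a haiku about Python:", return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,     # sample instead of greedy decoding
        temperature=0.7,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```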
---

## 4. Tokenization Deep Dive

```python
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Basic tokenization
tokens = tokenizer.tokenize("Hello world!")     # ['hello', 'world', '!']
ids = tokenizer.convert_tokens_to_ids(tokens)   # [7592, 2088, 999]

# Full encoding (model-ready)
enc = tokenizer("Hello world!", return_tensors="pt")
# enc.input_ids      → token IDs incl. [CLS]=101, [SEP]=102
# enc.attention_mask → 1 for real tokens, 0 for padding

# Batch encoding with padding + truncation
batch = tokenizer(
    ["Short sentence.", "A much longer sentence that needs truncation..."],
    padding=True,      # pad shorter to match longest ✅
    truncation=True,   # truncate if > max_length
    max_length=128,
    return_tensors="pt"
)

# Decode back
tokenizer.decode([7592, 2088, 999])   # "hello world !"
tokenizer.decode(enc.input_ids[0])    # "[CLS] hello world ! [SEP]"

# Special tokens
tokenizer.cls_token    # "[CLS]"
tokenizer.sep_token    # "[SEP]"
tokenizer.pad_token    # "[PAD]"
tokenizer.unk_token    # "[UNK]"
tokenizer.vocab_size   # e.g. 30522 for BERT
```

---

## 5. Datasets Library

```python
from datasets import load_dataset, Dataset
import pandas as pd

# Load from Hub
dataset = load_dataset("imdb")   # DatasetDict with 'train', 'test' splits

# Access splits
train = dataset['train']    # Dataset object
print(train[0])             # first example as dict
print(train['text'][:3])    # first 3 texts (column access)
print(train.features)       # schema: {'text': Value('string'), 'label': ClassLabel}

# Filter + Map (Arrow-backed, memory-mapped : very fast)
filtered = train.filter(lambda x: len(x['text']) > 100)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_fn(batch):
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=128)

tokenized = train.map(tokenize_fn, batched=True)   # batched=True → much faster ✅

# Convert to/from pandas
df = train.to_pandas()
ds = Dataset.from_pandas(df)

# Create dataset from lists
ds = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})

# Load local files
ds = load_dataset("csv", data_files="data.csv")
ds = load_dataset("json", data_files="data.jsonl")
ds = load_dataset("parquet", data_files="data.parquet")

# Shuffle + split
ds = ds.shuffle(seed=42)
split = ds.train_test_split(test_size=0.2)
```

---

## 6. HuggingFace Hub : Model Discovery & Download

```python
from huggingface_hub import hf_hub_download, snapshot_download, list_models
from huggingface_hub import login, HfApi

# Login (needed for gated models like Llama)
login(token="hf_xxxxxxxxxxxxx")

# Download a single file
path = hf_hub_download(
    repo_id="google/flan-t5-base",
    filename="config.json"
)

# Download entire model to local cache
snapshot_download(repo_id="mistralai/Mistral-7B-v0.1")

# Search models programmatically
api = HfApi()
models = api.list_models(filter="text-classification", sort="downloads", limit=10)
for m in models:
    print(m.id, m.downloads)

# Upload model to Hub
api.upload_file(
    path_or_fileobj="model.pkl",
    path_in_repo="model.pkl",
    repo_id="your-username/my-model",
    repo_type="model"
)

# Model card info
from huggingface_hub import model_info
info = model_info("bert-base-uncased")
print(info.tags, info.pipeline_tag, info.downloads)
```

### Cache Location

```bash
# Default HF cache
~/.cache/huggingface/hub/

# Override cache location
export HF_HOME=/path/to/custom/cache
export TRANSFORMERS_CACHE=/path/to/custom/cache

# List cached models
huggingface-cli scan-cache

# Delete specific model from cache
huggingface-cli delete-cache
```
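To load strictly from this local cache with no network calls (air-gapped or rate-limited machines), `from_pretrained` accepts `local_files_only=True`, and the `HF_HUB_OFFLINE` environment variable does the same process-wide. A minimal sketch, assuming `bert-base-uncased` is already cached:

```python
from transformers import AutoModel, AutoTokenizer

# Succeeds only if the files already exist under the cache paths shown above
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", local_files_only=True)
model = AutoModel.from_pretrained("bert-base-uncased", local_files_only=True)

# Process-wide equivalent (e.g. in a Dockerfile or CI job):
#   export HF_HUB_OFFLINE=1
```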
---

## 7. Inference API (Free & Serverless)

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
headers = {"Authorization": "Bearer hf_xxxxx"}

def query(payload, url=API_URL):
    response = requests.post(url, headers=headers, json=payload)
    return response.json()

output = query({"inputs": "This course is amazing!"})
# [[{'label': 'POSITIVE', 'score': 0.999}]]

# Text generation (point query() at a different model URL)
gen_url = "https://api-inference.huggingface.co/models/gpt2"
out = query({"inputs": "The future of AI is",
             "parameters": {"max_new_tokens": 50}}, url=gen_url)
```

### Using `huggingface_hub.InferenceClient` (Better)

```python
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_xxxxx")

# Text generation
result = client.text_generation(
    "Write a Python function to sort a list",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    max_new_tokens=200
)

# Chat completion (OpenAI-compatible)
messages = [{"role": "user", "content": "Explain transformers in 2 sentences"}]
result = client.chat_completion(
    messages=messages,
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_tokens=200
)
print(result.choices[0].message.content)
```

---

## 8. Fine-Tuning with Trainer API

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import load_dataset
import numpy as np
import evaluate

# Load model with classification head (num_labels=2)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Prepare data
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=256)

tokenized_ds = dataset.map(tokenize, batched=True)

# Training arguments
args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=True,          # mixed precision (saves GPU memory) ✅
    report_to="none",   # disable wandb; use "wandb" to enable
)

# Metrics
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=preds, references=labels)

# Train
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.save_model("./fine-tuned-bert")
```
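After `trainer.save_model()`, the output directory can be loaded like any Hub repo id. A quick sketch of reusing it for inference; the `./fine-tuned-bert` path comes from the example above, and the tokenizer is passed explicitly because it was not given to `Trainer` and so is not in the saved directory:

```python
from transformers import pipeline, AutoTokenizer

clf = pipeline(
    "text-classification",
    model="./fine-tuned-bert",
    tokenizer=AutoTokenizer.from_pretrained("bert-base-uncased"),
)
clf("This movie was a complete waste of time.")
# e.g. [{'label': 'LABEL_0', 'score': 0.98}]  (LABEL_0/LABEL_1 unless id2label is set)
```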
---

## 9. PEFT : Parameter-Efficient Fine-Tuning (LoRA)

LoRA freezes the original model and trains small low-rank matrices instead : roughly 100x fewer trainable parameters.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

# Base model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# LoRA config
lora_config = LoraConfig(
    r=16,                                  # rank of low-rank matrices (4-64 typical)
    lora_alpha=32,                         # scaling factor (= r * 2 is common)
    target_modules=["q_proj", "v_proj"],   # which layers to apply LoRA
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Wrap model with LoRA
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 8,034,877,440 || trainable%: 0.05%

# Train with Trainer as usual...

# Save only LoRA weights (very small)
peft_model.save_pretrained("./lora-weights")

# Load: base model + LoRA adapter
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = PeftModel.from_pretrained(base_model, "./lora-weights")
```

### LoRA Key Concepts

| Concept | Explanation |
| :--- | :--- |
| **r (rank)** | Dimension of low-rank decomposition. Lower = fewer params, less capacity |
| **lora_alpha** | Scaling factor; effective LR scaled by alpha/r |
| **target_modules** | Layers to apply LoRA (attention Q/K/V/O projections usually) |
| **QLoRA** | LoRA + 4-bit quantization → 8B model fits in ~6GB VRAM (see the sketch after section 12) |
| **Merge & unload** | `model.merge_and_unload()` → merge LoRA into base model for inference |

---

## 10. HuggingFace Spaces (Deployment)

Spaces = free hosting for Gradio/Streamlit apps, with optional GPU.

```
Create a Space:
  1. Go to huggingface.co/new-space
  2. Choose SDK: Gradio or Streamlit
  3. Push code via git or web UI

Minimum files needed:
  app.py            ← your Gradio/Streamlit app
  requirements.txt  ← dependencies

Free tier: CPU-only, 16GB RAM
GPU tier:  T4-small (free limited), A10G (paid)
```

```python
# app.py : minimal Gradio Space
import gradio as gr
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

def analyze(text):
    result = classifier(text)[0]
    return f"{result['label']} ({result['score']:.2%})"

demo = gr.Interface(
    fn=analyze,
    inputs=gr.Textbox(label="Text"),
    outputs=gr.Label(label="Sentiment"),
    title="Sentiment Analyzer"
)
demo.launch()
```

---

## 11. Embeddings with Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 80MB, fast, good quality

sentences = ["I love coding in Python",
             "Python programming is great",
             "I enjoy cooking"]
embeddings = model.encode(sentences)   # shape: (3, 384)

# Cosine similarity
from sentence_transformers import util
scores = util.cos_sim(embeddings[0], embeddings)   # compare first to all
# tensor([[1.0, 0.85, 0.21]])
# "I love coding" ↔ "Python programming is great" → 0.85 (high similarity) ✅
# "I love coding" ↔ "I enjoy cooking"             → 0.21 (low similarity) ✅

# Semantic search
query = "best way to write Python APIs"
query_emb = model.encode(query)
hits = util.semantic_search(query_emb, embeddings, top_k=2)
```

---

## 12. Quantization : Running Large Models on CPU/Small GPU

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization (QLoRA / inference)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto"
)
# 8B model: FP32=32GB → BF16=16GB → 8-bit=8GB → 4-bit=4-5GB ✅
```
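QLoRA (mentioned in the section 9 table) is exactly this 4-bit loading combined with a LoRA adapter. A minimal sketch, assuming the 4-bit `model` from the block above and `peft` + `bitsandbytes` installed; `prepare_model_for_kbit_training` re-casts a few layers so training on a quantized base stays stable:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

# 'model' is the 4-bit quantized model loaded above
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Only the small LoRA matrices train; the frozen 4-bit base keeps VRAM usage low
```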
---

## 13. Quick Reference

```
Popular models by task:
  Text gen       → meta-llama/Meta-Llama-3-8B-Instruct, mistralai/Mistral-7B-Instruct-v0.2
  Embeddings     → sentence-transformers/all-MiniLM-L6-v2, BAAI/bge-large-en
  NER            → dslim/bert-base-NER
  Sentiment      → distilbert-base-uncased-finetuned-sst-2-english
  Summarization  → facebook/bart-large-cnn, google/pegasus-xsum
  Translation    → Helsinki-NLP/opus-mt-{src}-{tgt}
  Speech-to-text → openai/whisper-large-v3
  Image-to-text  → Salesforce/blip-image-captioning-base
  Code gen       → Qwen/Qwen2.5-Coder-7B-Instruct, deepseek-ai/deepseek-coder-6.7b-instruct
```

| Need | Solution |
| :--- | :--- |
| Run a model instantly, no config | `pipeline("task", model="...")` |
| Full PyTorch control | `AutoModel` + `AutoTokenizer` |
| Load large dataset fast | `load_dataset()` + `.map(batched=True)` |
| Fine-tune BERT/classifier | `Trainer` + `TrainingArguments` |
| Fine-tune LLM cheaply | `peft` LoRA + `Trainer` |
| Run 8B model on 6GB VRAM | `BitsAndBytesConfig` 4-bit |
| Deploy model as web app | HuggingFace Spaces (Gradio) |
| Call model without GPU | Inference API / InferenceClient |
| Semantic search / RAG | `sentence-transformers` |