# HuggingFace Ecosystem

> HuggingFace is the central hub for open-source AI in 2026: 800k+ models, `transformers`, `datasets`, `diffusers`, `peft`, and Spaces for deployment, all under one roof.

---

## 1. Ecosystem Overview

```
HuggingFace Hub
├── Models   : 800k+ pretrained models (BERT, Llama, Mistral, Flux, Whisper...)
├── Datasets : 150k+ datasets (GLUE, WikiText, HumanEval...)
├── Spaces   : Live app hosting (Gradio / Streamlit, free GPU tier)
└── Papers   : arXiv papers linked to model cards

Key Libraries:
  transformers    : load + run any model (NLP, vision, audio, multimodal)
  datasets        : fast dataset loading & processing (Arrow-backed)
  tokenizers      : fast Rust-backed tokenization
  peft            : parameter-efficient fine-tuning (LoRA, QLoRA, prompt tuning)
  accelerate      : distributed training (multi-GPU, TPU)
  diffusers       : image generation (Stable Diffusion, FLUX)
  evaluate        : metrics (BLEU, ROUGE, accuracy, F1)
  huggingface_hub : programmatic Hub access (upload, download, search)
```

---

## 2. `pipeline()` : Quickest Way to Run a Model

```python
from transformers import pipeline

# Text generation
gen = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct")
out = gen("Write a haiku about Python:", max_new_tokens=50)
print(out[0]['generated_text'])

# Sentiment analysis (default model = distilbert)
classifier = pipeline("sentiment-analysis")
result = classifier("FastAPI makes me happy!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Named Entity Recognition
ner = pipeline("ner", grouped_entities=True)
ner("Sundar Pichai works at Google in Mountain View.")
# [{'entity_group': 'PER', 'word': 'Sundar Pichai', 'score': 0.99}...]

# Question Answering
qa = pipeline("question-answering")
qa(question="Who founded Apple?",
   context="Apple was founded by Steve Jobs in 1976 in California.")
# {'answer': 'Steve Jobs', 'score': 0.98}

# Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summarizer("Long article text...", max_length=130, min_length=30)

# Translation
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
translator("Hello, how are you?")

# Zero-shot classification (no training needed!)
clf = pipeline("zero-shot-classification")
clf("I love playing football on weekends",
    candidate_labels=["sports", "cooking", "technology"])
# {'labels': ['sports', 'technology', 'cooking'], 'scores': [0.97, 0.02, 0.01]}

# Image classification
img_clf = pipeline("image-classification", model="google/vit-base-patch16-224")
img_clf("cat.jpg")

# Speech-to-text (Whisper)
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
asr("audio.mp3")

# Specify device
pipe = pipeline("text-generation", model="...", device="cuda")      # GPU
pipe = pipeline("text-generation", model="...", device="cpu")       # CPU (default)
pipe = pipeline("text-generation", model="...", device_map="auto")  # best available
```

### Pipeline Task Names (Exam Reference)

| Task | `pipeline()` string |
| :--- | :--- |
| Text generation | `"text-generation"` |
| Text classification / sentiment | `"text-classification"` or `"sentiment-analysis"` |
| NER / token classification | `"ner"` or `"token-classification"` |
| Question answering | `"question-answering"` |
| Summarization | `"summarization"` |
| Translation | `"translation_en_to_fr"` (language pair in name) |
| Fill-mask (BERT-style) | `"fill-mask"` |
| Zero-shot classification | `"zero-shot-classification"` |
| Feature extraction (embeddings) | `"feature-extraction"` |
| Image classification | `"image-classification"` |
| Object detection | `"object-detection"` |
| Image-to-text / captioning | `"image-to-text"` |
| Speech recognition | `"automatic-speech-recognition"` |
| Text-to-speech | `"text-to-speech"` |
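Two entries in the table that aren't demonstrated above are `fill-mask` and `feature-extraction`. A minimal sketch; the model choices and the printed outputs are illustrative:

```python
from transformers import pipeline

# Fill-mask: predict the masked token; the mask string depends on the tokenizer ([MASK] for BERT)
fill = pipeline("fill-mask", model="bert-base-uncased")
fill("Paris is the [MASK] of France.")
# e.g. [{'token_str': 'capital', 'score': 0.97, ...}, ...]

# Feature extraction: raw last hidden states, usable as token-level embeddings
feat = pipeline("feature-extraction", model="bert-base-uncased")
vecs = feat("HuggingFace pipelines are convenient.")
# nested list of floats with shape (1, num_tokens, hidden_size)
```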
---

## 3. AutoModel + AutoTokenizer : Full Control

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize input
inputs = tokenizer(
    "I love this movie!",
    return_tensors="pt",   # "pt"=PyTorch, "tf"=TensorFlow, "np"=numpy
    truncation=True,
    max_length=512,
    padding=True
)
# inputs = {'input_ids': tensor([[...]]), 'attention_mask': tensor([[...]])}

# Run model
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits                     # raw scores
probs = torch.softmax(logits, dim=-1)       # convert to probabilities
pred_class = probs.argmax().item()          # 0=NEGATIVE, 1=POSITIVE
label = model.config.id2label[pred_class]   # "POSITIVE"
```

### AutoModel Class → Task Mapping

| Task | AutoModel class |
| :--- | :--- |
| Sequence classification | `AutoModelForSequenceClassification` |
| Token classification (NER) | `AutoModelForTokenClassification` |
| Question answering | `AutoModelForQuestionAnswering` |
| Causal LM (text generation) | `AutoModelForCausalLM` |
| Seq2Seq (summarization, translation) | `AutoModelForSeq2SeqLM` |
| Masked LM (BERT fill-mask) | `AutoModelForMaskedLM` |
| Feature extraction (no head) | `AutoModel` |
| Image classification | `AutoModelForImageClassification` |
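For `AutoModelForCausalLM`, a forward pass only returns next-token logits; text comes from `model.generate()`. A minimal sketch (the model name and sampling settings are just examples):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "microsoft/Phi-3-mini-4k-instruct"   # any causal LM repo works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Write a haiku about Python:", return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,     # sample instead of greedy decoding
        temperature=0.7,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```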
---

## 4. Tokenization Deep Dive

```python
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Basic tokenization
tokens = tokenizer.tokenize("Hello world!")     # ['hello', 'world', '!']
ids = tokenizer.convert_tokens_to_ids(tokens)   # [7592, 2088, 999]

# Full encoding (model-ready)
enc = tokenizer("Hello world!", return_tensors="pt")
# enc.input_ids      → token IDs incl. [CLS]=101, [SEP]=102
# enc.attention_mask → 1 for real tokens, 0 for padding

# Batch encoding with padding + truncation
batch = tokenizer(
    ["Short sentence.", "A much longer sentence that needs truncation..."],
    padding=True,      # pad shorter to match longest ✅
    truncation=True,   # truncate if > max_length
    max_length=128,
    return_tensors="pt"
)

# Decode back
tokenizer.decode([7592, 2088, 999])   # "hello world !"
tokenizer.decode(enc.input_ids[0])    # "[CLS] hello world ! [SEP]"

# Special tokens
tokenizer.cls_token    # "[CLS]"
tokenizer.sep_token    # "[SEP]"
tokenizer.pad_token    # "[PAD]"
tokenizer.unk_token    # "[UNK]"
tokenizer.vocab_size   # e.g. 30522 for BERT
```

---

## 5. Datasets Library

```python
from datasets import load_dataset, Dataset
import pandas as pd

# Load from Hub
dataset = load_dataset("imdb")   # DatasetDict with 'train', 'test' splits

# Access splits
train = dataset['train']    # Dataset object
print(train[0])             # first example as dict
print(train['text'][:3])    # first 3 texts (column access)
print(train.features)       # schema: {'text': Value('string'), 'label': ClassLabel}

# Filter + Map (Arrow-backed, memory-mapped : very fast)
filtered = train.filter(lambda x: len(x['text']) > 100)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_fn(batch):
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=128)

tokenized = train.map(tokenize_fn, batched=True)   # batched=True → much faster ✅

# Convert to/from pandas
df = train.to_pandas()
ds = Dataset.from_pandas(df)

# Create dataset from lists
ds = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})

# Load local files
ds = load_dataset("csv", data_files="data.csv")
ds = load_dataset("json", data_files="data.jsonl")
ds = load_dataset("parquet", data_files="data.parquet")

# Shuffle + split
ds = ds.shuffle(seed=42)
split = ds.train_test_split(test_size=0.2)
```

---

## 6. HuggingFace Hub : Model Discovery & Download

```python
from huggingface_hub import hf_hub_download, snapshot_download, list_models
from huggingface_hub import login, HfApi

# Login (needed for gated models like Llama)
login(token="hf_xxxxxxxxxxxxx")

# Download a single file
path = hf_hub_download(
    repo_id="google/flan-t5-base",
    filename="config.json"
)

# Download entire model to local cache
snapshot_download(repo_id="mistralai/Mistral-7B-v0.1")

# Search models programmatically
api = HfApi()
models = api.list_models(filter="text-classification", sort="downloads", limit=10)
for m in models:
    print(m.id, m.downloads)

# Upload model to Hub
api.upload_file(
    path_or_fileobj="model.pkl",
    path_in_repo="model.pkl",
    repo_id="your-username/my-model",
    repo_type="model"
)

# Model card info
from huggingface_hub import model_info
info = model_info("bert-base-uncased")
print(info.tags, info.pipeline_tag, info.downloads)
```

### Cache Location

```bash
# Default HF cache
~/.cache/huggingface/hub/

# Override cache location
export HF_HOME=/path/to/custom/cache
export TRANSFORMERS_CACHE=/path/to/custom/cache

# List cached models
huggingface-cli scan-cache

# Delete specific model from cache
huggingface-cli delete-cache
```
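To load strictly from this local cache with no network calls (air-gapped or rate-limited machines), `from_pretrained` accepts `local_files_only=True`, and the `HF_HUB_OFFLINE` environment variable does the same process-wide. A minimal sketch, assuming `bert-base-uncased` is already cached:

```python
from transformers import AutoModel, AutoTokenizer

# Succeeds only if the files already exist under the cache paths shown above
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", local_files_only=True)
model = AutoModel.from_pretrained("bert-base-uncased", local_files_only=True)

# Process-wide equivalent (e.g. in a Dockerfile or CI job):
#   export HF_HUB_OFFLINE=1
```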
---

## 7. Inference API (Free & Serverless)

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
headers = {"Authorization": "Bearer hf_xxxxx"}

def query(payload, url=API_URL):
    response = requests.post(url, headers=headers, json=payload)
    return response.json()

output = query({"inputs": "This course is amazing!"})
# [[{'label': 'POSITIVE', 'score': 0.999}]]

# Text generation (point query() at a different model URL)
gen_url = "https://api-inference.huggingface.co/models/gpt2"
out = query({"inputs": "The future of AI is",
             "parameters": {"max_new_tokens": 50}}, url=gen_url)
```

### Using `huggingface_hub.InferenceClient` (Better)

```python
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_xxxxx")

# Text generation
result = client.text_generation(
    "Write a Python function to sort a list",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    max_new_tokens=200
)

# Chat completion (OpenAI-compatible)
messages = [{"role": "user", "content": "Explain transformers in 2 sentences"}]
result = client.chat_completion(
    messages=messages,
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_tokens=200
)
print(result.choices[0].message.content)
```

---

## 8. Fine-Tuning with Trainer API

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import load_dataset
import numpy as np
import evaluate

# Load model with classification head (num_labels=2)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Prepare data
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=256)

tokenized_ds = dataset.map(tokenize, batched=True)

# Training arguments
args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=True,          # mixed precision (saves GPU memory) ✅
    report_to="none",   # disable wandb; use "wandb" to enable
)

# Metrics
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=preds, references=labels)

# Train
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.save_model("./fine-tuned-bert")
```
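After `trainer.save_model()`, the output directory can be loaded like any Hub repo id. A quick sketch of reusing it for inference; the `./fine-tuned-bert` path comes from the example above, and the tokenizer is passed explicitly because it was not given to `Trainer` and so is not in the saved directory:

```python
from transformers import pipeline, AutoTokenizer

clf = pipeline(
    "text-classification",
    model="./fine-tuned-bert",
    tokenizer=AutoTokenizer.from_pretrained("bert-base-uncased"),
)
clf("This movie was a complete waste of time.")
# e.g. [{'label': 'LABEL_0', 'score': 0.98}]  (LABEL_0/LABEL_1 unless id2label is set)
```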
---

## 9. PEFT : Parameter-Efficient Fine-Tuning (LoRA)

LoRA freezes the original model and trains small low-rank matrices instead : roughly 100x fewer trainable parameters.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

# Base model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# LoRA config
lora_config = LoraConfig(
    r=16,                                  # rank of low-rank matrices (4-64 typical)
    lora_alpha=32,                         # scaling factor (= r * 2 is common)
    target_modules=["q_proj", "v_proj"],   # which layers to apply LoRA
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Wrap model with LoRA
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 8,034,877,440 || trainable%: 0.05%

# Train with Trainer as usual...

# Save only LoRA weights (very small)
peft_model.save_pretrained("./lora-weights")

# Load: base model + LoRA adapter
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = PeftModel.from_pretrained(base_model, "./lora-weights")
```

### LoRA Key Concepts

| Concept | Explanation |
| :--- | :--- |
| **r (rank)** | Dimension of low-rank decomposition. Lower = fewer params, less capacity |
| **lora_alpha** | Scaling factor; effective LR scaled by alpha/r |
| **target_modules** | Layers to apply LoRA (attention Q/K/V/O projections usually) |
| **QLoRA** | LoRA + 4-bit quantization → 8B model fits in ~6GB VRAM (see the sketch after section 12) |
| **Merge & unload** | `model.merge_and_unload()` → merge LoRA into base model for inference |

---

## 10. HuggingFace Spaces (Deployment)

Spaces = free hosting for Gradio/Streamlit apps, with optional GPU.

```
Create a Space:
  1. Go to huggingface.co/new-space
  2. Choose SDK: Gradio or Streamlit
  3. Push code via git or web UI

Minimum files needed:
  app.py            ← your Gradio/Streamlit app
  requirements.txt  ← dependencies

Free tier: CPU-only, 16GB RAM
GPU tier:  T4-small (free limited), A10G (paid)
```

```python
# app.py : minimal Gradio Space
import gradio as gr
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

def analyze(text):
    result = classifier(text)[0]
    return f"{result['label']} ({result['score']:.2%})"

demo = gr.Interface(
    fn=analyze,
    inputs=gr.Textbox(label="Text"),
    outputs=gr.Label(label="Sentiment"),
    title="Sentiment Analyzer"
)
demo.launch()
```

---

## 11. Embeddings with Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 80MB, fast, good quality

sentences = ["I love coding in Python",
             "Python programming is great",
             "I enjoy cooking"]
embeddings = model.encode(sentences)   # shape: (3, 384)

# Cosine similarity
from sentence_transformers import util
scores = util.cos_sim(embeddings[0], embeddings)   # compare first to all
# tensor([[1.0, 0.85, 0.21]])
# "I love coding" ↔ "Python programming is great" → 0.85 (high similarity) ✅
# "I love coding" ↔ "I enjoy cooking"             → 0.21 (low similarity) ✅

# Semantic search
query = "best way to write Python APIs"
query_emb = model.encode(query)
hits = util.semantic_search(query_emb, embeddings, top_k=2)
```

---

## 12. Quantization : Running Large Models on CPU/Small GPU

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization (QLoRA / inference)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto"
)
# 8B model: FP32=32GB → BF16=16GB → 8-bit=8GB → 4-bit=4-5GB ✅
```
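QLoRA (mentioned in the section 9 table) is exactly this 4-bit loading combined with a LoRA adapter. A minimal sketch, assuming the 4-bit `model` from the block above and `peft` + `bitsandbytes` installed; `prepare_model_for_kbit_training` re-casts a few layers so training on a quantized base stays stable:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

# 'model' is the 4-bit quantized model loaded above
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Only the small LoRA matrices train; the frozen 4-bit base keeps VRAM usage low
```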
---

## 13. Quick Reference

```
Popular models by task:
  Text gen       → meta-llama/Meta-Llama-3-8B-Instruct, mistralai/Mistral-7B-Instruct-v0.2
  Embeddings     → sentence-transformers/all-MiniLM-L6-v2, BAAI/bge-large-en
  NER            → dslim/bert-base-NER
  Sentiment      → distilbert-base-uncased-finetuned-sst-2-english
  Summarization  → facebook/bart-large-cnn, google/pegasus-xsum
  Translation    → Helsinki-NLP/opus-mt-{src}-{tgt}
  Speech-to-text → openai/whisper-large-v3
  Image-to-text  → Salesforce/blip-image-captioning-base
  Code gen       → Qwen/Qwen2.5-Coder-7B-Instruct, deepseek-ai/deepseek-coder-6.7b-instruct
```

| Need | Solution |
| :--- | :--- |
| Run a model instantly, no config | `pipeline("task", model="...")` |
| Full PyTorch control | `AutoModel` + `AutoTokenizer` |
| Load large dataset fast | `load_dataset()` + `.map(batched=True)` |
| Fine-tune BERT/classifier | `Trainer` + `TrainingArguments` |
| Fine-tune LLM cheaply | `peft` LoRA + `Trainer` |
| Run 8B model on 6GB VRAM | `BitsAndBytesConfig` 4-bit |
| Deploy model as web app | HuggingFace Spaces (Gradio) |
| Call model without GPU | Inference API / InferenceClient |
| Semantic search / RAG | `sentence-transformers` |