# Data Cleaning & OpenRefine

## 1. OpenRefine Overview

- Free, open-source tool for **interactive data cleaning and reconciliation**
- Runs locally in the browser (not cloud-based)
- Works like a spreadsheet but is far more powerful for messy, inconsistent data
- ❌ NOT for model training
- ❌ NOT for visual dashboarding
- ❌ NOT for source/version control

---

## 2. Data Quality Issues

| Issue | Description | Example |
| :--- | :--- | :--- |
| **Inconsistent naming** | Same entity spelled differently | "Apple", "Apple Inc.", "APPLE" ✅ |
| **Missing values** | NaN, null, empty cells | Rating with no review text |
| **Duplicate records** | Same record appears multiple times | Same customer, different format |
| **Wrong data types** | Numbers stored as text (or vice versa) | "001234" parsed as the number 1234 |
| **Outliers** | Extreme values | Age = 999 |
| **Structural errors** | Inconsistent formats | "01/12/2024" vs "2024-12-01" |
| **Extra whitespace** | Leading/trailing spaces | " Apple " not equal to "Apple" |
| **Mixed units** | Different units in same column | USD vs EUR |

- ✅ **Inconsistent naming/capitalization** is the most common data quality issue (exam answer)

---

## 3. Text Clustering in OpenRefine (Exam Critical)

Text clustering automatically groups similar values so you can merge them into one canonical form.
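The key-collision idea behind this can be sketched in a few lines of Python (a rough approximation, not OpenRefine's exact implementation): compute a normalized key for each value and group the raw values whose keys collide.

```python
import re
from collections import defaultdict

def fingerprint(value):
    """Key-collision fingerprint: lowercase, drop punctuation, sort unique tokens."""
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))

def cluster(values):
    """Group raw values that collide on the same fingerprint key."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    # Only keys with more than one distinct raw value are interesting clusters
    return [vs for vs in groups.values() if len(set(vs)) > 1]

names = ["Apple", "APPLE", " apple ", "Apple Inc.", "apple inc"]
print(cluster(names))  # [['Apple', 'APPLE', ' apple '], ['Apple Inc.', 'apple inc']]
```

Each cluster can then be replaced by one canonical form, which is exactly what OpenRefine's merge step does interactively.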
### Clustering Algorithms

| Algorithm | How it Works | Best For |
| :--- | :--- | :--- |
| **Fingerprinting** | Lowercase, remove punctuation, sort words, remove duplicates | Simple case/spacing variations ✅ |
| **N-gram fingerprint** | Compares character n-grams (bigrams, trigrams) | Typos and slight spelling differences |
| **Levenshtein (edit distance)** | Counts minimum edits (insert/delete/replace) | Manual typos ("Microosft") |
| **PPM** | Compression-based similarity | Complex variations |
| **Phonetic** | Groups by how words sound (Metaphone) | Name variations ("John" vs "Jon") |

- ✅ **Fingerprinting + n-gram** = most effective for company name standardization (exam answer)
- ❌ Simple find-and-replace misses variations
- ❌ Regular expressions alone miss fuzzy matches
- ❌ Manual standardization is not scalable

### Fingerprinting Example

```
"Microsoft"   -> "microsoft"
"MICROSOFT "  -> "microsoft"
" microsoft " -> "microsoft"
-> Same fingerprint: "microsoft" -> all get merged ✅
```

Note that "Microsoft Corp" fingerprints to "corp microsoft" while "MICROSOFT CORPORATION" fingerprints to "corporation microsoft" - different keys, so fingerprinting alone does NOT merge abbreviation variants. That is why it is paired with n-gram or edit-distance methods.

---

## 4. GREL - General Refine Expression Language

Used to transform cell values programmatically in OpenRefine.
### Core Functions

| Function | Effect | Notes |
| :--- | :--- | :--- |
| `value.toLowercase()` | "APPLE INC" -> "apple inc" | Normalize before clustering |
| `value.toUppercase()` | "apple" -> "APPLE" | |
| `value.toTitlecase()` | "apple inc" -> "Apple Inc" ✅ | Final standardized form |
| `value.trim()` | Removes leading/trailing spaces ✅ | " Apple " -> "Apple" |
| `value.replace("a","b")` | Replaces a substring | Replace "Corp" with "Corporation" |
| `value.split(",")[0]` | Splits and takes the first part | "Smith, John" -> "Smith" |
| `value.split(" ")[1]` | Takes the second word | |
| `value.length()` | Character count | Check for empty values |
| `value.parseJson().get("key")` | Parses JSON in a cell | For API response columns |
| `value.toDate()` | Converts a string to a date | |
| `isBlank(value)` | Checks if empty | |

### Chaining (Exam Critical)

```javascript
// Standardize to Title Case - ✅ exam answer
value.trim().toLowercase().toTitlecase()

// Extract domain from email
value.split("@")[1]

// Remove non-numeric characters
value.replace(/\D/, "")

// Conditional
if(value.length() < 5, "Short", value)
```

---

## 5. Faceting - Group & Aggregate

Faceting groups rows by value for exploration and bulk editing.

| Facet Type | Use |
| :--- | :--- |
| Text facet | Group by text value |
| Numeric facet | Filter by number range |
| Timeline facet | Filter by date range |
| Custom facet | Use a GREL expression |

### Workflow for Duplicate Aggregation (Exam Answer)

```
1. Standardize names: toLowercase -> toTitlecase
2. Apply a Text Facet on the company name column
3. Review the grouped values and their counts
4. Use clustering to merge similar values
5. Group by name -> aggregate/sum revenue
```

- ✅ Use faceting to group and aggregate/sum values (exam answer)
- ❌ Delete all duplicates - loses revenue data
- ❌ Leave duplicates - incorrect totals

---

## 6. Operations as JSON - Reproducibility (Exam Critical)

Every cleaning step in OpenRefine is recorded as an operation. These can be exported as JSON and replayed.
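For reference, an exported operation history is a JSON array of step objects. A trimmed-down sketch (field names recalled from OpenRefine exports and may vary by version; the `engineConfig` block is omitted here):

```json
[
  {
    "op": "core/text-transform",
    "columnName": "company",
    "expression": "grel:value.trim().toLowercase().toTitlecase()",
    "onError": "keep-original",
    "description": "Text transform on cells in column company"
  }
]
```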
```
Workflow:
1. Clean dataset A in OpenRefine
2. Undo/Redo -> Extract -> copy JSON
3. Open dataset B in OpenRefine
4. Undo/Redo -> Apply -> paste JSON
5. Dataset B is cleaned with the same steps ✅
```

- ✅ Export operations as JSON -> replay on similar datasets (exam answer)
- ❌ Operations do NOT automatically clean new data
- ❌ Operations do NOT make the software faster
- ❌ Operations do NOT delete original data

---

## 7. Excel & Spreadsheet Traps (Exam Critical)

### The Leading Zeros / Date Destruction Trap

**Problem**: Opening a CSV in Excel destroys leading zeros and auto-converts strings to dates.

- `"00123"` becomes `123`
- `"3/10"` becomes `March 10`

**Fix: Text-to-Columns Wizard**

```
1. Select the column(s) with problem data
2. Data -> Text to Columns
3. Choose the delimiter (Tab, Comma, etc.)
4. Step 3 of the wizard: select the column
5. Set "Column data format" to 'Text' ✅ (NOT 'General')
6. Finish
```

- ✅ Set column format to 'Text' in the Text-to-Columns wizard (exam answer)
- ❌ 'General' format will auto-convert and destroy leading zeros

---

## 8. Deduplication Workflow

```python
import pandas as pd

# Step 1: Normalize
def normalize_name(name):
    if ',' in name:  # "Smith, John" -> "John Smith"
        parts = name.split(',')
        name = f"{parts[1].strip()} {parts[0].strip()}"
    return name.strip().title()

# Step 2: Apply normalization
df['normalized'] = df['name'].apply(normalize_name)

# Step 3: Deduplicate
df = df.drop_duplicates(subset=['normalized'])

# Keep the LATEST record for each customer (assign the result back!)
df = df.sort_values('timestamp').drop_duplicates(subset=['customer_id'], keep='last')
```

- ✅ **Normalize first, then deduplicate** (exam answer)
- ❌ Deduplicate before normalizing - misses most duplicates

---

## 9. Fuzzy Matching

Used when exact matching fails due to word order, abbreviations, or typos.
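Before reaching for a third-party library, note that Python's standard library `difflib` provides a comparable similarity ratio; a minimal sketch:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Ratio in [0, 1] based on longest matching blocks (stdlib difflib)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Abbreviation: scores high because one string is a prefix of the other
print(similarity("Microsoft Corp", "Microsoft Corporation"))

# Word order: a plain character ratio penalizes reordered words heavily
print(similarity("Blue T-Shirt", "T-Shirt Blue"))
```

The second score is much lower than the first, which is exactly the word-order weakness that token-based scorers (below) are designed to fix.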
```python
from fuzzywuzzy import fuzz  # also published under the name "thefuzz"

# Simple ratio (order matters)
fuzz.ratio("Microsoft Corp", "Microsoft Corporation")  # ~80

# Token sort ratio (order-independent) ✅
fuzz.token_sort_ratio("Blue T-Shirt", "T-Shirt Blue")  # 100

# Token set ratio (handles extra words)
fuzz.token_set_ratio("Blue T-Shirt Size M", "tshirt blue m")  # high score

# Typical threshold for a match: score > 85
```

### Text Matching Challenges

| Challenge | Example |
| :--- | :--- |
| Word order | "Blue T-Shirt" vs "T-Shirt Blue" |
| Abbreviations | "T-Shirt" vs "tshirt" |
| Case differences | "BLUE" vs "blue" |
| Synonyms | "large" vs "L" |
| Extra words | "Size M" vs "M" |

- ✅ Effective text comparison must account for word order, abbreviations, and synonyms (exam answer)

---

## 10. String Normalization (Python)

```python
import re

def normalize_text(text):
    text = text.lower()                    # lowercase
    text = re.sub(r'[^\w\s]', '', text)    # remove punctuation
    text = ' '.join(sorted(text.split()))  # sort words alphabetically
    return text.strip()

# Examples:
# normalize_text("Microsoft Corp") -> "corp microsoft"
# normalize_text("MICROSOFT CORP") -> "corp microsoft" (same!) ✅
```

---

## 11. pandas Data Cleaning Reference

```python
import pandas as pd

# Detect issues
df.isnull().sum()                          # missing values per column ✅
df.isnull().sum() / len(df) * 100          # percentage missing
df.duplicated().sum()                      # count duplicates
df.dtypes                                  # check data types
df['col'].value_counts()                   # see value distribution

# Fix missing values
df.dropna()                                # drop rows with ANY NaN
df.dropna(subset=['review_text'])          # drop where a specific column is NaN ✅
df.dropna(how='all')                       # drop only if ALL values are NaN
df.fillna(0)                               # fill with a constant
df['col'].fillna(df['col'].mean())         # fill with the mean
df['col'].fillna(df['col'].median())       # fill with the median (robust to outliers)
df.ffill()                                 # forward fill (time series)
df.bfill()                                 # backward fill

# Fix duplicates
df.drop_duplicates()                       # all columns
df.drop_duplicates(subset=['col'])         # specific column ✅

# Fix data types
df['col'].astype('float')
df['col'].astype('str')
df['col'].astype('category')               # memory-efficient for low-cardinality columns
pd.to_datetime(df['date_col'])             # parse dates
pd.to_numeric(df['col'], errors='coerce')  # force numeric, NaN on failure

# Fix strings
df['col'].str.lower()
df['col'].str.strip()
df['col'].str.replace('Inc.', 'Inc', regex=False)  # literal match, not regex
df['col'].str.contains('pattern')          # boolean mask for filtering
```

---

## 12. Fill Strategy Selection

| Strategy | When to Use |
| :--- | :--- |
| Drop rows | Text analysis, small percentage missing |
| Fill mean | Normally distributed numeric data |
| Fill median | Skewed data, outliers present ✅ |
| Fill mode | Categorical data |
| Forward fill | Time series, ordered data |

---

## 13. Quick Reference Card

```
OpenRefine:
  Purpose    -> interactive data cleaning ✅
  Clustering -> fingerprinting + n-gram ✅
  GREL       -> value.trim().toLowercase().toTitlecase() ✅
  Faceting   -> group and aggregate duplicates ✅
  Operations -> export as JSON, replay on new datasets ✅

Most common data issue:        Inconsistent naming/capitalization ✅
Excel leading zero fix:        Text-to-Columns -> set column format to 'Text' ✅
Deduplication order:           Normalize first -> then deduplicate ✅
Fuzzy matching for word order: fuzz.token_sort_ratio() ✅
Text matching challenges:      Word order, abbreviations, synonyms, case ✅
```
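Tying the steps together, a minimal end-to-end sketch in pandas (hypothetical `company`/`revenue` data; normalize first, fill missing values, then aggregate duplicates rather than delete them):

```python
import pandas as pd

# Hypothetical messy data illustrating the issues above
df = pd.DataFrame({
    "company": [" Apple ", "APPLE", "Apple", "Microsoft Corp"],
    "revenue": [100, 150, None, 200],
})

# 1. Normalize first (trim, lowercase, strip punctuation)
df["company"] = (df["company"].str.strip()
                              .str.lower()
                              .str.replace(r"[^\w\s]", "", regex=True))

# 2. Fill missing numeric values with the median (robust to outliers)
df["revenue"] = df["revenue"].fillna(df["revenue"].median())

# 3. Aggregate duplicates instead of deleting them (preserves revenue)
totals = df.groupby("company", as_index=False)["revenue"].sum()
print(totals)
```

The three "apple" variants collapse into one row whose revenue is the sum of all three records, matching the faceting-and-aggregate workflow rather than the lossy "delete duplicates" approach.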