# Data Cleaning & OpenRefine

## 1. OpenRefine Overview

- Free, open-source tool for **interactive data cleaning and reconciliation**
- Runs locally in the browser (not cloud-based)
- Works like a spreadsheet but is far more powerful for messy, inconsistent data
- ❌ NOT for model training
- ❌ NOT for visual dashboarding
- ❌ NOT for source/version control

---

## 2. Data Quality Issues

| Issue | Description | Example |
| :--- | :--- | :--- |
| **Inconsistent naming** | Same entity spelled differently | "Apple", "Apple Inc.", "APPLE" ✅ |
| **Missing values** | NaN, null, empty cells | Rating with no review text |
| **Duplicate records** | Same record appears multiple times | Same customer, different format |
| **Wrong data types** | Numbers stored as text (or vice versa) | "001234" parsed as the number 1234 |
| **Outliers** | Extreme values | Age = 999 |
| **Structural errors** | Inconsistent formats | "01/12/2024" vs "2024-12-01" |
| **Extra whitespace** | Leading/trailing spaces | " Apple " not equal to "Apple" |
| **Mixed units** | Different units in same column | USD vs EUR |

- ✅ **Inconsistent naming/capitalization** is the most common data quality issue (exam answer)

---

## 3. Text Clustering in OpenRefine (Exam Critical)

Text clustering automatically groups similar values so you can merge them into one canonical form.
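The key-collision idea behind this can be sketched in a few lines of Python (a rough approximation, not OpenRefine's exact implementation): compute a normalized key for each value and group the raw values whose keys collide.

```python
import re
from collections import defaultdict

def fingerprint(value):
    """Key-collision fingerprint: lowercase, drop punctuation, sort unique tokens."""
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))

def cluster(values):
    """Group raw values that collide on the same fingerprint key."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    # Only keys with more than one distinct raw value are interesting clusters
    return [vs for vs in groups.values() if len(set(vs)) > 1]

names = ["Apple", "APPLE", " apple ", "Apple Inc.", "apple inc"]
print(cluster(names))  # [['Apple', 'APPLE', ' apple '], ['Apple Inc.', 'apple inc']]
```

Each cluster can then be replaced by one canonical form, which is exactly what OpenRefine's merge step does interactively.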
### Clustering Algorithms

| Algorithm | How it Works | Best For |
| :--- | :--- | :--- |
| **Fingerprinting** | Lowercase, remove punctuation, sort words, remove duplicates | Simple case/spacing variations ✅ |
| **N-gram fingerprint** | Compares character n-grams (bigrams, trigrams) | Typos and slight spelling differences |
| **Levenshtein (edit distance)** | Counts minimum edits (insert/delete/replace) | Manual typos ("Microosft") |
| **PPM** | Compression-based similarity | Complex variations |
| **Phonetic** | Groups by how words sound (Metaphone) | Name variations ("John" vs "Jon") |

- ✅ **Fingerprinting + n-gram** = most effective for company name standardization (exam answer)
- ❌ Simple find-and-replace misses variations
- ❌ Regular expressions alone miss fuzzy matches
- ❌ Manual standardization is not scalable

### Fingerprinting Example

```
"Microsoft"   -> "microsoft"
"MICROSOFT "  -> "microsoft"
" microsoft " -> "microsoft"
-> Same fingerprint: "microsoft" -> all get merged ✅
```

Note that "Microsoft Corp" fingerprints to "corp microsoft" while "MICROSOFT CORPORATION" fingerprints to "corporation microsoft" - different keys, so fingerprinting alone does NOT merge abbreviation variants. That is why it is paired with n-gram or edit-distance methods.

---

## 4. GREL - General Refine Expression Language

Used to transform cell values programmatically in OpenRefine.
### Core Functions

| Function | Effect | Notes |
| :--- | :--- | :--- |
| `value.toLowercase()` | "APPLE INC" -> "apple inc" | Normalize before clustering |
| `value.toUppercase()` | "apple" -> "APPLE" | |
| `value.toTitlecase()` | "apple inc" -> "Apple Inc" ✅ | Final standardized form |
| `value.trim()` | Removes leading/trailing spaces ✅ | " Apple " -> "Apple" |
| `value.replace("a","b")` | Replaces a substring | Replace "Corp" with "Corporation" |
| `value.split(",")[0]` | Splits and takes the first part | "Smith, John" -> "Smith" |
| `value.split(" ")[1]` | Takes the second word | |
| `value.length()` | Character count | Check for empty values |
| `value.parseJson().get("key")` | Parses JSON in a cell | For API response columns |
| `value.toDate()` | Converts a string to a date | |
| `isBlank(value)` | Checks if empty | |

### Chaining (Exam Critical)

```javascript
// Standardize to Title Case - ✅ exam answer
value.trim().toLowercase().toTitlecase()

// Extract domain from email
value.split("@")[1]

// Remove non-numeric characters
value.replace(/\D/, "")

// Conditional
if(value.length() < 5, "Short", value)
```

---

## 5. Faceting - Group & Aggregate

Faceting groups rows by value for exploration and bulk editing.

| Facet Type | Use |
| :--- | :--- |
| Text facet | Group by text value |
| Numeric facet | Filter by number range |
| Timeline facet | Filter by date range |
| Custom facet | Use a GREL expression |

### Workflow for Duplicate Aggregation (Exam Answer)

```
1. Standardize names: toLowercase -> toTitlecase
2. Apply a Text Facet on the company name column
3. Review the grouped values and their counts
4. Use clustering to merge similar values
5. Group by name -> aggregate/sum revenue
```

- ✅ Use faceting to group and aggregate/sum values (exam answer)
- ❌ Delete all duplicates - loses revenue data
- ❌ Leave duplicates - incorrect totals

---

## 6. Operations as JSON - Reproducibility (Exam Critical)

Every cleaning step in OpenRefine is recorded as an operation. These can be exported as JSON and replayed.
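For reference, an exported operation history is a JSON array of step objects. A trimmed-down sketch (field names recalled from OpenRefine exports and may vary by version; the `engineConfig` block is omitted here):

```json
[
  {
    "op": "core/text-transform",
    "columnName": "company",
    "expression": "grel:value.trim().toLowercase().toTitlecase()",
    "onError": "keep-original",
    "description": "Text transform on cells in column company"
  }
]
```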
```
Workflow:
1. Clean dataset A in OpenRefine
2. Undo/Redo -> Extract -> copy JSON
3. Open dataset B in OpenRefine
4. Undo/Redo -> Apply -> paste JSON
5. Dataset B is cleaned with the same steps ✅
```

- ✅ Export operations as JSON -> replay on similar datasets (exam answer)
- ❌ Operations do NOT automatically clean new data
- ❌ Operations do NOT make the software faster
- ❌ Operations do NOT delete original data

---

## 7. Excel & Spreadsheet Traps (Exam Critical)

### The Leading Zeros / Date Destruction Trap

**Problem**: Opening a CSV in Excel destroys leading zeros and auto-converts strings to dates.

- `"00123"` becomes `123`
- `"3/10"` becomes `March 10`

**Fix: Text-to-Columns Wizard**

```
1. Select the column(s) with problem data
2. Data -> Text to Columns
3. Choose the delimiter (Tab, Comma, etc.)
4. Step 3 of the wizard: select the column
5. Set "Column data format" to 'Text' ✅ (NOT 'General')
6. Finish
```

- ✅ Set column format to 'Text' in the Text-to-Columns wizard (exam answer)
- ❌ 'General' format will auto-convert and destroy leading zeros

---

## 8. Deduplication Workflow

```python
import pandas as pd

# Step 1: Normalize
def normalize_name(name):
    if ',' in name:  # "Smith, John" -> "John Smith"
        parts = name.split(',')
        name = f"{parts[1].strip()} {parts[0].strip()}"
    return name.strip().title()

# Step 2: Apply normalization
df['normalized'] = df['name'].apply(normalize_name)

# Step 3: Deduplicate
df = df.drop_duplicates(subset=['normalized'])

# Keep the LATEST record for each customer (assign the result back!)
df = df.sort_values('timestamp').drop_duplicates(subset=['customer_id'], keep='last')
```

- ✅ **Normalize first, then deduplicate** (exam answer)
- ❌ Deduplicate before normalizing - misses most duplicates

---

## 9. Fuzzy Matching

Used when exact matching fails due to word order, abbreviations, or typos.
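Before reaching for a third-party library, note that Python's standard library `difflib` provides a comparable similarity ratio; a minimal sketch:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Ratio in [0, 1] based on longest matching blocks (stdlib difflib)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Abbreviation: scores high because one string is a prefix of the other
print(similarity("Microsoft Corp", "Microsoft Corporation"))

# Word order: a plain character ratio penalizes reordered words heavily
print(similarity("Blue T-Shirt", "T-Shirt Blue"))
```

The second score is much lower than the first, which is exactly the word-order weakness that token-based scorers (below) are designed to fix.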
```python
from fuzzywuzzy import fuzz  # also published under the name "thefuzz"

# Simple ratio (order matters)
fuzz.ratio("Microsoft Corp", "Microsoft Corporation")  # ~80

# Token sort ratio (order-independent) ✅
fuzz.token_sort_ratio("Blue T-Shirt", "T-Shirt Blue")  # 100

# Token set ratio (handles extra words)
fuzz.token_set_ratio("Blue T-Shirt Size M", "tshirt blue m")  # high score

# Typical threshold for a match: score > 85
```

### Text Matching Challenges

| Challenge | Example |
| :--- | :--- |
| Word order | "Blue T-Shirt" vs "T-Shirt Blue" |
| Abbreviations | "T-Shirt" vs "tshirt" |
| Case differences | "BLUE" vs "blue" |
| Synonyms | "large" vs "L" |
| Extra words | "Size M" vs "M" |

- ✅ Effective text comparison must account for word order, abbreviations, and synonyms (exam answer)

---

## 10. String Normalization (Python)

```python
import re

def normalize_text(text):
    text = text.lower()                    # lowercase
    text = re.sub(r'[^\w\s]', '', text)    # remove punctuation
    text = ' '.join(sorted(text.split()))  # sort words alphabetically
    return text.strip()

# Examples:
# normalize_text("Microsoft Corp") -> "corp microsoft"
# normalize_text("MICROSOFT CORP") -> "corp microsoft" (same!) ✅
```

---

## 11. pandas Data Cleaning Reference

```python
import pandas as pd

# Detect issues
df.isnull().sum()                          # missing values per column ✅
df.isnull().sum() / len(df) * 100          # percentage missing
df.duplicated().sum()                      # count duplicates
df.dtypes                                  # check data types
df['col'].value_counts()                   # see value distribution

# Fix missing values
df.dropna()                                # drop rows with ANY NaN
df.dropna(subset=['review_text'])          # drop where a specific column is NaN ✅
df.dropna(how='all')                       # drop only if ALL values are NaN
df.fillna(0)                               # fill with a constant
df['col'].fillna(df['col'].mean())         # fill with the mean
df['col'].fillna(df['col'].median())       # fill with the median (robust to outliers)
df.ffill()                                 # forward fill (time series)
df.bfill()                                 # backward fill

# Fix duplicates
df.drop_duplicates()                       # all columns
df.drop_duplicates(subset=['col'])         # specific column ✅

# Fix data types
df['col'].astype('float')
df['col'].astype('str')
df['col'].astype('category')               # memory-efficient for low-cardinality columns
pd.to_datetime(df['date_col'])             # parse dates
pd.to_numeric(df['col'], errors='coerce')  # force numeric, NaN on failure

# Fix strings
df['col'].str.lower()
df['col'].str.strip()
df['col'].str.replace('Inc.', 'Inc', regex=False)  # literal match, not regex
df['col'].str.contains('pattern')          # boolean mask for filtering
```

---

## 12. Fill Strategy Selection

| Strategy | When to Use |
| :--- | :--- |
| Drop rows | Text analysis, small percentage missing |
| Fill mean | Normally distributed numeric data |
| Fill median | Skewed data, outliers present ✅ |
| Fill mode | Categorical data |
| Forward fill | Time series, ordered data |

---

## 13. Quick Reference Card

```
OpenRefine:
  Purpose    -> interactive data cleaning ✅
  Clustering -> fingerprinting + n-gram ✅
  GREL       -> value.trim().toLowercase().toTitlecase() ✅
  Faceting   -> group and aggregate duplicates ✅
  Operations -> export as JSON, replay on new datasets ✅

Most common data issue:        Inconsistent naming/capitalization ✅
Excel leading zero fix:        Text-to-Columns -> set column format to 'Text' ✅
Deduplication order:           Normalize first -> then deduplicate ✅
Fuzzy matching for word order: fuzz.token_sort_ratio() ✅
Text matching challenges:      Word order, abbreviations, synonyms, case ✅
```
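Tying the steps together, a minimal end-to-end sketch in pandas (hypothetical `company`/`revenue` data; normalize first, fill missing values, then aggregate duplicates rather than delete them):

```python
import pandas as pd

# Hypothetical messy data illustrating the issues above
df = pd.DataFrame({
    "company": [" Apple ", "APPLE", "Apple", "Microsoft Corp"],
    "revenue": [100, 150, None, 200],
})

# 1. Normalize first (trim, lowercase, strip punctuation)
df["company"] = (df["company"].str.strip()
                              .str.lower()
                              .str.replace(r"[^\w\s]", "", regex=True))

# 2. Fill missing numeric values with the median (robust to outliers)
df["revenue"] = df["revenue"].fillna(df["revenue"].median())

# 3. Aggregate duplicates instead of deleting them (preserves revenue)
totals = df.groupby("company", as_index=False)["revenue"].sum()
print(totals)
```

The three "apple" variants collapse into one row whose revenue is the sum of all three records, matching the faceting-and-aggregate workflow rather than the lossy "delete duplicates" approach.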