# VTT to Markdown Workflow

A complete, end-to-end Python workflow that converts WebVTT (`.vtt`) transcript files into clean, filtered Markdown documents suitable for study and reference.

## Quick Start

```bash
# Single command to process all VTT files in a directory
python3 vtt_to_markdown.py /path/to/transcripts/ output.md

# Process specific files
python3 vtt_to_markdown.py file1.vtt file2.vtt file3.vtt output.md

# With custom title
python3 vtt_to_markdown.py --title "Module 1" transcripts/ Module1.md
```

## What It Does

This workflow performs four steps automatically:

1. **Parse VTT files** - Extracts speaker, content, and timestamps from WebVTT format
2. **Filter content** - Removes administrative remarks, fillers, and extraneous content
3. **Sort chronologically** - Orders sessions by date
4. **Generate Markdown** - Creates clean, readable output without timestamps
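
To make step 1 concrete, here is a minimal sketch of cue parsing. It is not the script's actual `parse_vtt_file()` implementation: the function name `parse_vtt_text`, the returned dictionary keys, and the assumption that each cue payload looks like `Speaker: text` (as in Zoom exports) are illustrative only.

```python
import re
from typing import Dict, List

# Matches a cue timing line such as "00:00:01.000 --> 00:00:04.000".
# WebVTT allows the hours field to be omitted, so it is optional here.
TIMING_RE = re.compile(
    r"^((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})"
)


def parse_vtt_text(text: str) -> List[Dict[str, str]]:
    """Split WebVTT text into dicts with speaker, content, start, and end."""
    entries = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        # Locate the timing line; blocks without one (e.g. the WEBVTT header) are skipped.
        for i, line in enumerate(lines):
            match = TIMING_RE.match(line)
            if match:
                break
        else:
            continue
        payload = " ".join(lines[i + 1:])
        if not payload:
            continue
        # Assumed "Speaker: text" convention; otherwise use a placeholder speaker.
        # Caveat: unlabeled text that itself contains ": " would be split too.
        speaker, sep, content = payload.partition(": ")
        if not sep:
            speaker, content = "Unknown", payload
        entries.append({
            "speaker": speaker,
            "content": content.strip(),
            "start": match.group(1),
            "end": match.group(2),
        })
    return entries
```

Timestamps are kept in the parsed entries so that later stages can decide whether to drop them, as the default Markdown output does.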

## Installation

No external dependencies are required; the script uses only the Python standard library (Python 3.8+).

```bash
chmod +x vtt_to_markdown.py
```

## Usage

### Basic Usage

```bash
python3 vtt_to_markdown.py input.vtt output.md
```

### Process Multiple Files

```bash
python3 vtt_to_markdown.py file1.vtt file2.vtt file3.vtt output.md
```

### Process Directory

Process all `.vtt` files in a directory:

```bash
python3 vtt_to_markdown.py /path/to/directory/ output.md
```

### Options

- `--title TITLE` - Custom document title (default: "Constitutional Law Transcript")
- `-q, --quiet` - Suppress progress messages
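
A command line like this can be declared with the standard library's `argparse`. The following is a sketch under the assumption that the script takes one or more inputs followed by a single output path; the destination names `inputs` and `output` and the helper `build_parser` are hypothetical, not taken from the script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for: [options] INPUT [INPUT ...] OUTPUT."""
    parser = argparse.ArgumentParser(
        prog="vtt_to_markdown.py",
        description="Convert VTT transcripts into filtered Markdown.",
    )
    # One or more VTT files or directories; the final positional is the output.
    parser.add_argument("inputs", nargs="+", help="VTT files or directories")
    parser.add_argument("output", help="Markdown file to write")
    parser.add_argument(
        "--title",
        default="Constitutional Law Transcript",
        help="Custom document title",
    )
    parser.add_argument(
        "-q", "--quiet", action="store_true", help="Suppress progress messages"
    )
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With this layout, `argparse` assigns the last positional argument to `output` and everything before it to `inputs`.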

### Examples

```bash
# Process all transcripts with custom title
python3 vtt_to_markdown.py \
    /Users/Seth/Downloads/constitutional-law-2026/zoom-transcripts/ \
    Module1.md \
    --title "Constitutional Law - Module 1"

# Process specific files quietly
python3 vtt_to_markdown.py -q lecture1.vtt lecture2.vtt notes.md
```

## Output Format

The generated Markdown includes:

- **Document header** with title
- **Chronologically sorted sessions** (by date extracted from filename)
- **Section breaks** between different class sessions
- **Clean speaker-content format** (no timestamps)

### Example Output

```markdown
# Constitutional Law Transcript
*Extracted from Zoom class recordings*
---

## Session: January 20, 2026
*Source: GMT20260120-144421_Recording.transcript.vtt*

**Seth Chandler**: Hi, I'm Professor Chandler, this is Constitutional Law.

**Seth Chandler**: And, we're gonna get started...

---

## Session: January 21, 2026
*Source: GMT20260121-162431_Recording.transcript.vtt*

**Seth Chandler**: Let's continue with our discussion of Wickard v. Filburn...
```
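
The layout above can be produced with plain string assembly. This sketch is not the script's `entries_to_markdown()`; the function name `render_markdown` and the `(date, source filename, entries)` session tuple are assumptions made for illustration.

```python
from datetime import date
from typing import Dict, List, Tuple

# Hypothetical session shape: (session date, source filename, entries).
Session = Tuple[date, str, List[Dict[str, str]]]


def render_markdown(sessions: List[Session], title: str) -> str:
    """Render sessions in date order, with a rule between sessions."""
    lines = [f"# {title}", "*Extracted from Zoom class recordings*", "---", ""]
    ordered = sorted(sessions, key=lambda session: session[0])
    for index, (day, source, entries) in enumerate(ordered):
        if index > 0:
            lines += ["---", ""]  # section break between class sessions
        # Use day.day rather than %d so the day is not zero-padded.
        lines.append(f"## Session: {day:%B} {day.day}, {day.year}")
        lines.append(f"*Source: {source}*")
        lines.append("")
        for entry in entries:
            lines.append(f"**{entry['speaker']}**: {entry['content']}")
            lines.append("")
    return "\n".join(lines)
```

Month names come from `%B`, which follows the active locale; under the default locale they are in English.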

## Filtering Logic

The workflow automatically removes:

### Administrative Content
- Attendance codes and announcements
- Office hours mentions
- Scheduling discussions
- Technical/Zoom issues

### Filler Content
- Single-word responses (yes, no, okay, etc.)
- Greetings without substance
- Navigation/computer interaction
- Incomplete fragments

### What's Preserved
- Case discussions and legal principles
- Questions and answers about constitutional concepts
- Teaching explanations (longer content)
- Substantive legal terms and concepts

The filter keeps entries that match any of the following:
- Case names (Wickard, Filburn, Lopez, Morrison, etc.)
- Legal concepts (Commerce Clause, Tenth Amendment, etc.)
- Constitutional terms (Congress, federal, state, statute, etc.)
- Longer explanatory content (>100 characters)
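
The keep/drop rules described above might be sketched as follows. The keyword tuple is a small sample drawn from this section, and the names `SUBSTANTIVE_KEYWORDS`, `FILLER_WORDS`, and `keep_entry` are hypothetical; the script's own `is_extraneous()` and `should_keep_entry()` may apply different or additional rules.

```python
# Illustrative sample of keywords; the real list is assumed to be longer.
SUBSTANTIVE_KEYWORDS = (
    "wickard", "filburn", "lopez", "morrison",
    "commerce clause", "tenth amendment",
    "congress", "federal", "state", "statute",
)
FILLER_WORDS = {"yes", "no", "okay"}
MIN_LONG_LENGTH = 100  # entries longer than this count as explanatory


def keep_entry(content: str) -> bool:
    """Return True if an entry looks substantive enough to keep."""
    text = content.strip().lower()
    # Drop single-word filler responses such as "Okay." or "Yes!".
    if text.strip(".,!? ") in FILLER_WORDS:
        return False
    # Keep anything that mentions a case name or legal term.
    # Note: this is substring matching, so "state" also matches "statement".
    if any(keyword in text for keyword in SUBSTANTIVE_KEYWORDS):
        return True
    # Otherwise keep only longer, explanatory content.
    return len(text) > MIN_LONG_LENGTH
```

Substring matching is simple but over-inclusive; matching on word boundaries with `re` would be a stricter alternative.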

## Advanced Usage

### Using as a Python Module

```python
from vtt_to_markdown import process_vtt_to_markdown

process_vtt_to_markdown(
    input_paths=['/path/to/transcripts/'],
    output_file='output.md',
    title='My Custom Title',
    verbose=True
)
```

### Individual Components

The script is organized into reusable functions:

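```python
# The helper functions live in the all-in-one script
from vtt_to_markdown import parse_vtt_file, filter_entries, entries_to_markdown

# Step 1: Parse VTT
entries = parse_vtt_file('transcript.vtt', 'transcript.vtt')

# Step 2: Filter
filtered = filter_entries(entries)

# Step 3: Convert to Markdown
markdown = entries_to_markdown(filtered, title='My Title')

# Step 4: Save (explicit encoding avoids platform-dependent defaults)
with open('output.md', 'w', encoding='utf-8') as f:
    f.write(markdown)
```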

## Component Scripts

The workflow is also available as separate scripts:

1. **vtt_parser.py** - Parse VTT files to JSON
2. **filter_transcript.py** - Filter JSON to remove extraneous content
3. **create_markdown.py** - Convert JSON to Markdown

### Using Component Scripts

```bash
# Step-by-step workflow
python3 vtt_parser.py --aggregate transcripts/ raw.json
python3 filter_transcript.py raw.json filtered.json
python3 create_markdown.py filtered.json output.md

# Or use the all-in-one script
python3 vtt_to_markdown.py transcripts/ output.md
```

## File Size Expectations

For a typical constitutional law module with 6 class sessions:

- **Raw VTT files**: ~6 files, various sizes
- **Parsed entries**: ~2,900 entries
- **After filtering**: ~1,800 entries (60-65% retained)
- **Final Markdown**: ~320 KB, ~3,600 lines

## Customization

### Adjusting the Filter

Edit the `is_extraneous()` and `should_keep_entry()` functions to customize what gets filtered:

```python
def is_extraneous(content: str, speaker: str) -> bool:
    # Add your custom filtering logic
    if 'my_custom_keyword' in content.lower():
        return True
    return False
```

### Adding Substantive Keywords

Add domain-specific keywords to preserve relevant content:

```python
substantive_indicators = [
    'commerce clause', 'tenth amendment',
    # Add your keywords here
    'your_topic', 'your_case_name'
]
```

### Changing Output Format

Modify `entries_to_markdown()` to change the output structure:

```python
# Example: Add timestamps back
markdown_lines.append(f"**{speaker}** [{start_time}]: {content}\n\n")

# Example: Change section headers
markdown_lines.append(f"\n# Class Session - {date_str}\n")
```

## Troubleshooting

### No VTT files found
- Check file extensions (.vtt or .VTT)
- Verify directory path is correct
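
To check which files would be picked up, a case-insensitive extension match can be run separately. This is a sketch only; `collect_vtt_files` is a hypothetical helper, and it assumes directories are searched non-recursively, which may differ from the script's behavior.

```python
from pathlib import Path
from typing import Iterable, List


def collect_vtt_files(paths: Iterable[str]) -> List[Path]:
    """Expand files and directories into a list of .vtt paths (any letter case)."""
    found: List[Path] = []
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # Top-level files only; subdirectories are not searched.
            found.extend(sorted(
                child for child in path.iterdir()
                if child.is_file() and child.suffix.lower() == ".vtt"
            ))
        elif path.is_file() and path.suffix.lower() == ".vtt":
            found.append(path)
    return found
```

An empty result for a path that should contain transcripts usually points to a wrong directory or an unexpected extension.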

### Filtering too aggressive
- Adjust `substantive_indicators` list
- Modify length thresholds in `should_keep_entry()`

### Date extraction not working
- Ensure filenames follow GMT format: `GMT20260120-144421_Recording.transcript.vtt`
- Or modify `extract_date_from_filename()` for your format
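
For reference, the GMT filename pattern can be parsed with a regular expression and `datetime.strptime`. This sketch is not the script's `extract_date_from_filename()`; the name `parse_recording_datetime` and the choice to return `None` on failure are assumptions.

```python
import re
from datetime import datetime
from typing import Optional

# Matches the "GMT<YYYYMMDD>-<HHMMSS>" stamp used in Zoom recording filenames.
GMT_STAMP_RE = re.compile(r"GMT(\d{8})-(\d{6})")


def parse_recording_datetime(filename: str) -> Optional[datetime]:
    """Return the recording timestamp embedded in a filename, or None."""
    match = GMT_STAMP_RE.search(filename)
    if match is None:
        return None
    try:
        return datetime.strptime("".join(match.groups()), "%Y%m%d%H%M%S")
    except ValueError:
        # Digits were present but do not form a valid date or time.
        return None


print(parse_recording_datetime("GMT20260120-144421_Recording.transcript.vtt"))
# prints: 2026-01-20 14:44:21
```

Files for which this returns `None` have no usable sort key, so a different filename format needs its own pattern.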

## License

MIT License

## Author

Created for constitutional law transcript processing; adaptable to any lecture transcription workflow.
