Add PDF full-text search and diff denial #73

Merged
qwc merged 1 commit from feature/pdf-search-and-diff-denial into feature/pdf-upload-support 2026-02-16 18:29:36 +01:00
Owner

Summary

Follow-up to #72 (PDF upload support). Adds the two deferred features, PDF full-text search and graceful diff denial, through the following changes:

  • PDF text extraction for search: Tries pdftotext (poppler-utils) first for best quality and falls back to ledongthuc/pdf (pure Go, BSD-3-Clause) when pdftotext is not installed
  • poppler-utils in Docker image: Adds ~5MB to the runtime image for high-quality extraction; the pure Go fallback ensures the binary still works without it
  • Search indexing re-enabled for PDFs: Removes the contentType != "pdf" guard from both upload handlers
  • Graceful diff denial: Overlay JS detects PDF versions via content_type in the versions API and shows "Diff unavailable for PDF versions" instead of failing with "Could not find content area"
  • API enrichment: Versions API now returns a content_type field for each version

Test plan

  • Upload a PDF, verify it appears in search results
  • On a PDF version page, select compare version → see "Diff unavailable for PDF versions"
  • On an HTML version page, select a PDF compare target → same message
  • On an HTML version page, select another HTML version → normal diff works
  • Reindex all → PDF versions get indexed
  • go test ./... passes
  • Build succeeds
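
The diff-related cases in the test plan reduce to one rule: a diff is denied when either the current version or the compare target is a PDF. In the PR this check lives in the overlay JS; the Go sketch below only restates the rule next to a hypothetical shape for one entry of the versions API response. The Version struct, its tag field, the "html" value, and the diffDenial helper are assumptions for illustration; only the content_type field name, the "pdf" value, and the message text come from the PR.

```go
package main

import "encoding/json"

// Version is a hypothetical shape for one entry in the versions API
// response. Only the content_type field is confirmed by the PR text.
type Version struct {
	Tag         string `json:"tag"`
	ContentType string `json:"content_type"`
}

// diffUnavailableMsg is the message the PR says the overlay shows.
const diffUnavailableMsg = "Diff unavailable for PDF versions"

// diffDenial reports whether a diff between the two versions must be
// refused, and the message to display in that case.
func diffDenial(current, target Version) (string, bool) {
	if current.ContentType == "pdf" || target.ContentType == "pdf" {
		return diffUnavailableMsg, true
	}
	return "", false
}

// encodeVersions serializes versions the way a JSON endpoint might.
func encodeVersions(versions []Version) (string, error) {
	b, err := json.Marshal(versions)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

Under these assumptions, the PDF-to-anything and HTML-to-PDF cases both yield the denial message, and only the HTML-to-HTML case proceeds to a normal diff.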

🤖 Generated with Claude Code

- Extract text from PDFs for Bleve search indexing using pdftotext
  (poppler-utils) with pure Go fallback (ledongthuc/pdf)
- Add poppler-utils to Docker runtime image for best extraction quality
- Re-enable search indexing for PDF uploads (was skipped in #68)
- Add content_type to versions API so overlay JS knows which are PDFs
- Block diff comparison for PDF versions with clear error message
  instead of confusing "Could not find content area" failure

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
qwc merged commit e5fdb8ca33 into feature/pdf-upload-support 2026-02-16 18:29:36 +01:00
qwc deleted branch feature/pdf-search-and-diff-denial 2026-02-16 18:29:36 +01:00
Reference
qwc-open/asiakirjat!73