Index PDF pages individually for page-level search results #86

Merged
qwc merged 1 commit from feature/pdf-search-page-jump into main 2026-02-20 08:31:20 +01:00

Summary

  • PDF text is now extracted per page (splitting on \f for pdftotext, iterating pages for Go fallback) instead of as a single concatenated blob
  • Each PDF page is indexed as a separate search document with a page_number field
  • Search results for PDFs include the page number and link to the PDF viewer with #page=N fragment
  • The PDF viewer reads the hash fragment and passes it to the embedded PDF for direct page jump
  • A search hint banner appears when arriving from search, prompting Ctrl+F to find the exact term
  • All three search UIs updated: full search page, overlay dropdown, navbar dropdown
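
The per-page split described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `PageDoc` type, its field names, and the `splitPages` helper are hypothetical, and it assumes the extractor output ends each page with a form feed (`\f`), so a single trailing form feed is dropped rather than treated as an extra empty page.

```go
package main

import (
	"fmt"
	"strings"
)

// PageDoc is a hypothetical per-page search document. The field names are
// illustrative and are not taken from the PR's code.
type PageDoc struct {
	PageNumber int    // 1-based page number
	Text       string // text extracted from that page
}

// splitPages splits pdftotext-style output on form feeds ("\f").
// Assumption: every page ends with a form feed, so one trailing form feed
// is removed to avoid producing a phantom empty last page.
func splitPages(raw string) []PageDoc {
	raw = strings.TrimSuffix(raw, "\f")
	if raw == "" {
		return nil
	}
	parts := strings.Split(raw, "\f")
	docs := make([]PageDoc, 0, len(parts))
	for i, p := range parts {
		text := strings.TrimSpace(p)
		if text == "" {
			// Skip blank pages, but keep the numbering of later pages intact.
			continue
		}
		docs = append(docs, PageDoc{PageNumber: i + 1, Text: text})
	}
	return docs
}

func main() {
	for _, d := range splitPages("intro\fmethods\f\fresults\f") {
		fmt.Printf("page %d: %s\n", d.PageNumber, d.Text)
	}
}
```

Skipping blank pages while still numbering by position keeps `page_number` aligned with the page the PDF viewer will jump to, which matters more here than having a dense sequence of documents.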

Reindex required

Existing search indexes do not contain per-page documents, so run Rebuild Search Index from Admin > Projects after deploying. Until then, old indexes degrade gracefully: page_number returns 0 and PDF results open without a page jump.
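
One way the graceful degradation could look when building result links is sketched below. The `resultURL` function and the example viewer path are hypothetical and only illustrate the stated behavior: a `page_number` of 0 (as returned by an old index) produces a plain link, while a positive value appends the `#page=N` fragment.

```go
package main

import "fmt"

// resultURL builds the link for a PDF search hit.
// viewerPath is a hypothetical path to the PDF viewer page; pageNumber is
// the page_number value stored in the search index.
func resultURL(viewerPath string, pageNumber int) string {
	if pageNumber <= 0 {
		// Old indexes return 0: link to the document without a page jump.
		return viewerPath
	}
	return fmt.Sprintf("%s#page=%d", viewerPath, pageNumber)
}

func main() {
	fmt.Println(resultURL("/viewer/manual.pdf", 3))
	fmt.Println(resultURL("/viewer/manual.pdf", 0))
}
```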

Closes #79

Test plan

  • Upload a multi-page PDF, rebuild search index
  • Search for text on page 3 or later: the result shows "Page N" and links to #page=N
  • Verify the PDF viewer opens at the correct page
  • Verify HTML search results still work with ?highlight=
  • Verify overlay and navbar search handle both PDF and HTML results
  • Build and tests pass

🤖 Generated with Claude Code

Index PDF pages individually for page-level search results
All checks were successful
CI / build (pull_request) Successful in 40s
CI / docker (pull_request) Has been skipped
CI / test (pull_request) Successful in 54s
c5503a147d
PDF text is now extracted per page instead of as a single blob.
Search results for PDFs include the page number and link directly
to the matching page using #page=N fragments. The PDF viewer
reads the fragment and passes it to the embedded PDF, and shows
a search hint banner when arriving from a search result.

Requires a search index rebuild after deploying.

Closes #79

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
qwc merged commit b1bd483a40 into main 2026-02-20 08:31:20 +01:00
qwc deleted branch feature/pdf-search-page-jump 2026-02-20 08:31:20 +01:00
Reference
qwc-open/asiakirjat!86