feat: two-phase indexing with direct thumbnail generation in indexer

Phase 1 (discovery): walkdir + filename-only metadata, zero archive I/O.
Books are visible immediately in the UI while Phase 2 runs in background.

Phase 2 (analysis): open each archive once via analyze_book() to extract
page_count and first page bytes, then generate WebP thumbnail directly in
the indexer — removing the HTTP roundtrip to the API checkup endpoint.

- Add parse_metadata_fast() (infallible, no archive I/O)
- Add analyze_book() returning (page_count, first_page_bytes) in one pass
- Add looks_like_image() magic bytes check for unrar p stdout validation
- Add lsar fallback in list_cbr_images() for UTF-16BE encoded filenames
- Add directory_mtimes table to skip unchanged dirs on incremental scans
- Add analyzer.rs: generate_thumbnail, analyze_library_books, regenerate_thumbnails
- Remove run_checkup() from API; indexer handles thumbnail jobs directly
- Remove api_base_url/api_bootstrap_token from IndexerConfig and AppState
- Add unar + poppler-utils to indexer Dockerfile
- Fix smoke.sh: wait for job completion, check thumbnail_url field

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-09 22:13:05 +01:00
parent 36af34443e
commit cfc896e92f
22 changed files with 1274 additions and 768 deletions


@@ -7,15 +7,16 @@ Service background sur le port **7081**. Voir `AGENTS.md` racine pour les conven
| File | Role |
|------|------|
| `main.rs` | Entry point, initialization, worker startup |
| `lib.rs` | `AppState` (pool, meili, api_base_url) |
| `lib.rs` | `AppState` (pool, meili_url, meili_master_key) |
| `worker.rs` | Main loop: claim job → process → cleanup stale |
| `job.rs` | `claim_next_job`, `process_job`, `fail_job`, `cleanup_stale_jobs` |
| `scanner.rs` | Filesystem scan, parallel parsing (rayon), DB batching |
| `scanner.rs` | Phase 1 discovery: WalkDir + `parse_metadata_fast` (zero archive I/O), skips unchanged directories via mtime, DB batching |
| `analyzer.rs` | Phase 2 analysis: opens each archive once (`analyze_book`), produces page_count + WebP thumbnail |
| `batch.rs` | `flush_all_batches` with UNNEST, `BookInsert/Update/FileInsert/Update/ErrorInsert` structs |
| `scheduler.rs` | Auto-scan: checks every 60s for libraries to monitor |
| `watcher.rs` | Real-time file watcher |
| `meili.rs` | Meilisearch indexing/sync |
| `api.rs` | HTTP calls to the API (for the thumbnail checkup) |
| `api.rs` | HTTP endpoints of the indexer (/health, /ready) |
| `utils.rs` | `remap_libraries_path`, `unmap_libraries_path`, `compute_fingerprint`, `kind_from_format` |
## Job lifecycle
@@ -23,10 +24,21 @@ Service background sur le port **7081**. Voir `AGENTS.md` racine pour les conven
```
claim_next_job (UPDATE ... RETURNING, status pending→running)
└─ process_job
├─ scanner::scan_library (rayon par_iter for parsing)
│ └─ flush_all_batches every BATCH_SIZE=100 iterations
└─ meili sync
└─ api checkup thumbnails (POST /index/jobs/:id/thumbnails/checkup)
├─ Phase 1: scanner::scan_library_discovery
│ ├─ WalkDir + parse_metadata_fast (zero archive I/O)
│ ├─ skip directories via directory_mtimes (DB table)
│ └─ INSERT books (page_count=NULL) → books visible immediately
├─ meili::sync_meili
├─ analyzer::cleanup_orphaned_thumbnails (full_rebuild only)
└─ Phase 2: analyzer::analyze_library_books
├─ SELECT books WHERE page_count IS NULL
├─ parsers::analyze_book → (page_count, first_page_bytes)
├─ generate_thumbnail (WebP, Lanczos3)
└─ UPDATE books SET page_count, thumbnail_path
Special jobs:
thumbnail_rebuild → analyze_library_books(thumbnail_only=true)
thumbnail_regenerate → regenerate_thumbnails (clear + re-analyze)
```
- Cancellation: `is_job_cancelled` is checked every 10 files or every 1s, whichever comes first; it returns `Err("Job cancelled")`
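
The "every 10 files or 1s" rule can be sketched as a small throttle that decides when the DB query is due. This is only an illustration: `CancelThrottle` and its fields are hypothetical names, not taken from the indexer, and the real code may track this differently.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: decides when the (comparatively expensive)
/// `is_job_cancelled` query is due. Not part of the indexer source.
struct CancelThrottle {
    files_since_check: u32,
    last_check: Instant,
    every_n_files: u32,
    max_interval: Duration,
}

impl CancelThrottle {
    fn new(every_n_files: u32, max_interval: Duration) -> Self {
        Self {
            files_since_check: 0,
            last_check: Instant::now(),
            every_n_files,
            max_interval,
        }
    }

    /// Call once per processed file. Returns true when a cancellation
    /// check should be issued, and resets the file counter and the timer.
    fn tick(&mut self) -> bool {
        self.files_since_check += 1;
        let due = self.files_since_check >= self.every_n_files
            || self.last_check.elapsed() >= self.max_interval;
        if due {
            self.files_since_check = 0;
            self.last_check = Instant::now();
        }
        due
    }
}

fn main() {
    let mut throttle = CancelThrottle::new(10, Duration::from_secs(1));
    let checks = (0..25).filter(|_| throttle.tick()).count();
    println!("cancellation checks issued for 25 files: {checks}");
}
```

In the scan loop, a `true` result would be the point at which `is_job_cancelled(pool, job_id)` is awaited.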
@@ -49,14 +61,28 @@ if books_to_insert.len() >= BATCH_SIZE {
All flush operations run in a single transaction.
## Scan filesystem (scanner.rs)
## Filesystem scan: two-phase architecture
Three-stage pipeline:
1. **Collect**: WalkDir → filter by format (CBZ/CBR/PDF)
2. **Parse**: `file_infos.into_par_iter().map(parse_metadata)` (rayon)
3. **Process**: sequential for DB inserts/updates
### Phase 1: Discovery (`scanner.rs`)
Fingerprint = SHA256(size + mtime) to detect changes without rereading the file.
Lightweight pipeline, **no archive is opened**:
1. Load `directory_mtimes` from the DB
2. WalkDir: for each directory, compare the filesystem mtime with the stored mtime → skip if unchanged
3. For each file: `parse_metadata_fast` (title/series/volume from the filename only)
4. INSERT/UPDATE with `page_count = NULL`; books are visible immediately
5. Upsert `directory_mtimes` at the end of the scan
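
The skip decision in step 2 can be sketched as a pure function over the map loaded in step 1. This is a minimal sketch with assumed names (`can_skip_dir` and the sample paths are hypothetical); in a real scan the current mtime would come from something like `std::fs::metadata(dir)?.modified()?`.

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

/// Hypothetical helper: a directory can be skipped only when a stored mtime
/// exists and equals the mtime the filesystem reports now.
fn can_skip_dir(stored: &HashMap<String, SystemTime>, dir: &str, current: SystemTime) -> bool {
    stored.get(dir) == Some(&current)
}

fn main() {
    let t0 = SystemTime::UNIX_EPOCH + Duration::from_secs(1_700_000_000);
    let t1 = t0 + Duration::from_secs(60);

    // Stand-in for the rows loaded from `directory_mtimes`.
    let mut stored = HashMap::new();
    stored.insert("/libraries/comics/a".to_string(), t0);

    let probes = [
        ("/libraries/comics/a", t0), // unchanged
        ("/libraries/comics/a", t1), // mtime moved
        ("/libraries/comics/b", t0), // never seen before
    ];
    for (dir, now) in probes {
        println!("{dir}: skip={}", can_skip_dir(&stored, dir, now));
    }
}
```

One caveat with this kind of check: on POSIX filesystems a directory's mtime changes when entries are added, removed, or renamed, but not when an existing file is rewritten in place, so per-file fingerprints remain the finer-grained signal.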
Fingerprint = SHA256(size + mtime + filename) to detect changes without rereading the file.
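
To illustrate the idea of a metadata-only fingerprint, here is a dependency-free sketch. It deliberately swaps SHA-256 for FNV-1a (64-bit) so that it runs with the standard library alone; the function names and the byte layout are assumptions and do not reflect what `compute_fingerprint` actually does.

```rust
/// FNV-1a (64-bit) step. Used here ONLY as a stand-in for SHA-256,
/// which the standard library does not provide.
fn fnv1a(bytes: &[u8], mut hash: u64) -> u64 {
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Hypothetical fingerprint over size, mtime and filename.
/// No file contents are read.
fn fingerprint(size: u64, mtime_secs: i64, filename: &str) -> String {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    h = fnv1a(&size.to_le_bytes(), h);
    h = fnv1a(&mtime_secs.to_le_bytes(), h);
    h = fnv1a(filename.as_bytes(), h);
    format!("{h:016x}")
}

fn main() {
    let before = fingerprint(52_428_800, 1_700_000_000, "vol01.cbz");
    let after = fingerprint(52_428_801, 1_700_000_000, "vol01.cbz");
    println!("before={before} after={after} changed={}", before != after);
}
```

A file whose fingerprint matches the stored one can be left alone; any change in size, mtime, or name yields a different value and sends the book back through analysis.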
### Phase 2: Analysis (`analyzer.rs`)
Progressive background processing:
- Query `WHERE page_count IS NULL` (or `thumbnail_path IS NULL` for thumbnail jobs)
- Bounded concurrency (`futures::stream::StreamExt::for_each_concurrent`, default 4)
- Per book: `parsers::analyze_book(path, format)` → `(page_count, first_page_bytes)`
- Thumbnail generation: Lanczos3 resize + WebP encode
- UPDATE `books SET page_count, thumbnail_path`
- Config read from `app_settings` (keys `'thumbnail'` and `'limits'`)
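
The resize step scales the first page to fit inside the configured box while keeping its aspect ratio. The arithmetic can be sketched on its own, without the `image` crate; `fit_within` is a hypothetical name, and the `.max(1)` guard is an addition of this sketch rather than something present in `generate_thumbnail`.

```rust
/// Scale (orig_w, orig_h) to fit inside (max_w, max_h), preserving the
/// aspect ratio: the smaller of the two axis ratios is applied to both axes.
fn fit_within(orig_w: u32, orig_h: u32, max_w: u32, max_h: u32) -> (u32, u32) {
    let ratio_w = max_w as f32 / orig_w as f32;
    let ratio_h = max_h as f32 / orig_h as f32;
    let ratio = ratio_w.min(ratio_h);
    // Guard added by this sketch: never return a zero-sized dimension.
    let new_w = ((orig_w as f32 * ratio) as u32).max(1);
    let new_h = ((orig_h as f32 * ratio) as u32).max(1);
    (new_w, new_h)
}

fn main() {
    // 300x400 is the fallback box used when no 'thumbnail' setting is stored.
    for (w, h) in [(1200, 1600), (1600, 800), (600, 1600)] {
        let (tw, th) = fit_within(w, h, 300, 400);
        println!("{w}x{h} -> {tw}x{th}");
    }
}
```

The resulting dimensions are what would then be passed to the Lanczos3 resize before WebP encoding.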
## Path remapping
@@ -69,7 +95,10 @@ utils::unmap_libraries_path(&local_path) // filesystem local → DB
## Gotchas
- **Thumbnails**: generated by the API after handoff, not by the indexer directly. The indexer calls `/index/jobs/:id/thumbnails/checkup` via `api.rs`.
- **full_rebuild**: if `true`, fingerprints are ignored → every file is reprocessed.
- **Thumbnails**: generated **directly by the indexer** (phase 2, `analyzer.rs`). The API no longer handles generation; it only creates the jobs in the DB.
- **page_count = NULL**: after the discovery phase, every new book has `page_count = NULL`. The analysis phase fills it in progressively. Do not mistake it for an error.
- **directory_mtimes**: DB table that stores the mtime of each scanned directory. Cleared on full_rebuild, updated after each scan. Lets incremental scans skip unchanged directories.
- **full_rebuild**: deletes all data, then re-inserts it. Ignores fingerprints and directory_mtimes.
- **Cancellation**: check `is_job_cancelled` regularly to honor user cancellations.
- **Watcher + scheduler**: run as separate tokio tasks in `worker.rs`, in parallel with the main loop.
- **spawn_blocking**: opening an archive (`analyze_book`) and generating a thumbnail are blocking operations; always wrap them in `tokio::task::spawn_blocking`.


@@ -10,6 +10,8 @@ license.workspace = true
anyhow.workspace = true
axum.workspace = true
chrono.workspace = true
futures = "0.3"
image.workspace = true
notify = "6.1"
parsers = { path = "../../crates/parsers" }
rand.workspace = true
@@ -25,3 +27,4 @@ tracing.workspace = true
tracing-subscriber.workspace = true
uuid.workspace = true
walkdir.workspace = true
webp.workspace = true


@@ -21,7 +21,11 @@ RUN --mount=type=cache,target=/sccache \
cargo build --release -p indexer
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates wget unrar-free && rm -rf /var/lib/apt/lists/*
RUN apt-get update && apt-get install -y --no-install-recommends \
ca-certificates wget \
unrar-free unar \
poppler-utils \
&& rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/target/release/indexer /usr/local/bin/indexer
EXPOSE 7081
CMD ["/usr/local/bin/indexer"]


@@ -0,0 +1,442 @@
use anyhow::Result;
use futures::stream::{self, StreamExt};
use image::GenericImageView;
use parsers::{analyze_book, BookFormat};
use sqlx::Row;
use std::path::Path;
use std::sync::atomic::{AtomicI32, Ordering};
use std::sync::Arc;
use tracing::{info, warn};
use uuid::Uuid;
use crate::{utils, AppState};
#[derive(Clone)]
struct ThumbnailConfig {
enabled: bool,
width: u32,
height: u32,
quality: u8,
directory: String,
}
async fn load_thumbnail_config(pool: &sqlx::PgPool) -> ThumbnailConfig {
let fallback = ThumbnailConfig {
enabled: true,
width: 300,
height: 400,
quality: 80,
directory: "/data/thumbnails".to_string(),
};
let row = sqlx::query(r#"SELECT value FROM app_settings WHERE key = 'thumbnail'"#)
.fetch_optional(pool)
.await;
match row {
Ok(Some(row)) => {
let value: serde_json::Value = row.get("value");
ThumbnailConfig {
enabled: value
.get("enabled")
.and_then(|v| v.as_bool())
.unwrap_or(fallback.enabled),
width: value
.get("width")
.and_then(|v| v.as_u64())
.map(|v| v as u32)
.unwrap_or(fallback.width),
height: value
.get("height")
.and_then(|v| v.as_u64())
.map(|v| v as u32)
.unwrap_or(fallback.height),
quality: value
.get("quality")
.and_then(|v| v.as_u64())
.map(|v| v as u8)
.unwrap_or(fallback.quality),
directory: value
.get("directory")
.and_then(|v| v.as_str())
.map(|s| s.to_string())
.unwrap_or_else(|| fallback.directory.clone()),
}
}
_ => fallback,
}
}
async fn load_thumbnail_concurrency(pool: &sqlx::PgPool) -> usize {
let default_concurrency = 4;
let row = sqlx::query(r#"SELECT value FROM app_settings WHERE key = 'limits'"#)
.fetch_optional(pool)
.await;
match row {
Ok(Some(row)) => {
let value: serde_json::Value = row.get("value");
value
.get("concurrent_renders")
.and_then(|v| v.as_u64())
.map(|v| v as usize)
.unwrap_or(default_concurrency)
}
_ => default_concurrency,
}
}
fn generate_thumbnail(image_bytes: &[u8], config: &ThumbnailConfig) -> anyhow::Result<Vec<u8>> {
let img = image::load_from_memory(image_bytes)
.map_err(|e| anyhow::anyhow!("failed to load image: {}", e))?;
let (orig_w, orig_h) = img.dimensions();
let ratio_w = config.width as f32 / orig_w as f32;
let ratio_h = config.height as f32 / orig_h as f32;
let ratio = ratio_w.min(ratio_h);
let new_w = (orig_w as f32 * ratio) as u32;
let new_h = (orig_h as f32 * ratio) as u32;
let resized = img.resize(new_w, new_h, image::imageops::FilterType::Lanczos3);
let rgba = resized.to_rgba8();
let (w, h) = rgba.dimensions();
let rgb_data: Vec<u8> = rgba.pixels().flat_map(|p| [p[0], p[1], p[2]]).collect();
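// Note: this enforces a quality floor of 85; configured values below 85
// (including the fallback of 80) are raised to 85 before encoding.
let quality = f32::max(config.quality as f32, 85.0);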
let webp_data = webp::Encoder::new(&rgb_data, webp::PixelLayout::Rgb, w, h).encode(quality);
Ok(webp_data.to_vec())
}
fn save_thumbnail(
book_id: Uuid,
thumbnail_bytes: &[u8],
config: &ThumbnailConfig,
) -> anyhow::Result<String> {
let dir = Path::new(&config.directory);
std::fs::create_dir_all(dir)?;
let filename = format!("{}.webp", book_id);
let path = dir.join(&filename);
std::fs::write(&path, thumbnail_bytes)?;
Ok(path.to_string_lossy().to_string())
}
fn book_format_from_str(s: &str) -> Option<BookFormat> {
match s {
"cbz" => Some(BookFormat::Cbz),
"cbr" => Some(BookFormat::Cbr),
"pdf" => Some(BookFormat::Pdf),
_ => None,
}
}
/// Phase 2 — Analysis: open each unanalyzed archive once, extract page_count + thumbnail.
/// `thumbnail_only` = true: only process books missing thumbnail (page_count may already be set).
/// `thumbnail_only` = false: process books missing page_count.
pub async fn analyze_library_books(
state: &AppState,
job_id: Uuid,
library_id: Option<Uuid>,
thumbnail_only: bool,
) -> Result<()> {
let config = load_thumbnail_config(&state.pool).await;
if !config.enabled {
info!("[ANALYZER] Thumbnails disabled, skipping analysis phase");
return Ok(());
}
let concurrency = load_thumbnail_concurrency(&state.pool).await;
// Query books that need analysis
let query_filter = if thumbnail_only {
"b.thumbnail_path IS NULL"
} else {
"b.page_count IS NULL"
};
let sql = format!(
r#"
SELECT b.id AS book_id, bf.abs_path, bf.format
FROM books b
JOIN book_files bf ON bf.book_id = b.id
WHERE (b.library_id = $1 OR $1 IS NULL)
AND {}
"#,
query_filter
);
let rows = sqlx::query(&sql)
.bind(library_id)
.fetch_all(&state.pool)
.await?;
if rows.is_empty() {
info!("[ANALYZER] No books to analyze");
return Ok(());
}
let total = rows.len() as i32;
info!(
"[ANALYZER] Analyzing {} books (thumbnail_only={}, concurrency={})",
total, thumbnail_only, concurrency
);
// Update job status
let _ = sqlx::query(
"UPDATE index_jobs SET status = 'generating_thumbnails', total_files = $2, processed_files = 0, current_file = NULL WHERE id = $1",
)
.bind(job_id)
.bind(total)
.execute(&state.pool)
.await;
let processed_count = Arc::new(AtomicI32::new(0));
struct BookTask {
book_id: Uuid,
abs_path: String,
format: String,
}
let tasks: Vec<BookTask> = rows
.into_iter()
.map(|row| BookTask {
book_id: row.get("book_id"),
abs_path: row.get("abs_path"),
format: row.get("format"),
})
.collect();
stream::iter(tasks)
.for_each_concurrent(concurrency, |task| {
let processed_count = processed_count.clone();
let pool = state.pool.clone();
let config = config.clone();
async move {
let local_path = utils::remap_libraries_path(&task.abs_path);
let path = Path::new(&local_path);
let format = match book_format_from_str(&task.format) {
Some(f) => f,
None => {
warn!("[ANALYZER] Unknown format '{}' for book {}", task.format, task.book_id);
return;
}
};
// Run blocking archive I/O on a thread pool
let book_id = task.book_id;
let path_owned = path.to_path_buf();
let analyze_result = tokio::task::spawn_blocking(move || {
analyze_book(&path_owned, format)
})
.await;
let (page_count, image_bytes) = match analyze_result {
Ok(Ok(result)) => result,
Ok(Err(e)) => {
warn!("[ANALYZER] analyze_book failed for book {}: {}", book_id, e);
// Mark parse_status = error in book_files
let _ = sqlx::query(
"UPDATE book_files SET parse_status = 'error', parse_error_opt = $2 WHERE book_id = $1",
)
.bind(book_id)
.bind(e.to_string())
.execute(&pool)
.await;
return;
}
Err(e) => {
warn!("[ANALYZER] spawn_blocking error for book {}: {}", book_id, e);
return;
}
};
// Generate thumbnail
let thumb_result = tokio::task::spawn_blocking({
let config = config.clone();
move || generate_thumbnail(&image_bytes, &config)
})
.await;
let thumb_bytes = match thumb_result {
Ok(Ok(b)) => b,
Ok(Err(e)) => {
warn!("[ANALYZER] thumbnail generation failed for book {}: {}", book_id, e);
// Still update page_count even if thumbnail fails
let _ = sqlx::query(
"UPDATE books SET page_count = $1 WHERE id = $2",
)
.bind(page_count)
.bind(book_id)
.execute(&pool)
.await;
return;
}
Err(e) => {
warn!("[ANALYZER] spawn_blocking thumbnail error for book {}: {}", book_id, e);
return;
}
};
// Save thumbnail file
let save_result = {
let config = config.clone();
tokio::task::spawn_blocking(move || save_thumbnail(book_id, &thumb_bytes, &config))
.await
};
let thumb_path = match save_result {
Ok(Ok(p)) => p,
Ok(Err(e)) => {
warn!("[ANALYZER] save_thumbnail failed for book {}: {}", book_id, e);
let _ = sqlx::query("UPDATE books SET page_count = $1 WHERE id = $2")
.bind(page_count)
.bind(book_id)
.execute(&pool)
.await;
return;
}
Err(e) => {
warn!("[ANALYZER] spawn_blocking save error for book {}: {}", book_id, e);
return;
}
};
// Update DB
if let Err(e) = sqlx::query(
"UPDATE books SET page_count = $1, thumbnail_path = $2 WHERE id = $3",
)
.bind(page_count)
.bind(&thumb_path)
.bind(book_id)
.execute(&pool)
.await
{
warn!("[ANALYZER] DB update failed for book {}: {}", book_id, e);
return;
}
let processed = processed_count.fetch_add(1, Ordering::Relaxed) + 1;
let percent = (processed as f64 / total as f64 * 100.0) as i32;
let _ = sqlx::query(
"UPDATE index_jobs SET processed_files = $2, progress_percent = $3 WHERE id = $1",
)
.bind(job_id)
.bind(processed)
.bind(percent)
.execute(&pool)
.await;
}
})
.await;
let final_count = processed_count.load(Ordering::Relaxed);
info!(
"[ANALYZER] Analysis complete: {}/{} books processed",
final_count, total
);
Ok(())
}
/// Clear thumbnail files and DB references for books in scope, then re-analyze.
pub async fn regenerate_thumbnails(
state: &AppState,
job_id: Uuid,
library_id: Option<Uuid>,
) -> Result<()> {
let config = load_thumbnail_config(&state.pool).await;
// Delete thumbnail files for all books in scope
let book_ids_to_clear: Vec<Uuid> = sqlx::query_scalar(
r#"SELECT id FROM books WHERE (library_id = $1 OR $1 IS NULL) AND thumbnail_path IS NOT NULL"#,
)
.bind(library_id)
.fetch_all(&state.pool)
.await
.unwrap_or_default();
let mut deleted_count = 0usize;
for book_id in &book_ids_to_clear {
let filename = format!("{}.webp", book_id);
let thumbnail_path = Path::new(&config.directory).join(&filename);
if thumbnail_path.exists() {
if let Err(e) = std::fs::remove_file(&thumbnail_path) {
warn!(
"[ANALYZER] Failed to delete thumbnail {}: {}",
thumbnail_path.display(),
e
);
} else {
deleted_count += 1;
}
}
}
info!(
"[ANALYZER] Deleted {} thumbnail files for regeneration",
deleted_count
);
// Clear thumbnail_path in DB
sqlx::query(
r#"UPDATE books SET thumbnail_path = NULL WHERE (library_id = $1 OR $1 IS NULL)"#,
)
.bind(library_id)
.execute(&state.pool)
.await?;
// Re-analyze all books (now thumbnail_path IS NULL for all)
analyze_library_books(state, job_id, library_id, true).await
}
/// Delete orphaned thumbnail files (books deleted in full_rebuild get new UUIDs).
pub async fn cleanup_orphaned_thumbnails(
state: &AppState,
library_id: Option<Uuid>,
) -> Result<()> {
let config = load_thumbnail_config(&state.pool).await;
let existing_book_ids: std::collections::HashSet<Uuid> = sqlx::query_scalar(
r#"SELECT id FROM books WHERE (library_id = $1 OR $1 IS NULL)"#,
)
.bind(library_id)
.fetch_all(&state.pool)
.await
.unwrap_or_default()
.into_iter()
.collect();
let thumbnail_dir = Path::new(&config.directory);
if !thumbnail_dir.exists() {
return Ok(());
}
let mut deleted_count = 0usize;
if let Ok(entries) = std::fs::read_dir(thumbnail_dir) {
for entry in entries.flatten() {
if let Some(file_name) = entry.file_name().to_str() {
if file_name.ends_with(".webp") {
if let Some(book_id_str) = file_name.strip_suffix(".webp") {
if let Ok(book_id) = Uuid::parse_str(book_id_str) {
if !existing_book_ids.contains(&book_id) {
if let Err(e) = std::fs::remove_file(entry.path()) {
warn!(
"Failed to delete orphaned thumbnail {}: {}",
entry.path().display(),
e
);
} else {
deleted_count += 1;
}
}
}
}
}
}
}
}
info!(
"[ANALYZER] Deleted {} orphaned thumbnail files",
deleted_count
);
Ok(())
}


@@ -1,63 +1,60 @@
use anyhow::Result;
use rayon::prelude::*;
use sqlx::{PgPool, Row};
use std::time::Duration;
use tracing::{error, info};
use uuid::Uuid;
use crate::{meili, scanner, AppState};
use crate::{analyzer, meili, scanner, AppState};
pub async fn cleanup_stale_jobs(pool: &PgPool) -> Result<()> {
// Mark jobs that have been running for more than 5 minutes as failed
// This handles cases where the indexer was restarted while jobs were running
let result = sqlx::query(
r#"
UPDATE index_jobs
SET status = 'failed',
finished_at = NOW(),
UPDATE index_jobs
SET status = 'failed',
finished_at = NOW(),
error_opt = 'Job interrupted by indexer restart'
WHERE status = 'running'
WHERE status = 'running'
AND started_at < NOW() - INTERVAL '5 minutes'
RETURNING id
"#
"#,
)
.fetch_all(pool)
.await?;
if !result.is_empty() {
let count = result.len();
let ids: Vec<String> = result.iter()
let ids: Vec<String> = result
.iter()
.map(|row| row.get::<Uuid, _>("id").to_string())
.collect();
info!("[CLEANUP] Marked {} stale job(s) as failed: {}", count, ids.join(", "));
info!(
"[CLEANUP] Marked {} stale job(s) as failed: {}",
count,
ids.join(", ")
);
}
Ok(())
}
pub async fn claim_next_job(pool: &PgPool) -> Result<Option<(Uuid, Option<Uuid>)>> {
let mut tx = pool.begin().await?;
// Atomically select and lock the next job
// Exclude rebuild/full_rebuild if one is already running
// Prioritize: full_rebuild > rebuild > others
let row = sqlx::query(
r#"
SELECT j.id, j.type, j.library_id
FROM index_jobs j
WHERE j.status = 'pending'
AND (
-- Allow rebuilds only if no rebuild is running
(j.type IN ('rebuild', 'full_rebuild') AND NOT EXISTS (
SELECT 1 FROM index_jobs
WHERE status = 'running'
AND type IN ('rebuild', 'full_rebuild')
))
OR
-- Always allow non-rebuild jobs
j.type NOT IN ('rebuild', 'full_rebuild')
)
ORDER BY
ORDER BY
CASE j.type
WHEN 'full_rebuild' THEN 1
WHEN 'rebuild' THEN 2
@@ -66,7 +63,7 @@ pub async fn claim_next_job(pool: &PgPool) -> Result<Option<(Uuid, Option<Uuid>)
j.created_at ASC
FOR UPDATE SKIP LOCKED
LIMIT 1
"#
"#,
)
.fetch_optional(&mut *tx)
.await?;
@@ -79,8 +76,7 @@ pub async fn claim_next_job(pool: &PgPool) -> Result<Option<(Uuid, Option<Uuid>)
let id: Uuid = row.get("id");
let job_type: String = row.get("type");
let library_id: Option<Uuid> = row.get("library_id");
// Final check: if this is a rebuild, ensure no rebuild started between SELECT and UPDATE
if job_type == "rebuild" || job_type == "full_rebuild" {
let has_running_rebuild: bool = sqlx::query_scalar(
r#"
@@ -90,48 +86,55 @@ pub async fn claim_next_job(pool: &PgPool) -> Result<Option<(Uuid, Option<Uuid>)
AND type IN ('rebuild', 'full_rebuild')
AND id != $1
)
"#
"#,
)
.bind(id)
.fetch_one(&mut *tx)
.await?;
if has_running_rebuild {
tx.rollback().await?;
return Ok(None);
}
}
sqlx::query("UPDATE index_jobs SET status = 'running', started_at = NOW(), error_opt = NULL WHERE id = $1")
.bind(id)
.execute(&mut *tx)
.await?;
sqlx::query(
"UPDATE index_jobs SET status = 'running', started_at = NOW(), error_opt = NULL WHERE id = $1",
)
.bind(id)
.execute(&mut *tx)
.await?;
tx.commit().await?;
Ok(Some((id, library_id)))
}
pub async fn fail_job(pool: &PgPool, job_id: Uuid, error_message: &str) -> Result<()> {
sqlx::query("UPDATE index_jobs SET status = 'failed', finished_at = NOW(), error_opt = $2 WHERE id = $1")
.bind(job_id)
.bind(error_message)
.execute(pool)
.await?;
sqlx::query(
"UPDATE index_jobs SET status = 'failed', finished_at = NOW(), error_opt = $2 WHERE id = $1",
)
.bind(job_id)
.bind(error_message)
.execute(pool)
.await?;
Ok(())
}
pub async fn is_job_cancelled(pool: &PgPool, job_id: Uuid) -> Result<bool> {
let status: Option<String> = sqlx::query_scalar(
"SELECT status FROM index_jobs WHERE id = $1"
)
.bind(job_id)
.fetch_optional(pool)
.await?;
let status: Option<String> =
sqlx::query_scalar("SELECT status FROM index_jobs WHERE id = $1")
.bind(job_id)
.fetch_optional(pool)
.await?;
Ok(status.as_deref() == Some("cancelled"))
}
pub async fn process_job(state: &AppState, job_id: Uuid, target_library_id: Option<Uuid>) -> Result<()> {
pub async fn process_job(
state: &AppState,
job_id: Uuid,
target_library_id: Option<Uuid>,
) -> Result<()> {
info!("[JOB] Processing {} library={:?}", job_id, target_library_id);
let job_type: String = sqlx::query_scalar("SELECT type FROM index_jobs WHERE id = $1")
@@ -139,8 +142,8 @@ pub async fn process_job(state: &AppState, job_id: Uuid, target_library_id: Opti
.fetch_one(&state.pool)
.await?;
// Thumbnail jobs: hand off to API and wait for completion (same queue as rebuilds)
if job_type == "thumbnail_rebuild" || job_type == "thumbnail_regenerate" {
// Thumbnail rebuild: generate thumbnails for books missing them
if job_type == "thumbnail_rebuild" {
sqlx::query(
"UPDATE index_jobs SET status = 'generating_thumbnails', started_at = NOW() WHERE id = $1",
)
@@ -148,54 +151,65 @@ pub async fn process_job(state: &AppState, job_id: Uuid, target_library_id: Opti
.execute(&state.pool)
.await?;
let api_base = state.api_base_url.trim_end_matches('/');
let url = format!("{}/index/jobs/{}/thumbnails/checkup", api_base, job_id);
let client = reqwest::Client::new();
let res = client
.post(&url)
.header("Authorization", format!("Bearer {}", state.api_bootstrap_token))
.send()
.await?;
if !res.status().is_success() {
anyhow::bail!("thumbnail checkup API returned {}", res.status());
}
analyzer::analyze_library_books(state, job_id, target_library_id, true).await?;
// Poll until job is finished (API updates the same row)
let poll_interval = Duration::from_secs(1);
loop {
tokio::time::sleep(poll_interval).await;
let status: String = sqlx::query_scalar("SELECT status FROM index_jobs WHERE id = $1")
.bind(job_id)
.fetch_one(&state.pool)
.await?;
if status == "success" || status == "failed" {
info!("[JOB] Thumbnail job {} finished with status {}", job_id, status);
return Ok(());
}
}
sqlx::query(
"UPDATE index_jobs SET status = 'success', finished_at = NOW(), progress_percent = 100, current_file = NULL WHERE id = $1",
)
.bind(job_id)
.execute(&state.pool)
.await?;
return Ok(());
}
// Thumbnail regenerate: clear all thumbnails then re-generate
if job_type == "thumbnail_regenerate" {
sqlx::query(
"UPDATE index_jobs SET status = 'generating_thumbnails', started_at = NOW() WHERE id = $1",
)
.bind(job_id)
.execute(&state.pool)
.await?;
analyzer::regenerate_thumbnails(state, job_id, target_library_id).await?;
sqlx::query(
"UPDATE index_jobs SET status = 'success', finished_at = NOW(), progress_percent = 100, current_file = NULL WHERE id = $1",
)
.bind(job_id)
.execute(&state.pool)
.await?;
return Ok(());
}
let is_full_rebuild = job_type == "full_rebuild";
info!("[JOB] {} type={} full_rebuild={}", job_id, job_type, is_full_rebuild);
info!(
"[JOB] {} type={} full_rebuild={}",
job_id, job_type, is_full_rebuild
);
// For full rebuilds, delete existing data first
// Full rebuild: delete existing data first
if is_full_rebuild {
info!("[JOB] Full rebuild: deleting existing data");
if let Some(library_id) = target_library_id {
// Delete books and files for specific library
sqlx::query("DELETE FROM book_files WHERE book_id IN (SELECT id FROM books WHERE library_id = $1)")
.bind(library_id)
.execute(&state.pool)
.await?;
sqlx::query(
"DELETE FROM book_files WHERE book_id IN (SELECT id FROM books WHERE library_id = $1)",
)
.bind(library_id)
.execute(&state.pool)
.await?;
sqlx::query("DELETE FROM books WHERE library_id = $1")
.bind(library_id)
.execute(&state.pool)
.await?;
info!("[JOB] Deleted existing data for library {}", library_id);
} else {
// Delete all books and files
sqlx::query("DELETE FROM book_files").execute(&state.pool).await?;
sqlx::query("DELETE FROM book_files")
.execute(&state.pool)
.await?;
sqlx::query("DELETE FROM books").execute(&state.pool).await?;
info!("[JOB] Deleted all existing data");
}
@@ -212,24 +226,34 @@ pub async fn process_job(state: &AppState, job_id: Uuid, target_library_id: Opti
.await?
};
// First pass: count total files for progress estimation (parallel)
let library_paths: Vec<String> = libraries.iter()
.map(|library| crate::utils::remap_libraries_path(&library.get::<String, _>("root_path")))
// Count total files for progress estimation
let library_paths: Vec<String> = libraries
.iter()
.map(|library| {
crate::utils::remap_libraries_path(&library.get::<String, _>("root_path"))
})
.collect();
let total_files: usize = library_paths.par_iter()
let total_files: usize = library_paths
.par_iter()
.map(|root_path| {
walkdir::WalkDir::new(root_path)
.into_iter()
.filter_map(Result::ok)
.filter(|entry| entry.file_type().is_file() && parsers::detect_format(entry.path()).is_some())
.filter(|entry| {
entry.file_type().is_file()
&& parsers::detect_format(entry.path()).is_some()
})
.count()
})
.sum();
info!("[JOB] Found {} libraries, {} total files to index", libraries.len(), total_files);
// Update job with total estimate
info!(
"[JOB] Found {} libraries, {} total files to index",
libraries.len(),
total_files
);
sqlx::query("UPDATE index_jobs SET total_files = $2 WHERE id = $1")
.bind(job_id)
.bind(total_files as i32)
@@ -242,26 +266,47 @@ pub async fn process_job(state: &AppState, job_id: Uuid, target_library_id: Opti
removed_files: 0,
errors: 0,
};
// Track processed files across all libraries for accurate progress
let mut total_processed_count = 0i32;
for library in libraries {
// Phase 1: Discovery
for library in &libraries {
let library_id: Uuid = library.get("id");
let root_path: String = library.get("root_path");
let root_path = crate::utils::remap_libraries_path(&root_path);
match scanner::scan_library(state, job_id, library_id, std::path::Path::new(&root_path), &mut stats, &mut total_processed_count, total_files, is_full_rebuild).await {
match scanner::scan_library_discovery(
state,
job_id,
library_id,
std::path::Path::new(&root_path),
&mut stats,
&mut total_processed_count,
total_files,
is_full_rebuild,
)
.await
{
Ok(()) => {}
Err(err) => {
let err_str = err.to_string();
if err_str.contains("cancelled") || err_str.contains("Cancelled") {
return Err(err);
}
stats.errors += 1;
error!(library_id = %library_id, error = %err, "library scan failed");
}
}
}
// Sync search index after discovery (books are visible immediately)
meili::sync_meili(&state.pool, &state.meili_url, &state.meili_master_key).await?;
// Hand off to API for thumbnail checkup (API will set status = 'success' when done)
// For full rebuild: clean up orphaned thumbnail files (old UUIDs)
if is_full_rebuild {
analyzer::cleanup_orphaned_thumbnails(state, target_library_id).await?;
}
// Phase 2: Analysis (extract page_count + thumbnails for new/updated books)
sqlx::query(
"UPDATE index_jobs SET status = 'generating_thumbnails', stats_json = $2, current_file = NULL, processed_files = $3 WHERE id = $1",
)
@@ -271,23 +316,14 @@ pub async fn process_job(state: &AppState, job_id: Uuid, target_library_id: Opti
.execute(&state.pool)
.await?;
let api_base = state.api_base_url.trim_end_matches('/');
let url = format!("{}/index/jobs/{}/thumbnails/checkup", api_base, job_id);
let client = reqwest::Client::new();
let res = client
.post(&url)
.header("Authorization", format!("Bearer {}", state.api_bootstrap_token))
.send()
.await;
if let Err(e) = res {
tracing::warn!("[JOB] Failed to trigger thumbnail checkup: {} — API will not generate thumbnails for this job", e);
} else if let Ok(r) = res {
if !r.status().is_success() {
tracing::warn!("[JOB] Thumbnail checkup returned {} — API may not generate thumbnails", r.status());
} else {
info!("[JOB] Thumbnail checkup started (job {}), API will complete the job", job_id);
}
}
analyzer::analyze_library_books(state, job_id, target_library_id, false).await?;
sqlx::query(
"UPDATE index_jobs SET status = 'success', finished_at = NOW(), progress_percent = 100, current_file = NULL WHERE id = $1",
)
.bind(job_id)
.execute(&state.pool)
.await?;
Ok(())
}


@@ -1,3 +1,4 @@
pub mod analyzer;
pub mod api;
pub mod batch;
pub mod job;
@@ -15,6 +16,4 @@ pub struct AppState {
pub pool: PgPool,
pub meili_url: String,
pub meili_master_key: String,
pub api_base_url: String,
pub api_bootstrap_token: String,
}


@@ -22,8 +22,6 @@ async fn main() -> anyhow::Result<()> {
pool,
meili_url: config.meili_url.clone(),
meili_master_key: config.meili_master_key.clone(),
api_base_url: config.api_base_url.clone(),
api_bootstrap_token: config.api_bootstrap_token.clone(),
};
tokio::spawn(indexer::worker::run_worker(state.clone(), config.scan_interval_seconds));


@@ -100,7 +100,7 @@ pub async fn sync_meili(pool: &PgPool, meili_url: &str, meili_master_key: &str)
const MEILI_BATCH_SIZE: usize = 1000;
for (i, chunk) in docs.chunks(MEILI_BATCH_SIZE).enumerate() {
let batch_num = i + 1;
info!("[MEILI] Sending batch {}/{} ({} docs)", batch_num, (doc_count + MEILI_BATCH_SIZE - 1) / MEILI_BATCH_SIZE, chunk.len());
info!("[MEILI] Sending batch {}/{} ({} docs)", batch_num, doc_count.div_ceil(MEILI_BATCH_SIZE), chunk.len());
let response = client
.post(format!("{base}/indexes/books/documents"))


@@ -1,7 +1,6 @@
use anyhow::{Context, Result};
use chrono::{DateTime, Utc};
use parsers::{detect_format, parse_metadata, BookFormat, ParsedMetadata};
use rayon::prelude::*;
use parsers::{detect_format, parse_metadata_fast};
use serde::Serialize;
use sqlx::Row;
use std::{collections::HashMap, path::Path, time::Duration};
@@ -26,7 +25,11 @@ pub struct JobStats {
const BATCH_SIZE: usize = 100;
pub async fn scan_library(
/// Phase 1 — Discovery: walk filesystem, extract metadata from filenames only (no archive I/O).
/// New books are inserted with page_count = NULL so the analyzer phase can fill them in.
/// Updated books (fingerprint changed) get page_count/thumbnail reset.
#[allow(clippy::too_many_arguments)]
pub async fn scan_library_discovery(
state: &AppState,
job_id: Uuid,
library_id: Uuid,
@@ -36,8 +39,14 @@ pub async fn scan_library(
total_files: usize,
is_full_rebuild: bool,
) -> Result<()> {
info!("[SCAN] Starting scan of library {} at path: {} (full_rebuild={})", library_id, root.display(), is_full_rebuild);
info!(
"[SCAN] Starting discovery scan of library {} at path: {} (full_rebuild={})",
library_id,
root.display(),
is_full_rebuild
);
// Load existing files from DB
let existing_rows = sqlx::query(
r#"
SELECT bf.id AS file_id, bf.book_id, bf.abs_path, bf.fingerprint
@@ -60,15 +69,46 @@ pub async fn scan_library(
(row.get("file_id"), row.get("book_id"), row.get("fingerprint")),
);
}
info!("[SCAN] Found {} existing files in database for library {}", existing.len(), library_id);
info!(
"[SCAN] Found {} existing files in database for library {}",
existing.len(),
library_id
);
} else {
info!("[SCAN] Full rebuild: skipping existing files lookup (all will be treated as new)");
info!("[SCAN] Full rebuild: skipping existing files lookup");
// Delete stale directory mtime records for full rebuild
let _ = sqlx::query("DELETE FROM directory_mtimes WHERE library_id = $1")
.bind(library_id)
.execute(&state.pool)
.await;
}
// Load stored directory mtimes for incremental skip.
// A failed lookup degrades to an empty map, so no directory is skipped.
let dir_mtimes: HashMap<String, DateTime<Utc>> = if !is_full_rebuild {
let rows = sqlx::query(
"SELECT dir_path, mtime FROM directory_mtimes WHERE library_id = $1",
)
.bind(library_id)
.fetch_all(&state.pool)
.await
.unwrap_or_default();
rows.into_iter()
.map(|row| {
let db_path: String = row.get("dir_path");
let local_path = utils::remap_libraries_path(&db_path);
let mtime: DateTime<Utc> = row.get("mtime");
(local_path, mtime)
})
.collect()
} else {
HashMap::new()
};
let mut seen: HashMap<String, bool> = HashMap::new();
let mut library_processed_count = 0i32;
let mut last_progress_update = std::time::Instant::now();
// Batching buffers
let mut books_to_update: Vec<BookUpdate> = Vec::with_capacity(BATCH_SIZE);
let mut files_to_update: Vec<FileUpdate> = Vec::with_capacity(BATCH_SIZE);
@@ -76,37 +116,85 @@ pub async fn scan_library(
let mut files_to_insert: Vec<FileInsert> = Vec::with_capacity(BATCH_SIZE);
let mut errors_to_insert: Vec<ErrorInsert> = Vec::with_capacity(BATCH_SIZE);
// Step 1: Collect all book files first
#[derive(Clone)]
struct FileInfo {
path: std::path::PathBuf,
format: BookFormat,
abs_path: String,
file_name: String,
metadata: std::fs::Metadata,
mtime: DateTime<Utc>,
fingerprint: String,
lookup_path: String,
}
// Track discovered directory mtimes for upsert after scan
let mut new_dir_mtimes: Vec<(String, DateTime<Utc>)> = Vec::new();
// Prefixes (with trailing "/") of directories whose mtime hasn't changed.
// Files under these prefixes are added to `seen` but not reprocessed.
let mut skipped_dir_prefixes: Vec<String> = Vec::new();
let mut file_infos: Vec<FileInfo> = Vec::new();
for entry in WalkDir::new(root).into_iter().filter_map(Result::ok) {
let path = entry.path().to_path_buf();
let local_path = path.to_string_lossy().to_string();
if entry.file_type().is_dir() {
if entry.depth() == 0 {
continue; // skip root itself
}
// Check if parent dir is already skipped (propagate skip to subdirs)
let already_under_skipped = skipped_dir_prefixes
.iter()
.any(|p| local_path.starts_with(p.as_str()));
if let Ok(meta) = entry.metadata() {
if let Ok(sys_mtime) = meta.modified() {
let mtime_utc: DateTime<Utc> = DateTime::from(sys_mtime);
// Only record mtimes for non-skipped dirs (to avoid polluting DB)
if !already_under_skipped {
new_dir_mtimes.push((local_path.clone(), mtime_utc));
}
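// Skip if mtime unchanged (incremental only, not already skipped subtree).
// Caveat: a directory's mtime typically changes only when entries are created, removed, or renamed
// directly inside it, so in-place file edits and changes in nested subdirectories are not detected here.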
if !is_full_rebuild && !already_under_skipped {
if let Some(&stored_mtime) = dir_mtimes.get(&local_path) {
if mtime_utc <= stored_mtime {
trace!("[SCAN] Skipping unchanged dir: {}", local_path);
// Add trailing slash so starts_with check is exact per-segment
skipped_dir_prefixes.push(format!("{}/", local_path));
}
}
}
}
}
continue;
}
if !entry.file_type().is_file() {
continue;
}
let path = entry.path().to_path_buf();
// Check if this file is under a skipped dir
let under_skipped = skipped_dir_prefixes
.iter()
.any(|p| local_path.starts_with(p.as_str()));
if under_skipped {
// Dir unchanged — just mark file as seen so it's not deleted
let abs_path_local = local_path.clone();
let abs_path = utils::unmap_libraries_path(&abs_path_local);
let lookup_path = utils::remap_libraries_path(&abs_path);
seen.insert(lookup_path, true);
continue;
}
let Some(format) = detect_format(&path) else {
trace!("[SCAN] Skipping non-book file: {}", path.display());
continue;
};
info!("[SCAN] Found book file: {} (format: {:?})", path.display(), format);
info!(
"[SCAN] Found book file: {} (format: {:?})",
path.display(),
format
);
stats.scanned_files += 1;
let abs_path_local = path.to_string_lossy().to_string();
let abs_path = utils::unmap_libraries_path(&abs_path_local);
let file_name = path.file_name()
let file_name = path
.file_name()
.map(|s| s.to_string_lossy().to_string())
.unwrap_or_else(|| abs_path.clone());
@@ -119,38 +207,12 @@ pub async fn scan_library(
let fingerprint = utils::compute_fingerprint(&path, metadata.len(), &mtime)?;
let lookup_path = utils::remap_libraries_path(&abs_path);
file_infos.push(FileInfo {
path,
format,
abs_path,
file_name,
metadata,
mtime,
fingerprint,
lookup_path,
});
}
info!("[SCAN] Collected {} files, starting parallel parsing", file_infos.len());
// Step 2: Parse metadata in parallel
let parsed_results: Vec<(FileInfo, Result<ParsedMetadata>)> = file_infos
.into_par_iter()
.map(|file_info| {
let parse_result = parse_metadata(&file_info.path, file_info.format, root);
(file_info, parse_result)
})
.collect();
info!("[SCAN] Completed parallel parsing, processing {} results", parsed_results.len());
// Step 3: Process results sequentially for batch inserts
for (file_info, parse_result) in parsed_results {
library_processed_count += 1;
*total_processed_count += 1;
// Update progress in DB every 1 second or every 10 files
let should_update_progress = last_progress_update.elapsed() > Duration::from_secs(1) || library_processed_count % 10 == 0;
// Progress update
let should_update_progress = last_progress_update.elapsed() > Duration::from_secs(1)
|| library_processed_count % 10 == 0;
if should_update_progress {
let progress_percent = if total_files > 0 {
((*total_processed_count as f64 / total_files as f64) * 100.0) as i32
@@ -159,10 +221,10 @@ pub async fn scan_library(
};
sqlx::query(
"UPDATE index_jobs SET current_file = $2, processed_files = $3, progress_percent = $4 WHERE id = $1"
"UPDATE index_jobs SET current_file = $2, processed_files = $3, progress_percent = $4 WHERE id = $1",
)
.bind(job_id)
.bind(&file_info.file_name)
.bind(&file_name)
.bind(*total_processed_count)
.bind(progress_percent)
.execute(&state.pool)
@@ -171,189 +233,210 @@ pub async fn scan_library(
error!("[BDD] Failed to update progress for job {}: {}", job_id, e);
e
})?;
last_progress_update = std::time::Instant::now();
// Check if job has been cancelled
if is_job_cancelled(&state.pool, job_id).await? {
info!("[JOB] Job {} cancelled by user, stopping...", job_id);
// Flush any pending batches before exiting
flush_all_batches(&state.pool, &mut books_to_update, &mut files_to_update, &mut books_to_insert, &mut files_to_insert, &mut errors_to_insert).await?;
flush_all_batches(
&state.pool,
&mut books_to_update,
&mut files_to_update,
&mut books_to_insert,
&mut files_to_insert,
&mut errors_to_insert,
)
.await?;
return Err(anyhow::anyhow!("Job cancelled by user"));
}
}
let seen_key = utils::remap_libraries_path(&file_info.abs_path);
seen.insert(seen_key.clone(), true);
seen.insert(lookup_path.clone(), true);
if let Some((file_id, book_id, old_fingerprint)) = existing.get(&file_info.lookup_path).cloned() {
if !is_full_rebuild && old_fingerprint == file_info.fingerprint {
trace!("[PROCESS] Skipping unchanged file: {}", file_info.file_name);
// Fast metadata extraction — no archive I/O
let parsed = parse_metadata_fast(&path, format, root);
if let Some((file_id, book_id, old_fingerprint)) =
existing.get(&lookup_path).cloned()
{
if !is_full_rebuild && old_fingerprint == fingerprint {
trace!("[PROCESS] Skipping unchanged file: {}", file_name);
continue;
}
info!("[PROCESS] Updating existing file: {} (full_rebuild={}, fingerprint_match={})", file_info.file_name, is_full_rebuild, old_fingerprint == file_info.fingerprint);
info!(
"[PROCESS] Updating existing file: {} (fingerprint_changed={})",
file_name,
old_fingerprint != fingerprint
);
match parse_result {
Ok(parsed) => {
books_to_update.push(BookUpdate {
book_id,
title: parsed.title,
kind: utils::kind_from_format(file_info.format).to_string(),
series: parsed.series,
volume: parsed.volume,
page_count: parsed.page_count,
});
books_to_update.push(BookUpdate {
book_id,
title: parsed.title,
kind: utils::kind_from_format(format).to_string(),
series: parsed.series,
volume: parsed.volume,
// Reset page_count so analyzer re-processes this book
page_count: None,
});
files_to_update.push(FileUpdate {
file_id,
format: file_info.format.as_str().to_string(),
size_bytes: file_info.metadata.len() as i64,
mtime: file_info.mtime,
fingerprint: file_info.fingerprint,
});
files_to_update.push(FileUpdate {
file_id,
format: format.as_str().to_string(),
size_bytes: metadata.len() as i64,
mtime,
fingerprint,
});
stats.indexed_files += 1;
}
Err(err) => {
warn!("[PARSER] Failed to parse {}: {}", file_info.file_name, err);
stats.errors += 1;
files_to_update.push(FileUpdate {
file_id,
format: file_info.format.as_str().to_string(),
size_bytes: file_info.metadata.len() as i64,
mtime: file_info.mtime,
fingerprint: file_info.fingerprint.clone(),
});
errors_to_insert.push(ErrorInsert {
job_id,
file_path: file_info.abs_path.clone(),
error_message: err.to_string(),
});
// Also need to mark file as error - we'll do this separately
sqlx::query(
"UPDATE book_files SET parse_status = 'error', parse_error_opt = $2 WHERE id = $1"
)
.bind(file_id)
.bind(err.to_string())
.execute(&state.pool)
.await?;
}
// Also clear thumbnail so it gets regenerated
if let Err(e) = sqlx::query(
"UPDATE books SET thumbnail_path = NULL WHERE id = $1",
)
.bind(book_id)
.execute(&state.pool)
.await
{
warn!(
"[BDD] Failed to clear thumbnail for book {}: {}",
book_id, e
);
}
// Flush if batch is full
stats.indexed_files += 1;
if books_to_update.len() >= BATCH_SIZE || files_to_update.len() >= BATCH_SIZE {
flush_all_batches(&state.pool, &mut books_to_update, &mut files_to_update, &mut books_to_insert, &mut files_to_insert, &mut errors_to_insert).await?;
flush_all_batches(
&state.pool,
&mut books_to_update,
&mut files_to_update,
&mut books_to_insert,
&mut files_to_insert,
&mut errors_to_insert,
)
.await?;
}
continue;
}
// New file (thumbnails generated by API after job handoff)
info!("[PROCESS] Inserting new file: {}", file_info.file_name);
// New file — insert with page_count = NULL (analyzer fills it in)
info!("[PROCESS] Inserting new file: {}", file_name);
let book_id = Uuid::new_v4();
let file_id = Uuid::new_v4();
match parse_result {
Ok(parsed) => {
let file_id = Uuid::new_v4();
books_to_insert.push(BookInsert {
book_id,
library_id,
kind: utils::kind_from_format(format).to_string(),
title: parsed.title,
series: parsed.series,
volume: parsed.volume,
page_count: None,
thumbnail_path: None,
});
books_to_insert.push(BookInsert {
book_id,
library_id,
kind: utils::kind_from_format(file_info.format).to_string(),
title: parsed.title,
series: parsed.series,
volume: parsed.volume,
page_count: parsed.page_count,
thumbnail_path: None,
});
files_to_insert.push(FileInsert {
file_id,
book_id,
format: format.as_str().to_string(),
abs_path: abs_path.clone(),
size_bytes: metadata.len() as i64,
mtime,
fingerprint,
parse_status: "ok".to_string(),
parse_error: None,
});
files_to_insert.push(FileInsert {
file_id,
book_id,
format: file_info.format.as_str().to_string(),
abs_path: file_info.abs_path.clone(),
size_bytes: file_info.metadata.len() as i64,
mtime: file_info.mtime,
fingerprint: file_info.fingerprint,
parse_status: "ok".to_string(),
parse_error: None,
});
stats.indexed_files += 1;
stats.indexed_files += 1;
}
Err(err) => {
warn!("[PARSER] Failed to parse {}: {}", file_info.file_name, err);
stats.errors += 1;
let book_id = Uuid::new_v4();
let file_id = Uuid::new_v4();
books_to_insert.push(BookInsert {
book_id,
library_id,
kind: utils::kind_from_format(file_info.format).to_string(),
title: utils::file_display_name(&file_info.path),
series: None,
volume: None,
page_count: None,
thumbnail_path: None,
});
files_to_insert.push(FileInsert {
file_id,
book_id,
format: file_info.format.as_str().to_string(),
abs_path: file_info.abs_path.clone(),
size_bytes: file_info.metadata.len() as i64,
mtime: file_info.mtime,
fingerprint: file_info.fingerprint,
parse_status: "error".to_string(),
parse_error: Some(err.to_string()),
});
errors_to_insert.push(ErrorInsert {
job_id,
file_path: file_info.abs_path,
error_message: err.to_string(),
});
}
}
// Flush if batch is full
if books_to_insert.len() >= BATCH_SIZE || files_to_insert.len() >= BATCH_SIZE {
flush_all_batches(&state.pool, &mut books_to_update, &mut files_to_update, &mut books_to_insert, &mut files_to_insert, &mut errors_to_insert).await?;
flush_all_batches(
&state.pool,
&mut books_to_update,
&mut files_to_update,
&mut books_to_insert,
&mut files_to_insert,
&mut errors_to_insert,
)
.await?;
}
}
// Final flush of any remaining items
flush_all_batches(&state.pool, &mut books_to_update, &mut files_to_update, &mut books_to_insert, &mut files_to_insert, &mut errors_to_insert).await?;
// Flush remaining batches
flush_all_batches(
&state.pool,
&mut books_to_update,
&mut files_to_update,
&mut books_to_insert,
&mut files_to_insert,
&mut errors_to_insert,
)
.await?;
info!("[SCAN] Library {} scan complete: {} files scanned, {} indexed, {} errors",
library_id, library_processed_count, stats.indexed_files, stats.errors);
if !skipped_dir_prefixes.is_empty() {
info!(
"[SCAN] Skipped {} unchanged directories",
skipped_dir_prefixes.len()
);
}
info!(
"[SCAN] Library {} discovery complete: {} files scanned, {} indexed, {} errors",
library_id, library_processed_count, stats.indexed_files, stats.errors
);
// Handle deletions
let mut removed_count = 0usize;
for (abs_path, (file_id, book_id, _)) in existing {
if seen.contains_key(&abs_path) {
for (abs_path, (file_id, book_id, _)) in &existing {
if seen.contains_key(abs_path) {
continue;
}
sqlx::query("DELETE FROM book_files WHERE id = $1")
.bind(file_id)
.execute(&state.pool)
.await?;
sqlx::query("DELETE FROM books WHERE id = $1 AND NOT EXISTS (SELECT 1 FROM book_files WHERE book_id = $1)")
.bind(book_id)
.execute(&state.pool)
.await?;
sqlx::query(
"DELETE FROM books WHERE id = $1 AND NOT EXISTS (SELECT 1 FROM book_files WHERE book_id = $1)",
)
.bind(book_id)
.execute(&state.pool)
.await?;
stats.removed_files += 1;
removed_count += 1;
}
if removed_count > 0 {
info!("[SCAN] Removed {} stale files from database", removed_count);
info!(
"[SCAN] Removed {} stale files from database",
removed_count
);
}
// Upsert directory mtimes for next incremental scan
if !new_dir_mtimes.is_empty() {
let dir_paths_db: Vec<String> = new_dir_mtimes
.iter()
.map(|(local, _)| utils::unmap_libraries_path(local))
.collect();
let mtimes: Vec<DateTime<Utc>> = new_dir_mtimes.iter().map(|(_, m)| *m).collect();
let library_ids: Vec<Uuid> = vec![library_id; new_dir_mtimes.len()];
if let Err(e) = sqlx::query(
r#"
INSERT INTO directory_mtimes (library_id, dir_path, mtime)
SELECT * FROM UNNEST($1::uuid[], $2::text[], $3::timestamptz[])
AS t(library_id, dir_path, mtime)
ON CONFLICT (library_id, dir_path) DO UPDATE SET mtime = EXCLUDED.mtime
"#,
)
.bind(&library_ids)
.bind(&dir_paths_db)
.bind(&mtimes)
.execute(&state.pool)
.await
{
warn!("[SCAN] Failed to upsert directory mtimes: {}", e);
}
}
Ok(())
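
The trailing-slash convention used for `skipped_dir_prefixes` in this file can be illustrated in isolation. This is a minimal sketch, assuming paths are plain `/`-separated strings; `skip_prefix` and `is_under_skipped` are hypothetical helper names, not functions from the commit:

```rust
/// Build the prefix stored for a skipped directory. The trailing "/" makes
/// the later `starts_with` check match whole path segments only.
fn skip_prefix(dir: &str) -> String {
    format!("{}/", dir)
}

/// Returns true when `path` lies strictly inside one of the skipped
/// directories. The directory itself does not match its own prefix.
fn is_under_skipped(path: &str, skipped_dir_prefixes: &[String]) -> bool {
    skipped_dir_prefixes
        .iter()
        .any(|prefix| path.starts_with(prefix.as_str()))
}
```

Without the trailing slash, a skipped `/libraries/comics` would also swallow a sibling directory such as `/libraries/comics2`.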

@@ -138,7 +138,7 @@ fn setup_watcher(
})?;
// Actually watch the library directories
for (_, root_path) in &libraries {
for root_path in libraries.values() {
info!("[WATCHER] Watching directory: {}", root_path);
watcher.watch(std::path::Path::new(root_path), RecursiveMode::Recursive)?;
}
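
The watcher change iterates over `libraries.values()` instead of destructuring and discarding the key. A small sketch of the same pattern follows, assuming the map is a `HashMap<String, String>` from library identifier to root path; the real key type is not visible in this hunk, and `watch_roots` is a hypothetical helper:

```rust
use std::collections::HashMap;

/// Collect the root paths to watch, ignoring the map keys entirely.
/// The result is sorted because `HashMap` iteration order is unspecified.
fn watch_roots(libraries: &HashMap<String, String>) -> Vec<&str> {
    let mut roots: Vec<&str> = libraries.values().map(String::as_str).collect();
    roots.sort_unstable();
    roots
}
```

Using `values()` makes it explicit that only the paths matter, which is what Clippy's `for_kv_map` lint suggests for loops that ignore the key.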