Lucet

A lightweight, self-contained full-text search engine with a premium Web UI and REST API. Built entirely in Python — no Java, no Elasticsearch, no external search infrastructure.

Searches any text file using the standard Lucene query language (title:twain, author:dick*, content:whale, word_count:[0 TO 5000]) with instant metadata search and on-demand full-text loading.


Features

  • Full Lucene query syntax — field queries, wildcards, phrases, ranges, boolean operators
  • Two-tier lazy index — metadata for all corpora always in memory (~3 MB); full text loaded only on demand
  • Zero content pre-loading — startup takes ~1 second; no 966 MB RAM spike
  • Pickle cache — second startup loads instantly from index_cache.pkl
  • Located passage snippets — each result shows up to 5 highlighted excerpts, each with a visual mini progress-bar showing where in the document the match was found
  • REST API — easily integrate with other services
  • Premium Web UI — dark glassmorphism, library browser, drag-and-drop upload, document viewer
  • Search-results sidebar — after any search the Library panel narrows to show only matching documents; click "✕ All books" to restore the full list
  • File upload — index your own .txt, .json, or .csv documents

Quick Start

Prerequisites

  • Python 3.12+
  • pip

Install & Run

git clone <repo-url>
cd Lucet
pip install -r requirements.txt

# Using the management script (recommended)
./lucet_ui.sh start      # start in background
./lucet_ui.sh status     # show PID + memory usage
./lucet_ui.sh restart    # restart after code changes
./lucet_ui.sh logs       # tail the log file
./lucet_ui.sh stop       # graceful shutdown

# Or run directly (foreground)
python3 main.py

Open http://localhost:8000 in your browser.

Environment overrides:

LUCENE_PORT=9000 ./lucet_ui.sh start          # different port
LUCENE_HOST=127.0.0.1 ./lucet_ui.sh start     # localhost only
LUCENE_PYTHON=/usr/bin/python3.12 ./lucet_ui.sh start

The first run scans the training_data/ directory (~1–2 sec) and saves a cache. Every subsequent start loads from cache in under 0.1 sec.


Search Query Syntax

Lucet implements the standard Lucene query language via luqum.

Field Queries (search all 2,318 books instantly)

title:dickens
author:twain
filename:moby*
word_count:[0 TO 5000]
size_bytes:[100000 TO *]

Content Queries (search loaded books only)

content:whale
content:"call me ishmael"

Load a book first — click Load next to any book in the library, then run content queries.
Each result shows up to 5 highlighted passages from inside the text, each tagged with its approximate word position and percentage through the document.

Wildcards

Pattern      Meaning
title:mo*    title starts with "mo"
title:mo?y   moby, moly, etc.

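Matching is substring-based and case-insensitive: content:whale also matches "whales", "whaleship", etc. This is intentional: substring matching favours recall over precision, so a loaded document that contains the term anywhere in its text is never missed.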
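
A minimal sketch of this matching behaviour, using only the standard library. The names term_to_regex and matches are hypothetical, not part of Lucet's API. It also assumes that wildcard patterns are anchored at the start of the field value (as the "starts with" row in the table suggests), while plain terms match anywhere; how Lucet anchors patterns internally is not stated here.

```python
import re

def term_to_regex(term: str) -> re.Pattern:
    """Translate a Lucene-style term into a case-insensitive regex.

    '*' matches any run of characters, '?' matches exactly one character,
    and every other character is matched literally.
    """
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def matches(term: str, value: str) -> bool:
    """Return True if the term matches the field value."""
    pattern = term_to_regex(term)
    if "*" in term or "?" in term:
        # Assumption: wildcard patterns must match from the start of the value.
        return pattern.match(value) is not None
    # Plain terms use substring semantics: a hit anywhere in the value counts.
    return pattern.search(value) is not None

print(matches("whale", "The Whaleship Essex"))
print(matches("mo*", "Moby Dick"))
```

Using search() for plain terms is what makes "whale" match "whaleship"; switching to word-boundary matching would raise precision at the cost of missing such variants.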

Boolean Operators

author:twain OR author:dickens
title:moby AND content:whale
NOT author:shakespeare
title:war AND NOT title:peace
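
To illustrate how these operators combine, here is a small sketch that models each clause as a predicate over a document dict. All names (field_contains, all_of, any_of, negate) are hypothetical; Lucet itself parses queries with luqum, and this sketch does not reproduce that parser.

```python
from typing import Callable

Predicate = Callable[[dict], bool]

def field_contains(field: str, term: str) -> Predicate:
    """Leaf clause: case-insensitive substring test on one field."""
    needle = term.lower()
    return lambda doc: needle in str(doc.get(field, "")).lower()

def all_of(*preds: Predicate) -> Predicate:
    """AND: every clause must hold."""
    return lambda doc: all(p(doc) for p in preds)

def any_of(*preds: Predicate) -> Predicate:
    """OR: at least one clause must hold."""
    return lambda doc: any(p(doc) for p in preds)

def negate(pred: Predicate) -> Predicate:
    """NOT: invert a clause."""
    return lambda doc: not pred(doc)

docs = [
    {"title": "War and Peace", "author": "Leo Tolstoy"},
    {"title": "The War of the Worlds", "author": "H. G. Wells"},
    {"title": "A Dog's Tale", "author": "Mark Twain"},
]

# Equivalent of: title:war AND NOT title:peace
war_not_peace = all_of(field_contains("title", "war"),
                       negate(field_contains("title", "peace")))
hits = [d["title"] for d in docs if war_not_peace(d)]
print(hits)  # ['The War of the Worlds']
```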

Phrase Search

"call me ishmael"
content:"it was the best of times"

Range Queries

word_count:[10000 TO 50000]
word_count:[100000 TO *]
size_bytes:[* TO 100000]
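
A short sketch of how a range clause of this shape can be evaluated. Square brackets are inclusive in Lucene syntax and * marks an open end. The helper names parse_bound and in_range are hypothetical.

```python
from typing import Optional

def parse_bound(token: str) -> Optional[int]:
    """'*' means unbounded; anything else is an integer limit."""
    return None if token == "*" else int(token)

def in_range(value: int, low: str, high: str) -> bool:
    """Inclusive [low TO high] test with optional open ends."""
    lo, hi = parse_bound(low), parse_bound(high)
    if lo is not None and value < lo:
        return False
    if hi is not None and value > hi:
        return False
    return True

# word_count:[100000 TO *] against a 209,117-word document
print(in_range(209117, "100000", "*"))
```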

Web UI

Panel            Description
Library          Paginated list of all 2,318 books with filter and Load buttons. After a search, narrows to show only matching documents with a purple banner; click ✕ All books to restore.
Loaded           Shows content-loaded documents; unload to free memory.
Upload           Drag-and-drop .txt, .json, .csv; indexed with full content.
Search Results   Card-based results. Each card shows title, author, word count, and up to 5 located passage snippets, each with a mini progress-bar dot showing where in the document the match falls, plus an estimated word position and percentage.
Document Viewer  Click any result or library entry to read the text.
Footer           Live chips showing which books are content-loaded.

REST API

Base URL: http://localhost:8000

Interactive docs: http://localhost:8000/docs

Search

GET /api/search?q=content:whale&limit=50
{
  "query": "content:whale",
  "total_hits": 2,
  "content_docs_searched": 5,
  "disk_docs_searched": 0,
  "total_docs": 2318,
  "hits": [
    {
      "_id": "abc123",
      "title": "Moby Dick",
      "author": "Herman Melville",
      "word_count": 209117,
      "size_bytes": 1215834,
      "content_loaded": true,
      "uploaded": false,
      "_snippets": [
        {
          "text": "…the great white whale breached the surface once more…",
          "pct": 12,
          "word": 25080
        },
        {
          "text": "…whale oil filled the barrels to the brim…",
          "pct": 47,
          "word": 98240
        }
      ]
    }
  ]
}

_snippets is present on every hit where content_loaded: true. Each entry:

Field   Type        Description
text    string      ~280-char passage centred on the match, with ellipsis
pct     int 0–100   How far through the document this match is
word    int         Estimated word number of the match position

Metadata-only queries (e.g. author:twain) return hits without _snippets.
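
A minimal client sketch for the search endpoint, using only the standard library. The endpoint, parameters, and response fields are taken from the example above. The function names (search_url, search, format_hits) are hypothetical, and search() assumes a Lucet server is running at the base URL.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"

def search_url(query: str, limit: int = 50, base_url: str = BASE_URL) -> str:
    """Build the GET /api/search URL with the query safely percent-encoded."""
    return f"{base_url}/api/search?{urlencode({'q': query, 'limit': limit})}"

def search(query: str, limit: int = 50) -> dict:
    """Call the search endpoint and decode the JSON body (needs a live server)."""
    with urlopen(search_url(query, limit)) as resp:
        return json.load(resp)

def format_hits(response: dict) -> list[str]:
    """Render hits as text lines; _snippets is absent on metadata-only hits."""
    lines = []
    for hit in response["hits"]:
        lines.append(f"{hit['title']} by {hit['author']}")
        for snippet in hit.get("_snippets", []):
            lines.append(
                f"  {snippet['pct']}% (~word {snippet['word']}): {snippet['text']}"
            )
    return lines

print(search_url("content:whale"))
```

Reading _snippets with .get() keeps the same code path working for metadata-only queries such as author:twain.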

Load a Document's Content

POST /api/documents/{id}/load

Reads the file from disk into memory. After this, content queries will match this document and results will include _snippets.

Unload a Document's Content

DELETE /api/documents/{id}/unload

Evicts the full text from memory; metadata remains.
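
The load and unload calls differ only in HTTP method and path suffix. The sketch below builds both requests with the standard library without sending them. The helper names are hypothetical, and the response bodies are not documented here, so nothing is assumed about them.

```python
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://localhost:8000"

def load_request(doc_id: str, base_url: str = BASE_URL) -> Request:
    """POST /api/documents/{id}/load: read the file from disk into memory."""
    return Request(f"{base_url}/api/documents/{quote(doc_id, safe='')}/load",
                   method="POST")

def unload_request(doc_id: str, base_url: str = BASE_URL) -> Request:
    """DELETE /api/documents/{id}/unload: evict the text, keep the metadata."""
    return Request(f"{base_url}/api/documents/{quote(doc_id, safe='')}/unload",
                   method="DELETE")

req = load_request("abc123")
print(req.get_method(), req.full_url)
```

Passing either object to urllib.request.urlopen() sends it to a running server.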

List Documents

GET /api/documents?page=1&per_page=50&q=twain&loaded_only=false

Upload a File

POST /api/upload
Content-Type: multipart/form-data

file=@myfile.txt

Supports .txt, .json (object or array), .csv (each row = one document).
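
A sketch of how the three upload formats can map onto documents. This is an illustration under stated assumptions, not Lucet's actual upload handler: documents_from_upload is a hypothetical name, the CSV is assumed to have a header row, and a .txt upload is assumed to take its title from the filename.

```python
import csv
import io
import json

def documents_from_upload(filename: str, data: str) -> list[dict]:
    """Turn an uploaded file's text into a list of document dicts."""
    suffix = filename.rsplit(".", 1)[-1].lower()
    if suffix == "json":
        parsed = json.loads(data)
        # A JSON array yields many documents; a single object yields one.
        return parsed if isinstance(parsed, list) else [parsed]
    if suffix == "csv":
        # Each row becomes one document keyed by the header row (assumption).
        return [dict(row) for row in csv.DictReader(io.StringIO(data))]
    if suffix == "txt":
        # Assumption: the filename doubles as the title.
        return [{"title": filename, "content": data}]
    raise ValueError(f"unsupported file type: {filename}")

rows = documents_from_upload("books.csv", "title,content\nDoc 1,hello\nDoc 2,world\n")
print(len(rows))
```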

Index JSON Directly

POST /api/index
Content-Type: application/json

{"title": "My Report", "author": "Alice", "content": "Hello world..."}

Stats

GET /api/stats
{
  "total_docs": 2318,
  "content_loaded": 3,
  "uploaded": 1,
  "cache_exists": true
}

Unload All Content

DELETE /api/index/content

Frees all loaded book content from memory (preserves uploaded documents).


Project Structure

Lucet/
├── lucene_engine.py      # Pure-Python Lucene query engine (embeddable standalone)
├── main.py               # FastAPI application + REST API
├── requirements.txt      # Python dependencies (server)
├── pyproject.toml        # Package metadata for library use / PyPI
├── LICENSE               # MIT
├── .gitignore
├── lucet_ui.sh           # Process management script (start/stop/logs)
├── static/
│   ├── index.html        # Single-page Web UI
│   ├── style.css         # Dark glassmorphism styling
│   └── app.js            # UI logic (vanilla JS, no framework)
├── training_data/        # 15 bootstrap texts included; add more from Project Gutenberg
├── context.txt           # Original product requirements
├── CLAUDE.md             # AI agent instructions
└── README.md             # This file

Corpus

The repo ships with 15 bootstrap texts in training_data/ so the demo works out of the box:

Text                                    Author                     Genre
2 B R 0 2 B                             Kurt Vonnegut              Sci-fi
A Dog's Tale                            Mark Twain                 Humor
Extracts from Adam's Diary              Mark Twain                 Humor
A Modest Proposal                       Jonathan Swift             Satire
The Gift of the Magi                    O. Henry                   Short story
The Monkey's Paw                        W.W. Jacobs                Horror
His Last Bow                            Arthur Conan Doyle         Mystery
The Parenticide Club                    Ambrose Bierce             Dark humor
Beyond Lies the Wub                     Philip K. Dick             Sci-fi
Crystal Crypt                           Philip K. Dick             Sci-fi
The Rime of the Ancient Mariner         Samuel Taylor Coleridge    Poetry
Songs of Innocence and Experience       William Blake              Poetry
Give Me Liberty                         Patrick Henry              Historical speech
Declaration of Independence             United States of America   Historical document
The Constitution of the United States   United States of America   Historical document

To add more, place additional .txt files from Project Gutenberg in training_data/ using the naming convention:

Title-Words-Hyphenated-by-Author-Name.txt
# e.g. Moby-Dick-by-Herman-Melville.txt
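
A minimal sketch of parsing this naming convention. parse_filename is a hypothetical helper, and the "Unknown" author fallback for files that do not follow the convention is an assumption.

```python
from pathlib import Path

def parse_filename(filename: str) -> tuple[str, str]:
    """Split 'Title-Words-by-Author-Name.txt' into (title, author)."""
    stem = Path(filename).stem
    # rpartition splits on the LAST '-by-', so a title containing 'by' survives.
    title_part, sep, author_part = stem.rpartition("-by-")
    if not sep:
        # No '-by-' separator: treat the whole stem as the title (assumption).
        return stem.replace("-", " "), "Unknown"
    return title_part.replace("-", " "), author_part.replace("-", " ")

print(parse_filename("Moby-Dick-by-Herman-Melville.txt"))
# ('Moby Dick', 'Herman Melville')
```

Because hyphens separate words in the filename, a hyphen that belongs to the real title (as in "Moby-Dick") cannot be recovered from the name alone.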

The full corpus used in development is 2,318 texts (~966 MB) — too large for git. The server and all API endpoints work with any number of documents including zero (upload your own via the UI or /api/index).


Use Lucet as a Library

lucene_engine.py is a self-contained module with no web-framework dependency. Drop it into any Python project and use it as an embedded search index:

from lucene_engine import LuceneEngine

engine = LuceneEngine()

# Index documents — any dict with any fields
engine.add_document({"title": "My Doc", "author": "Alice", "content": "Hello world, this is a test."})
engine.add_document({"title": "Another Doc", "author": "Bob", "content": "Foo bar baz content here."})

# Search with full Lucene query syntax
result = engine.search("content:hello")
print(result.hits)          # list of matching document dicts (content excluded)
print(result.total_docs)    # 2

# Field queries, booleans, wildcards, ranges — all work
result = engine.search('author:Alice AND content:test')
result = engine.search('title:doc* AND NOT author:bob')
result = engine.search('word_count:[1000 TO *]')

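# Each hit on a content-loaded doc includes located passage snippets.
# Run a content query first: the word_count query above is metadata-only
# and returns hits without _snippets.
result = engine.search("content:hello")
for hit in result.hits:
    for snippet in hit.get("_snippets", []):
        print(f"  ~word {snippet['word']} ({snippet['pct']}%): {snippet['text']}")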

LuceneEngine requires only luqum (pip install luqum). No server, no Java, no network.


How the Index Works

Two-Tier Design

The index never pre-loads file content:

Startup (~1 sec)
└── Scan training_data/ filenames
    └── Parse title + author from filename
        └── os.stat() for size
            └── Store metadata dict (no file reads)
                └── Save to index_cache.pkl
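
The startup flow above can be sketched as follows. This is an illustration under assumptions, not the code in main.py: the function names, the metadata keys, and the choice to key the index by filename are all hypothetical.

```python
import os
import pickle
from pathlib import Path

def scan_metadata(corpus_dir: Path) -> dict[str, dict]:
    """Build metadata from filenames and os.stat() only; no file is opened."""
    index = {}
    for path in sorted(corpus_dir.glob("*.txt")):
        title, _, author = path.stem.rpartition("-by-")
        index[path.name] = {
            "title": (title or path.stem).replace("-", " "),
            "author": author.replace("-", " ") if title else "",
            "size_bytes": os.stat(path).st_size,
            "content_loaded": False,
        }
    return index

def load_or_build(corpus_dir: Path, cache_path: Path) -> dict[str, dict]:
    """Return the cached index if one exists, else scan and write the cache."""
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    index = scan_metadata(corpus_dir)
    with cache_path.open("wb") as fh:
        pickle.dump(index, fh)
    return index
```

In this sketch a cache hit skips the directory scan entirely, so files added later are not seen until the cache file is deleted. Note also that pickle.load() can execute arbitrary code, so a cache file should only be loaded if it comes from a trusted source.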

When a user clicks Load:

POST /api/documents/{id}/load
└── Read file from disk (once)
    └── Store text in memory
        └── content_loaded = True
            └── Now searchable with content: queries
                └── Results include located passage snippets (_snippets)
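
A sketch of the second tier: an entry holds metadata permanently and holds text only between load and unload. DocEntry and its fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

@dataclass
class DocEntry:
    """Metadata is always present; content is None until loaded."""
    path: Path
    title: str
    content: Optional[str] = None

    @property
    def content_loaded(self) -> bool:
        return self.content is not None

    def load(self) -> None:
        """Read the file from disk once; repeated calls are no-ops."""
        if self.content is None:
            self.content = self.path.read_text(encoding="utf-8", errors="replace")

    def unload(self) -> None:
        """Drop the text so it can be garbage-collected; metadata stays."""
        self.content = None
```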

Memory Usage

State                                         RAM
Server start (2,318 books, metadata only)     ~3 MB
After loading one typical novel (~400 KB)     ~3.4 MB
After loading 10 novels                       ~7 MB
After loading ALL books (if you wanted to)    ~969 MB

Search Logic

Query type                             Docs searched                                                             Snippets included
title:, author:, word_count:, etc.     All 2,318 (metadata tier)                                                 No
content: or bare terms                 Only content-loaded documents                                             Yes, up to 5 located passages
Mixed (title:moby AND content:whale)   Metadata filter applied to all; content filter applied to loaded subset   Yes
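
The rows above can be sketched as a single filter pass. search_tiers and its parameters are hypothetical, and a document whose content is None stands for one that has not been loaded.

```python
from typing import Callable, Optional

def search_tiers(docs: list[dict],
                 meta_pred: Optional[Callable[[dict], bool]] = None,
                 content_term: Optional[str] = None) -> list[str]:
    """Apply the metadata filter to every doc, the content filter to loaded ones."""
    hits = []
    for doc in docs:
        if meta_pred is not None and not meta_pred(doc):
            continue
        if content_term is not None:
            text = doc.get("content")
            if text is None:
                # Not content-loaded: a content clause cannot match this doc.
                continue
            if content_term.lower() not in text.lower():
                continue
        hits.append(doc["title"])
    return hits

docs = [
    {"title": "Moby Dick", "content": "The great white whale breached."},
    {"title": "Moby Dick (abridged)", "content": None},
    {"title": "Whale Facts", "content": "A whale is a mammal."},
]

def title_has_moby(doc: dict) -> bool:
    return "moby" in doc["title"].lower()

# Equivalent of: title:moby AND content:whale
print(search_tiers(docs, title_has_moby, "whale"))  # ['Moby Dick']
```

The abridged edition matches the title clause but drops out of the mixed query because its text was never loaded, which is the behaviour the table describes.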

Snippet Extraction

For every content-loaded document that matches a query, the engine:

  1. Runs re.finditer() for all query terms across the full document text
  2. Sorts and deduplicates match positions (merging hits within half a snippet-window of each other)
  3. Extracts up to 5 passages of ~280 characters each, walking back to natural word/line boundaries
  4. Records pct (position ÷ document length × 100) and word (position ÷ 5.5, same ratio as word-count estimation) for each passage

Project Gutenberg boilerplate headers are detected and skipped when no match is found in the body text.
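
A simplified sketch of the four steps listed above. It is not the engine's code: extract_snippets and the constants are hypothetical names, rounding of pct is an assumption, boundary handling only walks back to the previous whitespace, and the Gutenberg-header handling is left out.

```python
import re

WINDOW = 280          # approximate snippet length in characters
MAX_SNIPPETS = 5
CHARS_PER_WORD = 5.5  # ratio quoted above for word-position estimates

def extract_snippets(text: str, terms: list[str]) -> list[dict]:
    terms = [t for t in terms if t]
    if not text or not terms:
        return []
    # Step 1: find every occurrence of any query term.
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    positions = sorted(m.start() for m in pattern.finditer(text))
    # Step 2: merge hits that fall within half a window of the last kept hit.
    kept: list[int] = []
    for pos in positions:
        if not kept or pos - kept[-1] >= WINDOW // 2:
            kept.append(pos)
    snippets = []
    for pos in kept[:MAX_SNIPPETS]:
        # Step 3: centre a window on the match, then walk back to whitespace.
        start = max(0, pos - WINDOW // 2)
        end = min(len(text), start + WINDOW)
        while start > 0 and not text[start - 1].isspace():
            start -= 1
        passage = " ".join(text[start:end].split())
        snippets.append({
            "text": ("…" if start > 0 else "") + passage
                    + ("…" if end < len(text) else ""),
            # Step 4: position as a percentage and as an estimated word number.
            "pct": round(pos / len(text) * 100),
            "word": int(pos / CHARS_PER_WORD),
        })
    return snippets
```

Merging nearby hits before cutting passages keeps the five slots from being spent on one dense paragraph.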


Dependencies

Package            Purpose
fastapi            REST API framework
uvicorn            ASGI server
luqum              Lucene query parser (AST)
python-multipart   File upload support

No Java. No Elasticsearch. No ML models. No vector databases.


Adding Your Own Documents

Via the UI: drag and drop any .txt, .json, or .csv into the Upload panel.

Via the API:

# Upload a text file
curl -X POST http://localhost:8000/api/upload \
  -F "file=@my_document.txt"

# Index a JSON object directly
curl -X POST http://localhost:8000/api/index \
  -H "Content-Type: application/json" \
  -d '{"title": "My Doc", "author": "Me", "content": "Full text here..."}'

# Bulk index a JSON array
curl -X POST http://localhost:8000/api/index/bulk \
  -H "Content-Type: application/json" \
  -d '[{"title": "Doc 1", "content": "..."}, {"title": "Doc 2", "content": "..."}]'

Uploaded documents always keep their full content in memory, so they are searchable immediately with any query type.


Resetting the Cache

rm index_cache.pkl
python3 main.py   # rebuilds in ~1-2 sec

License

MIT License

Copyright (c) 2026 Lucet Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Training data is from Project Gutenberg and is in the public domain.
