Lucet

A lightweight, self-contained full-text search engine with a premium Web UI and REST API. Built entirely in Python — no Java, no Elasticsearch, no external search infrastructure.

Searches any text file using the standard Lucene query language (title:twain, author:dick*, content:whale, word_count:[0 TO 5000]) with instant metadata search and on-demand full-text loading.

Features

Full Lucene query syntax — field queries, wildcards, phrases, ranges, boolean operators
Two-tier lazy index — metadata for all corpora always in memory (~3 MB); full text loaded only on demand
Zero content pre-loading — startup takes ~1 second; no 966 MB RAM spike
Pickle cache — second startup loads instantly from index_cache.pkl
Located passage snippets — each result shows up to 5 highlighted excerpts, each with a visual mini progress-bar showing where in the document the match was found
REST API — easily integrate with other services
Premium Web UI — dark glassmorphism, library browser, drag-and-drop upload, document viewer
Search-results sidebar — after any search the Library panel narrows to show only matching documents; click "✕ All books" to restore the full list
File upload — index your own .txt, .json, or .csv documents

Quick Start

Prerequisites

Python 3.12+
pip

Install & Run

git clone <repo-url>
cd Lucet
pip install -r requirements.txt

# Using the management script (recommended)
./lucet_ui.sh start      # start in background
./lucet_ui.sh status     # show PID + memory usage
./lucet_ui.sh restart    # restart after code changes
./lucet_ui.sh logs       # tail the log file
./lucet_ui.sh stop       # graceful shutdown

# Or run directly (foreground)
python3 main.py

Open http://localhost:8000 in your browser.

Environment overrides:

LUCENE_PORT=9000 ./lucet_ui.sh start          # different port
LUCENE_HOST=127.0.0.1 ./lucet_ui.sh start     # localhost only
LUCENE_PYTHON=/usr/bin/python3.12 ./lucet_ui.sh start

The first run scans the training_data/ directory (~1–2 sec) and saves a cache. Every subsequent start loads from cache in under 0.1 sec.

Search Query Syntax

Lucet implements the standard Lucene query language via luqum.

Field Queries (search all 2,318 books instantly)

title:dickens
author:twain
filename:moby*
word_count:[0 TO 5000]
size_bytes:[100000 TO *]

Content Queries (search loaded books only)

content:whale
content:"call me ishmael"

Load a book first — click Load next to any book in the library, then run content queries.
Each result shows up to 5 highlighted passages from inside the text, each tagged with its approximate word position and percentage through the document.

Wildcards

Pattern	Meaning
`title:mo*`	title starts with "mo"
`title:mo?y`	`moby`, `moly`, etc.

Matching is substring-based and case-insensitive: content:whale also matches "whales", "whaleship", etc. This is intentional. It trades precision for recall, so no document containing the term is missed.
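A minimal sketch of how these matching rules could be expressed with Python's re module. The helper names wildcard_to_regex and field_matches are hypothetical (they are not taken from lucene_engine.py), and anchoring wildcard terms to the start of a word is an assumption made for illustration.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a Lucene-style wildcard term into a compiled regex.

    '*' stands for any run of non-space characters, '?' for exactly one.
    """
    escaped = re.escape(pattern)  # 'mo*' becomes 'mo\*'
    escaped = escaped.replace(r"\*", r"\S*").replace(r"\?", r"\S")
    # Assumption: a wildcard term must begin at a word boundary.
    return re.compile(r"\b" + escaped, re.IGNORECASE)


def field_matches(value: str, term: str) -> bool:
    """Return True if the term matches anywhere in the field value."""
    if "*" in term or "?" in term:
        return wildcard_to_regex(term).search(value) is not None
    # Plain terms: case-insensitive substring match.
    return term.lower() in value.lower()
```

With these helpers, field_matches("Moby Dick", "mo?y") and field_matches("whaleship", "whale") are both true.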

Boolean Operators

author:twain OR author:dickens
title:moby AND content:whale
NOT author:shakespeare
title:war AND NOT title:peace
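One way to picture the operators is as set algebra over document ids. The sketch below is illustrative only: the tuple-based tree and the postings dict are hypothetical stand-ins for the luqum AST and the real index, not the engine's internal structures.

```python
def evaluate(node, all_ids: set, postings: dict) -> set:
    """Evaluate a tiny boolean tree against per-term sets of document ids.

    A node is either a term string or a tuple:
    ("AND", a, b), ("OR", a, b) or ("NOT", a).
    """
    if isinstance(node, str):
        return set(postings.get(node, set()))
    op = node[0]
    if op == "AND":
        return evaluate(node[1], all_ids, postings) & evaluate(node[2], all_ids, postings)
    if op == "OR":
        return evaluate(node[1], all_ids, postings) | evaluate(node[2], all_ids, postings)
    if op == "NOT":
        # NOT is the complement relative to every document in the index.
        return all_ids - evaluate(node[1], all_ids, postings)
    raise ValueError(f"unknown operator: {op!r}")


# title:war AND NOT title:peace
postings = {"title:war": {1, 2}, "title:peace": {2, 3}}
matches = evaluate(("AND", "title:war", ("NOT", "title:peace")), {1, 2, 3, 4}, postings)
```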

Phrase Search

"call me ishmael"
content:"it was the best of times"

Range Queries

word_count:[10000 TO 50000]
word_count:[100000 TO *]
size_bytes:[* TO 100000]
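As a rough illustration of how such a range could be evaluated against a numeric field, the sketch below parses the bracket syntax with a regular expression. In Lucet the parsing is done by luqum; parse_range and in_range are hypothetical names used only for this example.

```python
import re

# Inclusive range syntax: [low TO high], where either side may be '*'.
RANGE_RE = re.compile(r"^\[\s*(\S+)\s+TO\s+(\S+)\s*\]$")


def parse_range(expr: str):
    """Return (low, high) as ints, with None meaning an open bound."""
    match = RANGE_RE.match(expr.strip())
    if match is None:
        raise ValueError(f"not a range expression: {expr!r}")
    low, high = match.groups()
    return (None if low == "*" else int(low), None if high == "*" else int(high))


def in_range(value: int, bounds) -> bool:
    """Square brackets are inclusive on both ends."""
    low, high = bounds
    return (low is None or value >= low) and (high is None or value <= high)
```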

Web UI

Panel	Description
Library	Paginated list of all 2,318 books with filter and Load buttons. After a search, narrows to show only matching documents with a purple banner; click ✕ All books to restore.
Loaded	Shows content-loaded documents; unload to free memory
Upload	Drag-and-drop `.txt`, `.json`, `.csv` — indexed with full content
Search Results	Card-based results. Each card shows title, author, word count, and up to 5 located passage snippets — each with a mini progress-bar dot showing where in the document the match falls, plus an estimated word position and percentage.
Document Viewer	Click any result or library entry to read the text
Footer	Live chips showing which books are content-loaded

REST API

Base URL: http://localhost:8000

Interactive docs: http://localhost:8000/docs

Search

GET /api/search?q=content:whale&limit=50

{
  "query": "content:whale",
  "total_hits": 2,
  "content_docs_searched": 5,
  "disk_docs_searched": 0,
  "total_docs": 2318,
  "hits": [
    {
      "_id": "abc123",
      "title": "Moby Dick",
      "author": "Herman Melville",
      "word_count": 209117,
      "size_bytes": 1215834,
      "content_loaded": true,
      "uploaded": false,
      "_snippets": [
        {
          "text": "…the great white whale breached the surface once more…",
          "pct": 12,
          "word": 25080
        },
        {
          "text": "…whale oil filled the barrels to the brim…",
          "pct": 47,
          "word": 98240
        }
      ]
    }
  ]
}

_snippets is present on hits where content_loaded is true and the query searches content (a content: clause or bare terms). Each entry:

Field	Type	Description
`text`	string	~280-char passage centred on the match, with `…` ellipsis
`pct`	int 0–100	How far through the document this match is
`word`	int	Estimated word number of the match position

Metadata-only queries (e.g. author:twain) return hits without _snippets.
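A small client sketch for this endpoint, using only the standard library. The function names are hypothetical, and the response is assumed to have the shape shown in the example above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"


def build_search_url(query: str, limit: int = 50, base_url: str = BASE_URL) -> str:
    """URL-encode the Lucene query so ':' and spaces survive transport."""
    return f"{base_url}/api/search?{urlencode({'q': query, 'limit': limit})}"


def summarize_hits(response: dict) -> list[str]:
    """Flatten a search response into printable lines."""
    lines = []
    for hit in response.get("hits", []):
        lines.append(f"{hit['title']} by {hit['author']}")
        # Metadata-only hits carry no _snippets key.
        for snippet in hit.get("_snippets", []):
            lines.append(f"  {snippet['pct']}% (~word {snippet['word']}): {snippet['text']}")
    return lines


def search(query: str, limit: int = 50) -> dict:
    """Call the running server; requires Lucet to be listening on BASE_URL."""
    with urlopen(build_search_url(query, limit)) as resp:
        return json.load(resp)
```

With the server running, print("\n".join(summarize_hits(search("content:whale")))) lists each hit followed by its located passages.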

Load a Document's Content

POST /api/documents/{id}/load

Reads the file from disk into memory. After this, content queries will match this document and results will include _snippets.

Unload a Document's Content

DELETE /api/documents/{id}/unload

Evicts the full text from memory; metadata remains.

List Documents

GET /api/documents?page=1&per_page=50&q=twain&loaded_only=false

Upload a File

POST /api/upload
Content-Type: multipart/form-data

file=@myfile.txt

Supports .txt, .json (object or array), .csv (each row = one document).
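To make the three formats concrete, here is a sketch of how an uploaded file could be turned into a list of documents. It is an assumption-laden illustration, not the server's code: documents_from_upload is a hypothetical name, the CSV is assumed to have a header row, and a .txt file is assumed to take its title from the filename.

```python
import csv
import io
import json


def documents_from_upload(filename: str, data: bytes) -> list[dict]:
    """Convert raw upload bytes into one or more document dicts."""
    text = data.decode("utf-8", errors="replace")
    suffix = filename.rsplit(".", 1)[-1].lower()
    if suffix == "json":
        parsed = json.loads(text)
        # A JSON array yields many documents, a single object yields one.
        return parsed if isinstance(parsed, list) else [parsed]
    if suffix == "csv":
        # Each row becomes one document keyed by the header names (assumed).
        return [dict(row) for row in csv.DictReader(io.StringIO(text))]
    if suffix == "txt":
        # Assumption: the filename stands in for the title.
        return [{"title": filename, "content": text}]
    raise ValueError(f"unsupported file type: {filename}")
```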

Index JSON Directly

POST /api/index
Content-Type: application/json

{"title": "My Report", "author": "Alice", "content": "Hello world..."}
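The same request can be issued from Python with the standard library. This is a sketch: the helper names are hypothetical, and the server is assumed to reply with a JSON body whose exact shape is not documented here.

```python
import json
from urllib.request import Request, urlopen


def build_index_request(doc: dict, base_url: str = "http://localhost:8000") -> Request:
    """Prepare a POST /api/index request carrying one JSON document."""
    body = json.dumps(doc).encode("utf-8")
    return Request(
        f"{base_url}/api/index",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_document(doc: dict) -> dict:
    """Send the document to a running server (assumes a JSON response)."""
    with urlopen(build_index_request(doc)) as resp:
        return json.load(resp)
```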

Stats

GET /api/stats

{
  "total_docs": 2318,
  "content_loaded": 3,
  "uploaded": 1,
  "cache_exists": true
}

Unload All Content

DELETE /api/index/content

Frees all loaded book content from memory (preserves uploaded documents).

Project Structure

Lucet/
├── lucene_engine.py      # Pure-Python Lucene query engine (embeddable standalone)
├── main.py               # FastAPI application + REST API
├── requirements.txt      # Python dependencies (server)
├── pyproject.toml        # Package metadata for library use / PyPI
├── LICENSE               # MIT
├── .gitignore
├── lucet_ui.sh           # Process management script (start/stop/logs)
├── static/
│   ├── index.html        # Single-page Web UI
│   ├── style.css         # Dark glassmorphism styling
│   └── app.js            # UI logic (vanilla JS, no framework)
├── training_data/        # 15 bootstrap texts included; add more from Project Gutenberg
├── context.txt           # Original product requirements
├── CLAUDE.md             # AI agent instructions
└── README.md             # This file

Corpus

The repo ships with 15 bootstrap texts in training_data/ so the demo works out of the box:

Text	Author	Genre
2 B R 0 2 B	Kurt Vonnegut	Sci-fi
A Dog's Tale	Mark Twain	Humor
Extracts from Adam's Diary	Mark Twain	Humor
A Modest Proposal	Jonathan Swift	Satire
The Gift of the Magi	O. Henry	Short story
The Monkey's Paw	W.W. Jacobs	Horror
His Last Bow	Arthur Conan Doyle	Mystery
The Parenticide Club	Ambrose Bierce	Dark humor
Beyond Lies the Wub	Philip K. Dick	Sci-fi
Crystal Crypt	Philip K. Dick	Sci-fi
The Rime of the Ancient Mariner	Samuel Taylor Coleridge	Poetry
Songs of Innocence and Experience	William Blake	Poetry
Give Me Liberty	Patrick Henry	Historical speech
Declaration of Independence	United States of America	Historical document
The Constitution of the United States	United States of America	Historical document

To add more, place additional .txt files from Project Gutenberg in training_data/ using the naming convention:

Title-Words-Hyphenated-by-Author-Name.txt
# e.g. Moby-Dick-by-Herman-Melville.txt
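A sketch of how title and author could be recovered from a filename that follows this convention. parse_filename is a hypothetical helper, and the fallback for files without a "-by-" separator is an assumption.

```python
from pathlib import Path


def parse_filename(path: str) -> dict:
    """Split 'Title-Words-by-Author-Name.txt' into title and author."""
    stem = Path(path).stem
    # Split on the last '-by-' so a title that itself contains 'by' still works.
    title_part, separator, author_part = stem.rpartition("-by-")
    if not separator:
        # Assumption: with no '-by-', the whole stem is the title.
        title_part, author_part = stem, ""
    return {
        "title": title_part.replace("-", " "),
        "author": author_part.replace("-", " "),
    }
```

For example, parse_filename("training_data/Moby-Dick-by-Herman-Melville.txt") yields the title "Moby Dick" and the author "Herman Melville".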

The full corpus used in development is 2,318 texts (~966 MB) — too large for git. The server and all API endpoints work with any number of documents including zero (upload your own via the UI or /api/index).

Use Lucet as a Library

lucene_engine.py is a self-contained module with no web-framework dependency. Drop it into any Python project and use it as an embedded search index:

from lucene_engine import LuceneEngine

engine = LuceneEngine()

# Index documents — any dict with any fields
engine.add_document({"title": "My Doc", "author": "Alice", "content": "Hello world, this is a test."})
engine.add_document({"title": "Another Doc", "author": "Bob", "content": "Foo bar baz content here."})

# Search with full Lucene query syntax
result = engine.search("content:hello")
print(result.hits)          # list of matching document dicts (content excluded)
print(result.total_docs)    # 2

# Field queries, booleans, wildcards, ranges — all work
result = engine.search('author:Alice AND content:test')
result = engine.search('title:doc* AND NOT author:bob')
result = engine.search('word_count:[1000 TO *]')

# Hits from content queries on content-loaded docs include located passage snippets
result = engine.search("content:hello")
for hit in result.hits:
    for snippet in hit.get("_snippets", []):
        print(f"  ~word {snippet['word']} ({snippet['pct']}%): {snippet['text']}")

LuceneEngine requires only luqum (pip install luqum). No server, no Java, no network.

How the Index Works

Two-Tier Design

The index never pre-loads file content:

Startup (~1 sec)
└── Scan training_data/ filenames
    └── Parse title + author from filename
        └── os.stat() for size
            └── Store metadata dict (no file reads)
                └── Save to index_cache.pkl
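The startup path above can be sketched as follows. This is not the code in lucene_engine.py: the function names, the use of the file stem as document id, and the metadata keys are assumptions, while the 5.5 characters-per-word ratio is the one quoted under Snippet Extraction.

```python
import os
import pickle
from pathlib import Path


def scan_metadata(corpus_dir: str) -> dict:
    """Build {doc_id: metadata} from filenames and os.stat(); no file is opened."""
    index = {}
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        size = os.stat(path).st_size
        index[path.stem] = {  # assumption: the stem serves as the document id
            "filename": path.name,
            "size_bytes": size,
            "word_count": round(size / 5.5),  # estimate, not an exact count
            "content_loaded": False,
        }
    return index


def load_or_build(corpus_dir: str, cache_path: str = "index_cache.pkl") -> dict:
    """Load the pickle cache if it exists, otherwise scan and save it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    index = scan_metadata(corpus_dir)
    with open(cache_path, "wb") as fh:
        pickle.dump(index, fh)
    return index
```

Because the cache is trusted whenever it exists, files added to the corpus directory later are only picked up after the cache is deleted.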

When a user clicks Load:

POST /api/documents/{id}/load
└── Read file from disk (once)
    └── Store text in memory
        └── content_loaded = True
            └── Now searchable with content: queries
                └── Results include located passage snippets (_snippets)

Memory Usage

State	RAM
Server start (2,318 books, metadata only)	~3 MB
After loading one typical novel (~400 KB)	~3.4 MB
After loading 10 novels	~7 MB
After loading ALL books (if you wanted to)	~969 MB
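The table follows from simple arithmetic: the metadata tier plus the size of whatever text is loaded. The sketch below makes that explicit, under the simplifying assumptions that loaded text costs roughly its file size in RAM and that 1 MB is 1,000,000 bytes. Real Python string overhead will differ.

```python
METADATA_MB = 3.0  # metadata tier for 2,318 books, from the table above


def estimated_ram_mb(loaded_sizes_bytes: list[int]) -> float:
    """Rough estimate: metadata tier plus the bytes of every loaded document."""
    return METADATA_MB + sum(loaded_sizes_bytes) / 1_000_000


one_novel = estimated_ram_mb([400_000])
ten_novels = estimated_ram_mb([400_000] * 10)
```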

Search Logic

Query type	Docs searched	Snippets included
`title:`, `author:`, `word_count:`, etc.	All 2,318 (metadata tier)	No
`content:` or bare terms	Only content-loaded documents	Yes — up to 5 located passages
Mixed (`title:moby AND content:whale`)	Metadata filter applied to all; content filter applied to loaded subset	Yes

Snippet Extraction

For every content-loaded document that matches a query, the engine:

Runs re.finditer() for all query terms across the full document text
Sorts and deduplicates match positions (merging hits within half a snippet-window of each other)
Extracts up to 5 passages of ~280 characters each, walking back to natural word/line boundaries
Records pct (position ÷ document length × 100) and word (position ÷ 5.5, same ratio as word-count estimation) for each passage

Project Gutenberg boilerplate headers are detected and skipped when no match is found in the body text.
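The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the engine's implementation: extract_snippets and the constants are hypothetical names, word-boundary handling is simplified to a walk back to the nearest whitespace, and Project Gutenberg header skipping is left out.

```python
import re

WINDOW = 280           # approximate snippet length in characters
MAX_SNIPPETS = 5
CHARS_PER_WORD = 5.5   # same ratio as the word-count estimate


def extract_snippets(text: str, terms: list[str]) -> list[dict]:
    """Locate up to MAX_SNIPPETS passages around matches of any term."""
    terms = [t for t in terms if t]
    if not text or not terms:
        return []
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)

    # finditer yields matches in document order; keep one position per
    # cluster by dropping hits within half a window of the last kept hit.
    positions: list[int] = []
    for match in pattern.finditer(text):
        if not positions or match.start() - positions[-1] > WINDOW // 2:
            positions.append(match.start())

    snippets = []
    for pos in positions[:MAX_SNIPPETS]:
        start = max(0, pos - WINDOW // 2)
        end = min(len(text), start + WINDOW)
        # Walk back to whitespace so the passage does not begin mid-word.
        while start > 0 and not text[start - 1].isspace():
            start -= 1
        passage = " ".join(text[start:end].split())
        snippets.append({
            "text": ("…" if start > 0 else "") + passage + ("…" if end < len(text) else ""),
            "pct": round(pos / len(text) * 100),
            "word": round(pos / CHARS_PER_WORD),
        })
    return snippets
```

Calling extract_snippets(book_text, ["whale"]) returns dicts with the same text, pct, and word keys as the _snippets entries in the REST API response.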

Dependencies

Package	Purpose
`fastapi`	REST API framework
`uvicorn`	ASGI server
`luqum`	Lucene query parser (AST)
`python-multipart`	File upload support

No Java. No Elasticsearch. No ML models. No vector databases.

Adding Your Own Documents

Via the UI: drag and drop any .txt, .json, or .csv into the Upload panel.

Via the API:

# Upload a text file
curl -X POST http://localhost:8000/api/upload \
  -F "file=@my_document.txt"

# Index a JSON object directly
curl -X POST http://localhost:8000/api/index \
  -H "Content-Type: application/json" \
  -d '{"title": "My Doc", "author": "Me", "content": "Full text here..."}'

# Bulk index a JSON array
curl -X POST http://localhost:8000/api/index/bulk \
  -H "Content-Type: application/json" \
  -d '[{"title": "Doc 1", "content": "..."}, {"title": "Doc 2", "content": "..."}]'

Uploaded documents always keep their full content in memory, so they are immediately searchable with any query type.

Resetting the Cache

rm index_cache.pkl
python3 main.py   # rebuilds in ~1-2 sec

License

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Training data is from Project Gutenberg and is in the public domain.
