kbumsik/blogsearch

BlogSearch

BlogSearch is a blogging tool that enables a search engine without any external services.

This is like DocSearch but for blogs.

More technically, BlogSearch is a pure client-side, full-text search engine for static websites, powered by SQLite compiled to WebAssembly.

Features:
  • Purely client-side search

  • No server to maintain. No service cost.

  • Easy. It’s built with blogs and static websites in mind.

  • Supports popular blog frameworks:

Sister project:
  • sqlite-wasm: Run SQLite on the web, using WebAssembly. This project is made for blogsearch’s needs.

Concepts

Workflow overview

The workflow consists of two steps: 1. You build an index file .db.wasm and copy it to the public directory. 2. The engine in the webpage reads the index file and enables the search.

1. Build an index file

The index file .db.wasm is a small database file that contains the contents of your website. You can use easy-to-use index building tools:

Then copy the generated .db.wasm file to the public directory of your website (the directory where index.html is located).

2. Enable the search

Your webpage should load the blogsearch engine. There is only one engine available:

Load the engine using a <script> tag or from a JavaScript file. Once the engine fetches the .db.wasm file correctly, you have a fully working searchable webpage!

ℹ️
Throughout the project, the terms "index" and "database" are used interchangeably; in most cases they both refer to the same SQLite .db.wasm file.

1. Building a search index file

Installing an index building tool

What’s in the index file

Users should configure the index building tool to collect the values of the fields so that the search engine works properly.

The index building tool should collect the following default fields for each post:

fields
  • title: The title of the post.

  • body: The content of the post.

  • url: The URL link to the post.

  • categories: A comma-separated (,) list of categories that the post belongs to.

  • tags: A comma-separated (,) list of tags that the post has.
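To make the default fields concrete, one indexed post can be pictured as the record below. This is only a sketch: the sample values and the `splitList` helper are hypothetical and not part of BlogSearch. Only the field names and the comma-separated format come from the list above.

```javascript
// Hypothetical example of the fields collected for one post.
// All values are made up for illustration.
const post = {
  title: 'Hello, WebAssembly',
  body: 'A long post body goes here...',
  url: '/2020/01/hello-wasm/',
  categories: 'programming,web', // comma-separated list
  tags: 'wasm,sqlite',           // comma-separated list
};

// Hypothetical helper: turn a comma-separated field into an array,
// trimming whitespace and dropping empty entries.
function splitList(value) {
  return value
    .split(',')
    .map((item) => item.trim())
    .filter((item) => item.length > 0);
}

console.log(splitList(post.categories)); // [ 'programming', 'web' ]
```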

Users can configure every field using the following properties:

Table 1. Common options for the field
Example Result

disabled: If set to true, the field is disabled completely.

{
  ...other field options...
  categories: {
+    disabled: true,
  },
}

hasContent: If set to false, the index building tool still indexes the value of the field but does not store it. This reduces the size of the generated index file, which is especially useful when the contents of the body field are large.

In the following example, the size of the index file .db.wasm is decreased.

{
  ...other field options...
  body: {
+    hasContent: false,
  },
}

indexed: If set to false, indexing is disabled for the field. Its value will still appear in the search results. This is especially useful for the url field, whose value is not meaningful for search.

{
  ...other field options...
  url: {
+    indexed: false,
  },
}
ℹ️
Your index building tool may have tool-specific options for the field (e.g. the parser option for blogsearch-crawler). See the documentation of your index building tool for details.
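The three common options can be combined in one configuration. The sketch below puts them side by side and spells out their effect. Where the `fields` object goes depends on your index building tool. The `describeField` helper is hypothetical, and the defaults it assumes for omitted options (disabled: false, hasContent: true, indexed: true) are inferred from the descriptions above rather than confirmed.

```javascript
// Sketch of a combined field configuration using the common options.
const fields = {
  title: {},
  body: { hasContent: false },    // searchable, but the text itself is not stored
  url: { indexed: false },        // stored for display, but not searchable
  categories: { disabled: true }, // neither stored nor searchable
  tags: {},
};

// Hypothetical helper that summarizes what an option set means.
// The default values are an assumption, not documented behavior.
function describeField(options = {}) {
  const { disabled = false, hasContent = true, indexed = true } = options;
  if (disabled) {
    return { searchable: false, stored: false };
  }
  return { searchable: indexed, stored: hasContent };
}

for (const [name, options] of Object.entries(fields)) {
  console.log(name, describeField(options));
}
```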

2. Enabling the search engine on the web

It’s as simple as:
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/blogsearch@0.0.3/dist/basic.css" />

<script src="https://cdn.jsdelivr.net/npm/blogsearch@0.0.3/dist/blogsearch.umd.js"></script>
<script src="https://cdn.jsdelivr.net/npm/blogsearch@0.0.3/dist/worker.umd.js"></script>

<input id="blogsearch_input_element" type="search" placeholder="Search Text" class="form-control" />

<script>
  blogsearch({
    dbPath: 'your_index_file.db.wasm',
    inputSelector: '#blogsearch_input_element',
  });
</script>

For further details and options, see the blogsearch subdirectory.

QnA

Which search engine technology is used in this project?

The search engine is essentially SQLite with the FTS5 extension, compiled to WebAssembly. FTS5 provides a built-in BM25 ranking function, which is used for the search functionality. Because SQLite is a highly portable database engine, you can open any SQLite database file on the web too! Thanks to SQLite, plugins for BlogSearch can be written in different programming languages with just a few SQL queries.
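To give a feel for what "a few SQL queries" might look like, here is a minimal sketch of an FTS5 search ranked by BM25. The table name `blogsearch` and the selected columns are assumptions made for illustration; the real schema of the index file may differ. The `toMatchExpression` helper is also hypothetical.

```javascript
// Hypothetical table name; the actual schema of the index file may differ.
const TABLE = 'blogsearch';

// Quote each whitespace-separated term as an FTS5 string so that characters
// such as - or : in user input are not parsed as query syntax.
// Inside an FTS5 string, a double quote is escaped by doubling it.
function toMatchExpression(userInput) {
  return userInput
    .trim()
    .split(/\s+/)
    .filter((term) => term.length > 0)
    .map((term) => '"' + term.replace(/"/g, '""') + '"')
    .join(' ');
}

// FTS5's bm25() returns smaller values for better matches,
// so ascending order puts the best match first.
const SEARCH_SQL =
  `SELECT title, url FROM ${TABLE} ` +
  `WHERE ${TABLE} MATCH ? ` +
  `ORDER BY bm25(${TABLE}) LIMIT ?`;

console.log(SEARCH_SQL);
console.log(toMatchExpression('static site search'));
```

The two `?` placeholders would be bound to the match expression and the maximum number of results by whichever SQLite binding runs the query. An empty match expression is not a valid FTS5 query, so a caller would skip the query when the input is blank.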

The index file uses the .db.wasm extension rather than plain .db for a practical reason. I first tried naming it .db, but there was a big problem: web servers did not gzip-compress the index file. Popular blog hosting services (especially GitHub Pages) usually serve a .db file as application/octet-stream and do not compress it. By pretending that the file is a WebAssembly binary (.wasm), the server recognizes it as application/wasm and ships it compressed.

Compression is important because it significantly reduces the file size. I have seen the size reduced up to 1/3.
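If you want to check how your own host delivers the index file, a sketch like the following can be run from the browser console on your site. The helper names `describeDelivery` and `checkIndexFile` are hypothetical and not part of BlogSearch, and whether the Content-Encoding header is visible to scripts can depend on the browser and on cross-origin rules.

```javascript
// Hypothetical helper: summarize how a response was delivered.
// `headers` is anything with a get(name) method, such as fetch's Headers.
function describeDelivery(headers) {
  const type = (headers.get('content-type') || '')
    .split(';')[0]
    .trim()
    .toLowerCase();
  const encoding = (headers.get('content-encoding') || '').trim().toLowerCase();
  return {
    type,
    encoding,
    servedAsWasm: type === 'application/wasm',
    compressed: encoding !== '' && encoding !== 'identity',
  };
}

// Hypothetical helper: fetch the index file and report its delivery headers.
async function checkIndexFile(url) {
  const response = await fetch(url);
  return describeDelivery(response.headers);
}

// Usage (in a browser console on your site):
// checkIndexFile('your_index_file.db.wasm').then(console.log);
```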

Building from source

Workflow

To avoid the “but it works on my machine” problem, it is strongly recommended to use Docker for build tasks.

Although this repository is a monorepo where each subproject has its own build scripts, you can easily run tasks from the root directory.

💡
If you want to build only a specific subproject, go to its subdirectory and run yarn commands.

The required tools are the following:

Although this is a JS project, a Makefile is used because it is much more configurable and supports parallel builds.

For the specific NodeJS versions used in the project, please look at the Dockerfile.

Prepare

# Or yarn install, without docker
make install-in-docker

Build libraries

# Or make lib, without docker
make lib-in-docker

Run a demo server

make start-in-docker

# You can access the demo page via 0.0.0.0:9000

Testing

# Or make test, without docker
make test-in-docker

# Run it in parallel
make test-in-docker -j4 --output-sync=target

Rebuild example index files

⚠️
This will take a lot of time! (~30 minutes)
# It is highly recommended to use docker here
make examples-in-docker && make demo-in-docker

Build everything

⚠️
This will take a lot of time! (~30 minutes)
# Or make all, without docker
make all-in-docker

# Or

# Parallel builds. This cuts the build time almost in half on my machine.
make all-in-docker -j4 --output-sync=target

Rebuild everything

make clean

# Then run any commands above

Get into a bash session in the container

make bash-in-docker

Credits & License

This project is inspired by DocSearch and includes a TypeScript reimplementation of it.

Other than that, the project is released under the MIT License. See LICENSE.