Skip to content

Stepets/utf8.lua

Repository files navigation

utf8.lua

pure-lua 5.3 regex library for Lua 5.3, Lua 5.1, LuaJIT

This library provides simple way to add UTF-8 support into your application.

Example:

local utf8 = require('.utf8'):init()
for k,v in pairs(utf8) do
  string[k] = v
end

local str = "пыщпыщ ололоо я водитель нло"
print(str:find("(.л.+)н"))
-- 8	26	ололоо я водитель

print(str:gsub("ло+", "보라"))
-- пыщпыщ о보라보라 я водитель н보라	3

print(str:match("^п[лопыщ ]*я"))
-- пыщпыщ ололоо я

Usage:

This library can be used as drop-in replacement for vanilla string library. It exports all vanilla functions under raw sub-object.

local utf8 = require('.utf8'):init()
local str = "пыщпыщ ололоо я водитель нло"
utf8.gsub(str, "ло+", "보라")
-- пыщпыщ о보라보라 я водитель н보라	3
utf8.raw.gsub(str, "ло+", "보라")
-- пыщпыщ о보라보라о я водитель н보라	3

It also provides all functions from Lua 5.3 UTF-8 module except utf8.len (s [, i [, j]]). If you need to validate your strings use utf8.validate(str, byte_pos) or iterate over with utf8.validator.

Please note that library assumes regexes are valid UTF-8 strings, if you need to manipulate individual bytes use vanilla functions under utf8.raw.

Installation:

Download repository to your project folder. (no rockspecs yet)

Examples assume library placed under utf8 subfolder not utf8.lua.

As of Lua 5.3 default utf8 module has precedence over user-provided. In this case you can specify full module path (.utf8).

Configuration:

Library is highly modular. You can provide your implementation for almost any function used. Library already has several back-ends:

Probably most interesting customizations are utf8.config.loadstring and utf8.config.cache if you want to precompile your regexes.

local utf8 = require('.utf8')
utf8.config = {
  cache = my_smart_cache,
}
utf8:init()

For lower and upper functions to work in environments where ffi cannot be used, you can specify substitution tables (data example)

local utf8 = require('.utf8')
utf8.config = {
  conversion = {
    uc_lc = utf8_uc_lc,
    lc_uc = utf8_lc_uc
  },
}
utf8:init()

Customization is done before initialization. If you want, you can change configuration after init, it might work for everything but modules. All of them should be reloaded.

Issue reporting:

Please provide example script that causes error together with environment description and debug output. Debug output can be obtained like:

local utf8 = require('.utf8')
utf8.config = {
  debug = utf8:require("util").debug
}
utf8:init()
-- your code

Default logger used is io.write and can be changed by specifying logger = my_logger in configuration