The ICU version number, as a string.
Returns the name of the default text encoding of the process/application.
Convert a Lua string, text, from one encoding (current_encoding) to another (new_encoding).
If either encoding parameter is passed as nil, the default encoding is used for that end of the conversion.
"Ustrings" are an alternative type of text string data to standard Lua strings. Internally, a ustring is an array of ICU UChars, stored in a "userdata" structure. The icu.ustring submodule provides ways to convert ustrings to and from standard Lua strings, and to do the same things on a ustring that you can on a Lua string.
Ustrings and Lua strings are fundamentally different types. Comparing a ustring to a Lua string with the == equality operator always gives false, and comparing them with the less-than/greater-than operators makes Lua throw an error. Ustrings can only be compared to other ustrings.
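These comparison rules come from standard Lua semantics for userdata rather than from anything ICU-specific. As a minimal sketch that runs without ICU4Lua installed, the snippet below uses io.stdout (a userdata value from the standard library) as a stand-in for a ustring. The stand-in is an assumption made purely for illustration.

```lua
-- io.stdout is a userdata value from the standard library; it stands in
-- for a ustring here so that the example runs without ICU4Lua installed.
local stand_in = io.stdout

-- Equality between values of different types is always false, never an error.
local equal = (stand_in == "Hello")

-- Ordering a userdata against a string raises an error, so wrap it in pcall.
local ok, err = pcall(function() return stand_in < "Hello" end)

print(equal)  -- false
print(ok)     -- false (err holds the comparison error message)
```

In practice this means a Lua string should be decoded to a ustring before it is compared with another ustring.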
Using icu.ustring(...) as a function is the same as calling icu.ustring.decode(...). A recommended convention for this is:
    local U = require 'icu.ustring'
    local ustr = U"Hello World!"
This is the full list of functions provided by icu.ustring. They can also be called as methods on a ustring instance. Functions in bold are intended to be directly equivalent to a function in the standard string library.
Create a new ustring containing the text encoded in Lua string s, using the encoding specified by encoding (as a Lua string, for example 'windows-1252') or UTF-8 by default.
Encode the text in the given ustring ustr into a Lua string, using the encoding specified by encoding or UTF-8 by default.
Note: Calling the standard Lua function tostring on a ustring will internally call encode with no encoding specified, so the result will be UTF-8 encoded. Also, print uses tostring internally, so printing a ustring will write it UTF-8 encoded to the output.
Returns true if ustring a is less than b in a codepoint-wise comparison, false otherwise.
You can also use a < b instead.
Returns true if ustring a is less than or equal to b in a codepoint-wise comparison, false otherwise.
You can also use a <= b instead.
Similar to icu.ustring.decode, except that unescape does its own run-time backslash-escaping, and supports escapes that Lua normally doesn't, such as \x3F and \u0030 for specifying characters using a hexadecimal value. The rest of the text is expected to be encoded in the default codepage.
Returns true if the given value v is a ustring, false otherwise. (This function is necessary as the Lua function type(v) will always return 'userdata' for a ustring.)
The ustring equivalent to string.byte (renamed to be more correct).
The ustring equivalent to string.char.
The ustring equivalent to string.len. Using the length operator # on a ustring will also work.
The ustring equivalent to string.rep.
The ustring equivalent to string.sub.
The ustring equivalent to string.reverse.
The ustring equivalent to string.lower.
The ustring equivalent to string.upper.
The ustring equivalent to string.match.
The character classes used when matching are, by default, still only the ASCII set. For example, %a is still only equivalent to [A-Za-z]. To use the full Unicode-set character classes, there is a new syntax specific to ICU4Lua. Add an exclamation mark ! after %, in this example %!a to match the set of all "letter" characters defined in Unicode.
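To illustrate why the extended classes exist, the sketch below uses only the standard string library, so it runs without ICU4Lua. It assumes the process is in the default "C" locale and that the text is UTF-8. The %!a form appears only in a comment and is not executed.

```lua
-- "héllo" written with decimal byte escapes so the bytes are unambiguous
-- UTF-8 (é is the two bytes 195, 169).
local word = "h\195\169llo"

-- In the default "C" locale, %a only covers ASCII letters, so the match
-- stops at the first byte of é.
local ascii_only = string.match(word, "%a+")
print(ascii_only)  -- h

-- With ICU4Lua, the Unicode-aware version of this pattern would be written
-- "%!a+" (not run here, since it needs the library).
```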
The ustring equivalent to string.find, with the difference to character classes described in the documentation for icu.ustring.match.
The ustring equivalent to string.gmatch, with the difference to character classes described in the documentation for icu.ustring.match.
The ustring equivalent to string.gsub, with the difference to character classes described in the documentation for icu.ustring.match.
The ustring equivalent to string.format.
A constant value holding the zero-length ustring.
If switching to using ustrings seems like a radical change and you'd rather stay in the domain of standard Lua strings, you might find the icu.utf8 submodule useful. It is like a copy of all of the functions in icu.ustring where every Lua string input parameter is decoded as UTF-8 into a ustring, and every ustring return value is encoded into a UTF-8 Lua string.
As in icu.ustring, functions in bold are intended to be directly equivalent to a function in the standard string library.
Create a UTF-8 string using run-time backslash-escaping. This includes support for escapes that Lua normally doesn't allow, such as \x3F and \u0030 for specifying characters using a hexadecimal value. The rest of the text is expected to be encoded in the default codepage.
Returns true if UTF-8 encoded Lua string a is less than b in a codepoint-wise comparison, false otherwise.
You can use this function as the second parameter of the standard Lua function table.sort to sort an array of UTF-8 encoded strings.
(There's no equivalent function for a >= b - use not icu.utf8.lessthan(a,b) instead.)
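A minimal sketch of sorting with a codepoint-wise comparator follows. The module path 'icu.utf8' is an assumption based on the require 'icu.ustring' convention. If the module cannot be loaded, the sketch falls back to plain byte-wise comparison, which in the default "C" locale orders well-formed UTF-8 the same way as codepoint order.

```lua
-- Assumption: the submodule loads as 'icu.utf8'. Fall back to byte-wise
-- comparison when ICU4Lua is not available.
local ok, utf8x = pcall(require, 'icu.utf8')
local lessthan = (ok and type(utf8x) == 'table' and utf8x.lessthan)
    or function(a, b) return a < b end

local words = { "zebra", "\195\169clair", "apple" }  -- "éclair" in UTF-8
table.sort(words, lessthan)
-- words is now: apple, zebra, éclair (é is U+00E9, above every ASCII letter)

-- a >= b expressed through lessthan, as described above
local function greaterorequal(a, b)
  return not lessthan(a, b)
end
print(greaterorequal("b", "a"))  -- true
```

Note that table.sort expects a strict "less than" comparator, so lessthan is the one to pass here, not a "less than or equal" function.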
Returns true if UTF-8 string a is less than or equal to b in a codepoint-wise comparison, false otherwise.
(There's no equivalent function for a > b - use not icu.utf8.lessorequal(a,b) instead.)
The UTF-8 equivalent to string.byte (renamed to be more correct).
The UTF-8 equivalent to string.char.
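The UTF-8 equivalent to string.len. Note that the length operator # on a UTF-8 encoded Lua string returns its length in bytes, not characters, so use this function when a character count is needed.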
The UTF-8 equivalent to string.rep.
The UTF-8 equivalent to string.sub.
The UTF-8 equivalent to string.reverse.
The UTF-8 equivalent to string.lower.
The UTF-8 equivalent to string.upper.
The UTF-8 equivalent to string.match, with the difference to character classes described in the documentation for icu.ustring.match.
The UTF-8 equivalent to string.find, with the difference to character classes described in the documentation for icu.ustring.match.
The UTF-8 equivalent to string.gmatch, with the difference to character classes described in the documentation for icu.ustring.match.
The UTF-8 equivalent to string.gsub, with the difference to character classes described in the documentation for icu.ustring.match.
The UTF-8 equivalent to string.format.
A string containing the (so-called) UTF-8 byte order mark.
Open a collator for the given locale, which must be given as a Lua string, not a ustring. If the collator could not be opened, returns nil and an error message.
Note: You can also call icu.collator(...) instead of icu.collator.open(...).
Either sets the strength of the collator, or returns the current strength setting if no new value is given. (If a new value is set, the collator itself is returned.)
Valid strength values are:
Returns true if ustring a is equal to ustring b according to the collator col, false otherwise.
Returns true if ustring a is less than ustring b according to the collator col, false otherwise.
Returns true if ustring a is less than or equal to ustring b according to the collator col, false otherwise.
For details on the pattern syntax supported by ICU's regular expressions engine, see the ICU User Guide page at http://userguide.icu-project.org/strings/regexp/.
There are two "levels" of functions available in icu.regex:
A high-level set of "full" operations that people who have used regular expressions are likely to be familiar with - find the first match if any (match), iterate through all matches (gmatch), replace all matches (replace) and split on each match (split).
For these functions, the regex object represents only a compiled pattern.
A low-level set of "atomic" operations. When using these functions, the regex object does not only represent a compiled pattern - it also encapsulates the state of the matching engine:
The low-level matching operations themselves (matches, lookingat, find) only test for a single match at a time, and only return true or false to say whether they succeeded.
These functions are marked as bold in the function list below:
Creates a new compiled regex pattern object. pattern can be a ustring or a Lua string. If a Lua string, it is expected to be encoded in the default codepage.
flags can be one of two things:
Note: You can also call icu.regex(...) as an alternative to icu.regex.compile(...)
Returns the first match, or false if no match is found, optionally starting the search at the given start_index (one-based).
For a successful result the returned value is a match object that contains these named fields:
The match object will also have an array component 1 to n, where n is the number of captures in the pattern. These entries will each be either the boolean value false (if the capture is an optional one and was not used) or an object with the same named fields as the parent match object as described above. The 0th element of the match object will always be the match object itself.
Returns an iterator over all of the matches found for a compiled regular expression, designed to be used in a for loop, e.g.:
    for match in icu.regex.gmatch(myRegex, inputText) do
        -- the "match" object is the same as described in the documentation for icu.regex.match
    end
Find all places where the given regular expression matches in text, replace them with a new value, and return the result.
text must be a ustring, and replacement must be one of the following:
Returns an array of the substrings found by splitting ustring text using the given regex, with an optional maximum number of splits.
Returns true if the value v is a regex object, false otherwise.
Returns a copy of s (which must be either a ustring or a Lua string) in which every "special" regex character (such as ., ^ and $) is escaped with a preceding backslash.
Returns two values - the original pattern of the regex as a ustring, and the flags that it was compiled with, as a number.
Create a clone of regex. It only clones the compiled pattern, and not any stateful matching information like the target text. It is like a faster, more efficient alternative to icu.regex.compile(icu.regex.decompile(regex)).
Get or set the target text. If a new_value is passed (which must be a ustring) it is set, and the function returns the regex object. If there is no new_value passed, the current text is returned (which might be nil if it was never set).
Get or set the region bounds on the target text. This is a pair of indices that describe where matching should start and stop in an atomic matching operation (unless they are overridden by supplying a start index).
The target text must have been specified using icu.regex.text before calling this function. If new_start and new_stop are specified, they are set as the new region. If not, the current bounds are returned. The indices are one-based and inclusive.
Get or set whether to use transparent bounds. If enable is specified (it must be a boolean value), this is used to set the property, and the regex object is returned. If not, the current setting is returned as a boolean value.
Transparent bounds alter the behaviour of "lookahead" and "lookbehind" captures when the region bounds have been set (see icu.regex.bounds). If the bounds are "transparent", this means that these captures can "look" past the bounds.
By default, transparent bounds are disabled.
Get or set whether to use anchoring bounds. If enable is specified (it must be a boolean value), this is used to set the property, and the regex object is returned. If not, the current setting is returned as a boolean value.
Anchoring bounds alter the behaviour of the ^ and $ anchors. If enabled, the anchors will match at the start and end of the region bounds, wherever they are in the target text. If disabled, they will only match at the start and end of the entire text (or each line, if the relevant flag has been set).
By default, anchoring bounds are enabled.
Atomic match operation that attempts to match the entire matchable region of the target text (which must have been previously set using icu.regex.text) against the whole regex pattern from beginning to end (regardless of anchors in the pattern).
This function will only return true or false to indicate whether the match was successful. Use icu.regex.groupcount, icu.regex.value and icu.regex.range to extract information about the "current" successful match.
If start_index is specified then the matchable region is from this index (one-based) to the end of the whole text, otherwise the matching region will be the one specified by icu.regex.bounds (by default the entire text).
Atomic match operation that attempts to match from the first character of the entire matchable region of the target text (which must have been previously set using icu.regex.text) against the whole regex pattern from beginning to end.
This function will only return true or false to indicate whether the match was successful. Use icu.regex.groupcount, icu.regex.value and icu.regex.range to extract information about the "current" successful match.
If start_index is specified then the matchable region is from this index (one-based) to the end of the whole text, otherwise the matching region will be the one specified by icu.regex.bounds (by default the entire text).
Atomic match operation that attempts to find the next match of the regex pattern within the matchable region of the target text (which must have been previously set using icu.regex.text). If find has already been called at least once on this regex and successfully found a match, the next call starts searching after the previous match.
This function will only return true or false to indicate whether the match was successful. Use icu.regex.groupcount, icu.regex.value and icu.regex.range to extract information about the "current" successful match.
If start_index is specified then the matchable region is from this index (one-based) to the end of the whole text, otherwise the matching region will be the one specified by icu.regex.bounds (by default the entire text).
Open a StringPrep profile from a data file. path and filename must be Lua strings. Returns the loaded profile object, or nil and an error message on failure.
Open a StringPrep profile from a predefined profile type. type can be one of:
Prepare the given ustring ustr against the StringPrep profile, profile. Returns a new ustring, or nil and an error message on failure.
Implementation of the ToASCII operation as defined in RFC 3490. label is a ustring that holds a single label (e.g. "www", "lua" or "org") rather than a full domain name. Returns either a new ustring or nil and an error.
options, if specified, should be some combination of these flags:
Combine them with the + operator. The default is neither - i.e. return with an error on unassigned codepoints, and do not check for STD3 violations.
Implementation of the ToUnicode operation as defined in RFC 3490. label is a ustring that holds a single label (e.g. "www", "lua" or "org") rather than a full domain name.
See icu.idna.toascii for details on the flags for options.
Convenience function. An equivalent to icu.idna.toascii that operates on a full domain_name (e.g. "www.lua.org") instead of individual labels.
See icu.idna.toascii for details on the flags for options.
Convenience function. An equivalent to icu.idna.tounicode that operates on a full domain_name (e.g. "www.lua.org") instead of individual labels.
See icu.idna.toascii for details on the flags for options.
Return true if ustrings a and b are equivalent as IDN strings, false otherwise.
Return true if ustring a is less than ustring b as IDN strings, false otherwise.
Return true if ustring a is less than or equal to ustring b as IDN strings, false otherwise.
Open a file for reading/writing according to the mode parameter, which has the same meaning as the mode parameter of io.open() except that the "b" binary mode qualifier has no meaning and should not be used. The encoding and locale to use will be the current defaults if not supplied as parameters.
Returns a new ufile object. (The rest of the icu.ufile functions can also be called as methods on this object.)
Set the encoding to use if new_encoding is specified, otherwise return the current encoding.
Set the locale to use if new_locale is specified, otherwise return the current locale.
The equivalent to file:read(...) on a standard Lua file, with the same options available:
The equivalent to file:write(...) on a standard Lua file, except that the values to write must all be ustrings.
Create a for-loop iterator over lines of text from the given ufile, equivalent to file:lines().
Unlike standard Lua files which allow full seeking operations, ufiles can only have the cursor position returned to the beginning of the file by calling this function.
Save any written data to the ufile.
Close the ufile. This will also occur automatically if the ufile object is garbage collected.
Return a normalized version of ustring ustr. If specified, mode must be one of the following values, passed as a Lua string:
Perform a quick check to determine whether ustring ustr is already normalized in the given mode (or the same default as icu.normalizer.normalize). This function does not return a boolean, because there are three possible results: the Lua strings 'yes', 'no' and 'maybe' (when the answer cannot be determined by a quick check).
Perform a full normalization check on ustr, and return a definite true or false.
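One way to combine the quick check with the full check is sketched below. The helper takes both checks as parameters instead of assuming their exact names. quickcheck and fullcheck are hypothetical parameter names, and the stubs at the end exist only to make the sketch runnable without ICU4Lua.

```lua
-- quickcheck(ustr, mode) is expected to return 'yes', 'no' or 'maybe';
-- fullcheck(ustr, mode) is expected to return true or false.
local function is_normalized(quickcheck, fullcheck, ustr, mode)
  local quick = quickcheck(ustr, mode)
  if quick == 'yes' then return true end
  if quick == 'no' then return false end
  -- 'maybe': only now pay for the full check
  return fullcheck(ustr, mode)
end

-- Stub checks standing in for the real functions (illustration only).
local calls = 0
local function stub_quick(s) return s == '' and 'yes' or 'maybe' end
local function stub_full(s) calls = calls + 1; return true end

local first = is_normalized(stub_quick, stub_full, '')       -- full check skipped
local second = is_normalized(stub_quick, stub_full, 'text')  -- full check runs once
print(first, second, calls)
```

The point of the design is that the full check only runs when the quick check answers 'maybe'.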
Where you have two ustrings that are already known to be normalized, and you want to concatenate them, the result may not be normalized. Re-normalizing the entire new string can be inefficient, especially compared to using this function which will normalize only where the two strings join.
When you want to compare two ustrings that have not been normalized as if they had been, these functions can be more efficient than normalizing both strings and then comparing them.