BlogMe3 - an extensible language for generating HTML


(2013feb10: this is a mess! The rewrite of blogme3 into something much cleaner (BlogMe4) is essentially ready, but I have not yet checked if all my 250+ .blogme files would work with it...)

Quick index:

1. Basic concepts
1.1. Parsing (and pos and subj)
1.2. Evaluation
1.3. Argument parsers
1.4. The core and the angg files
1.5. Invoking blogme3.lua
2. Ancestors
2.1. Forth
2.2. Lisp
2.3. Tcl
2.4. TH
3. The source files
3.1. brackets.lua: the parsers (_A and _B)
3.2. definers.lua: def and DEF (_AA)
3.3. escripts.lua: htmlizelines (_E)
3.4. elisp.lua: makesexphtml (_EHELP, _EBASE, etc)
3.5. dooptions (_O)
4. Introduction
4.1. How the language works
4.2. How []-expressions are evaluated
4.3. Defining new words in Lua with def
5. The internals of blogme2.lua:
5.1. The main tables used by the program
5.2. Blogme words (the tables _W and _A)
5.3. The blogme parsers (the table _P)
5.4. Files
5.5. Help needed
5.6. Etc

1. Basic concepts

1.1. Parsing (and pos and subj)

Some of the most fundamental functions in the code of BlogMe are "parsers". They all try to parse a pattern in the "subject string" stored in the global variable "subj", starting at the position stored in the global variable "pos" (the names "subj" and "pos" come from Icon).

(find-iconbookpage (+ 22 37))
(find-iconbookpage (+ 22 44))

On success these parsers advance "pos" and return some non-nil value; on failure they keep "pos" unchanged, and return nil.

Let's fix some terminology. Consider the grammar below; we will refer to these "patterns" by the names of the "non-terminal symbols", at the left of the "::=" signs.

spacechar   ::= ' ' | '\t' | '\n'
normalchar  ::= any char except '[', ']'
wordchar    ::= any char except ' ', '\t', '\n', '[', ']'
optspaces   ::= spacechar*
spaces      ::= spacechar+
wordchars   ::= wordchar+
normalchars ::= normalchar+
block       ::= '[' (normalchars | block)* ']'
bigword     ::= (wordchars | block)+
rest        ::= (normalchars | block)*

All the "*"s and "+"s above are greedy.

Parsers for all these symbols except "bigword" and "rest" can be implemented using just {lua patterns}; the code is {here}.
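As a rough illustration of the convention above, here is a minimal sketch of two such parsers built on Lua patterns. Only "subj" and "pos" come from the text; the helper "parsepat" and the names "parsespaces" and "parsewordchars" are assumptions made for this sketch, and the real code in brackets.lua may be organized differently.

```lua
-- Sketch only. "subj" and "pos" are the globals described above;
-- parsepat, parsespaces and parsewordchars are assumed names.
subj, pos = "", 1

-- Match "pat" at position pos. The pattern must be anchored with "^" and
-- end with "()" so that the position after the match is captured too.
-- On success: advance pos and return the matched text.
-- On failure: leave pos unchanged and return nil.
function parsepat(pat)
  local text, newpos = string.match(subj, pat, pos)
  if text then
    pos = newpos
    return text
  end
  return nil
end

function parsespaces()    return parsepat("^([ \t\n]+)()") end
function parsewordchars() return parsepat("^([^ \t\n%[%]]+)()") end

subj, pos = "  foo[bar]", 1
local text
text = parsespaces();    print(text, pos)  -- the two spaces; pos is now 3
text = parsewordchars(); print(text, pos)  -- "foo"; pos is now 6
text = parsewordchars(); print(text, pos)  -- fails at "[": nil, pos still 6
```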

As a curiosity, note that "parseblock" could be implemented with Lua's "balanced pattern" operator, as "%b[]" - but instead of doing that we use a table that tells, for each "[" or "]" in "subj", where the corresponding "]" or "[" is. The code is {here} and {here}.

(find-blogme3file "brackets.lua" "setsubjto =")
  (find-blogme3file "brackets.lua" "bracketstructure =")
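A table like the one described above can be built in a single pass with a stack. The sketch below is only an illustration of the idea: the function name "bracketpairs" is hypothetical, and the actual "bracketstructure" code in brackets.lua may store its results differently.

```lua
-- Sketch only: bracketpairs is a hypothetical name.
-- Returns a table "match" such that, for each "[" at position i,
-- match[i] is the position of its "]", and vice versa.
function bracketpairs(s)
  local match, stack = {}, {}
  for i = 1, #s do
    local c = string.sub(s, i, i)
    if c == "[" then
      stack[#stack + 1] = i              -- remember the open bracket
    elseif c == "]" then
      local open = table.remove(stack)   -- most recent unmatched "["
      if not open then
        error("Unmatched ']' at position " .. i)
      end
      match[open], match[i] = i, open
    end
  end
  if #stack > 0 then
    error("Unmatched '[' at position " .. stack[#stack])
  end
  return match
end

local m = bracketpairs("a[b[c]d]e")
print(m[2], m[8])   -- 8  2   (the outer pair)
print(m[4], m[6])   -- 6  4   (the inner pair)
```

With such a table, skipping over a block starting at position i is just a jump to match[i] + 1.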

1.2. Evaluation

Parsing, of course, is not enough - what really matters is that certain "expressions" can be "evaluated". For example, if we evaluate the following string as a "vblock" ("vblock" stands for the "value of a block"),

[HREF http://foo/bar/ foo[+ 2 3]bar ploc]

we get:

<a href="http://foo/bar/">foo5bar ploc</a>

Let's follow in detail what is happening here. To evaluate a block, we first parse its "head word" - HREF, in this case - and then we parse the "argument list" for that word; how to parse the argument list depends on the word, as in Lisp and Forth (we will see the details very soon). Then we call the "blogme code" associated to HREF with those arguments. In the case of HREF its blogme code is stored in a global Lua function with the same name, and so this code is called as:

HREF("http://foo/bar/", "foo5bar ploc")

which returns:

<a href="http://foo/bar/">foo5bar ploc</a>

Note that to generate the second argument, "foo5bar ploc", we had to evaluate the block "[+ 2 3]"; the result was the result of calling the blogme code for "+" with arguments "2" and "3".

1.3. Argument parsers

Just as the blogme code for "HREF" was stored in Lua's table of globals (_G), the code for the argument parser for "HREF" was stored in _A["HREF"]. In the case of HREF, the argument parser function is "readvvrest", whose definition is:

readvvrest = function () return readvword(), readvrest() end

The "readers" are like the "parsers", but they run "parsespaces()" before parsing, when they "fail" they do not move pos back to before the spaces, and they return the empty string instead of nil.

Now here are the exact rules for evaluating a block (in pseudocode, and without any error-checking):

word = getvword()
(_B[word] or _G[word])(_A[word]())

Note that the table _B is checked before _G - this is to allow us to have blogme words with the same names as Lua functions, but whose blogme code is different from the Lua function with the same name.

When we evaluate a vword or a vrest we may have to concatenate several partial results - some from parsing "words" or "normalchars", some from parsing "blocks" - to form the final result. The convention (the code is {here}) is that when we only have one partial result coming from a block, then it is not transformed in any way - this lets us have blocks that return, say, Lua tables. For example, with the right (obvious) definitions for "print", "expr:", and "+", this

[print [expr: {2, 3}] [+ 22 33]]

would print the same output as:

print({2, 3}, 55)
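To make the rules of sections 1.2 and 1.3 concrete, here is a toy evaluator that handles the HREF example. It is a sketch under several assumptions, not the code of blogme3: "subj", "pos", "_A", "_B", "readvword", "readvrest" and "readvvrest" are names from the text, while "parsepat", "readpieces" and "readvblock" are invented here, the head word is read as plain wordchars, and there is no error checking.

```lua
-- Toy evaluator, for illustration only.
subj, pos = "", 1
_A, _B = {}, {}

local function parsepat(pat)
  local text, newpos = string.match(subj, pat, pos)
  if text then pos = newpos; return text end
end

-- Read a run of pieces: text matching textpat, or the values of blocks.
local function readpieces(textpat)
  parsepat("^([ \t\n]+)()")                  -- readers skip spaces first
  local pieces = {}
  while true do
    local text = parsepat(textpat)
    if text then
      pieces[#pieces + 1] = text
    elseif string.sub(subj, pos, pos) == "[" then
      pieces[#pieces + 1] = readvblock()
    else
      break
    end
  end
  if #pieces == 1 then return pieces[1] end  -- a lone result is kept as is
  for i = 1, #pieces do pieces[i] = tostring(pieces[i]) end
  return table.concat(pieces)                -- "" when nothing was read
end

function readvword()  return readpieces("^([^ \t\n%[%]]+)()") end
function readvrest()  return readpieces("^([^%[%]]+)()") end
function readvvrest() return readvword(), readvrest() end

-- Evaluate the block starting at pos (which must point at a "[").
function readvblock()
  pos = pos + 1                                  -- skip the "["
  local word = parsepat("^([^ \t\n%[%]]+)()")    -- the head word
  local result = (_B[word] or _G[word])(_A[word]())
  parsepat("^([ \t\n]+)()")
  pos = pos + 1                                  -- skip the "]"
  return result
end

HREF = function (url, text)
  return '<a href="' .. url .. '">' .. text .. '</a>'
end
_A["HREF"] = readvvrest
_B["+"] = function (a, b) return tonumber(a) + tonumber(b) end
_A["+"] = readvvrest

subj, pos = "[HREF http://foo/bar/ foo[+ 2 3]bar ploc]", 1
print(readvblock())   -- <a href="http://foo/bar/">foo5bar ploc</a>
```

Note how "readpieces" returns a lone partial result untouched: evaluating "[+ 22 33]" on its own yields the number 55, not the string "55".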

1.4. The core and the angg files

1.5. Invoking blogme3.lua

If we are at /tmp, and there's a file /tmp/foo.blogme whose contents are

[lua: print("Hello!")
      PP(arg)
]
[htmlize [J Foo Bar]
  [P A paragraph]
]
Blah

and we invoke blogme3.lua with arguments "-o foo.html -i foo.blogme", we will see something like this,

/tmp# lua51 ~/blogme3/blogme3.lua -o foo.html -i foo.blogme
Hello!
 {-1="lua51", 0="/home/edrx/blogme3/blogme3.lua",
   1="-o", 2="foo.html", 3="-i", 4="foo.blogme"}
/tmp# cat foo.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Foo Bar</title>
</head>
<body>
<h3>Foo Bar</h3>
<p>A paragraph</p>
</body>
</html>
/tmp#

Let's understand what happened.

The first thing that blogme3.lua does is to extract from arg[0] the directory where blogme3.lua resides, and add it to the path (the code is here); then it loads some files, with

-- (find-blogme3file "blogme3.lua")
require "brackets"    -- (find-blogme3 "brackets.lua")
require "definers"    -- (find-blogme3 "definers.lua")
require "charset"     -- (find-blogme3 "charset.lua")
require "anggdefs"    -- (find-blogme3 "anggdefs.lua")
require "elisp"       -- (find-blogme3 "elisp.lua")
require "options"     -- (find-blogme3 "options.lua")

then it processes the command-line arguments.
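The first step, finding the script's directory and adding it to the path, could look roughly like the sketch below. The function names "scriptdir" and "addscriptdirtopath" are hypothetical, and the sketch assumes "/" as the directory separator; the actual code in blogme3.lua may differ.

```lua
-- Sketch only: scriptdir and addscriptdirtopath are hypothetical names.

-- Return the directory part of a script path such as arg[0],
-- or "." when the path has no directory component.
function scriptdir(scriptpath)
  return string.match(scriptpath, "^(.*)/[^/]*$") or "."
end

-- Prepend the script's directory to package.path, so that
-- require "brackets" finds brackets.lua next to the main script.
function addscriptdirtopath(scriptpath)
  local dir = scriptdir(scriptpath)
  package.path = dir .. "/?.lua;" .. package.path
  return dir
end

print(scriptdir("/home/edrx/blogme3/blogme3.lua"))  -- /home/edrx/blogme3
print(scriptdir("blogme3.lua"))                     -- .
```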

For each recognized command-line argument there is an entry in the table _O - defined in options.lua - that describes how to process that option; for example, for "-i" we have this:

-- (find-blogme3 "")
dooptions_i = function () ... end
_O["i"] = dooptions_i

The loop that processes the options is this simple recursion, in blogme3.lua:

The argument following "-o" is the name of the output file; as we shall see (in sec ___), some setup actions can only be performed after "-o" - for example, all definitions that depend on the base directory for relative links.

The option "-i" treats the argument following it as the name of an input file to be evaluated "in the normal way": the contents of foo.blogme are evaluated as a "vrest", and the result of this is discarded (that's why the "Blah" at the end of foo.blogme disappeared!), but the contents of the global variable _output are written to the output file, whose name is the global variable _. If either outfile or the outcontents were empty we would get an error - but htmlize treated its first argument ("Foo Bar") as the title of the html page ("[J Foo Bar]" -> "Foo Bar"), used the "rest" of its arguments ("[P A paragraph]") as the body of the html, wrapped that within html headers, and stored the result in outcontents.
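One way to process options by recursion through the table _O is sketched below. This is a guess at the general shape, not the code from blogme3.lua: the handler signatures, the stripping of the leading "-" (to match keys like _O["i"]), and the globals "outputfile" and "inputfiles" are all assumptions of this sketch, and the handlers only record their arguments instead of doing any real work.

```lua
-- Sketch only. _O is the table from the text; everything else is assumed.
_O = {}
outputfile, inputfiles = nil, {}

-- Process one option, then let its handler recurse on the remaining args.
function dooptions(optname, ...)
  if optname == nil then return end          -- no more arguments
  local key = string.match(optname, "^%-(.+)$")
  local handler = key and _O[key]
  if not handler then
    error("Unrecognized option: " .. optname)
  end
  return handler(...)
end

-- Each handler consumes its own argument and passes the rest along.
_O["o"] = function (fname, ...)
  outputfile = fname
  return dooptions(...)
end
_O["i"] = function (fname, ...)
  inputfiles[#inputfiles + 1] = fname
  return dooptions(...)
end

dooptions("-o", "foo.html", "-i", "foo.blogme")
print(outputfile, inputfiles[1])   -- foo.html  foo.blogme
```

Because each handler decides how many arguments to consume before recursing, options with zero, one, or several arguments can coexist without a central table of arities.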

The word "lua:"

(...)

A more precise description

The core of Blogme is made of a parser that recognizes a very simple language, and an interpreter coupled to the parser; as the parser goes on processing the input text the interpreter takes the outputs of the parser and interprets these outputs immediately.

This core engine should be thought of as if it had layers. At the base, a (formal) grammar; then functions that parse and recognize constructs from that grammar; then functions that take what the parser reads, assemble that into commands and arguments for those commands, and execute those commands.

I think that the best way to describe Blogme is to describe these three layers and the implementation of the top two layers - the grammar layer doesn't correspond to any code. Looking at the actual code of the core is very important; the core is not a black box at all - the variables are made to be read by and changed by user scripts, and most functions are intended to be replaced by the user eventually, either by less simplistic versions with more features, or, sometimes, by functions only thinly connected to the original ones.

2. Ancestors

I know that it sounds pretentious to say that, but it's true... Blogme descends from three important "extensible" programming languages - Forth, Lisp, and Tcl - and from several

The design of Blogme was inspired mainly by - or borrows ideas from - Forth, Lisp, and Tcl.

2.1. Forth

This is a Forth program that prints "3 Hello20":

1 2 + . ." Hello" 4 5 * .

Forth reads one word at a time and executes it immediately (sometimes it "compiles" the word instead of running it, but we can ignore this now). `.' is a word that prints the number at the top of the stack, followed by a space; `."' is a word that prints a string; it's a tricky word because it interferes with the parsing to get the string to be printed. I've always thought that this permission to interfere with the parsing was one of Forth's most powerful features, and I have always thought about how to implement something like that - maybe as an extension - in other languages.

So - the Forth interpreter (actually the "outer interpreter" in Forth's jargon; the "inner interpreter" is the one that executes bytecodes) reads the word `."', and then it calls the associated code to execute it; at that point the pointer to the input text - let's call it "pos" - is after the space after the `."', that is, at the `H'; the code for `."' advances pos past the `Hello"' and prints the "Hello". After that the control returns to the outer interpreter, which happily goes on to interpret "4 5 * .", without ever touching the 'Hello"'.
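The mechanism can be imitated in a few lines of Lua. The toy below is only an illustration (the name "forthrun" is hypothetical, there is no compilation and no error checking, and a real Forth works differently in many ways), but it shows a word, `."', that moves the interpreter's "pos" by itself. In this toy, "pos" is left just after the word `."', so the pattern inside the word skips the separating space too.

```lua
-- Toy "outer interpreter", for illustration only.
function forthrun(src)
  local pos, stack, out = 1, {}, {}
  local function push(x) stack[#stack + 1] = x end
  local function pop() return table.remove(stack) end

  local words = {}
  words["+"] = function () local b, a = pop(), pop(); push(a + b) end
  words["*"] = function () local b, a = pop(), pop(); push(a * b) end
  words["."] = function () out[#out + 1] = tostring(pop()) .. " " end
  -- This word interferes with the parsing: it reads the input text
  -- up to the next double quote and advances pos past it.
  words['."'] = function ()
    local str, newpos = string.match(src, '^ (.-)"()', pos)
    out[#out + 1] = str
    pos = newpos
  end

  while true do
    -- Read one whitespace-delimited word starting at pos.
    local word, newpos = string.match(src, "^%s*(%S+)()", pos)
    if not word then break end
    pos = newpos
    if words[word] then
      words[word]()
    else
      push(tonumber(word) or error("Unknown word: " .. word))
    end
  end
  return table.concat(out)
end

print(forthrun('1 2 + . ." Hello" 4 5 * .'))   -- 3 Hello20
```

The main loop never sees the `Hello"': by the time it reads the next word, "pos" is already past the closing quote.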

2.2. Lisp

In Lisp all data structures are built from "atoms" (numbers, strings, symbols) and "conses"; a list like (1 2 3) is a cons - a pair - holding the "first element of the list", 1, and the "rest of the list", which is the cons that represents the list (2 3). Trees are also built from conses and atoms, and programs are trees - there is no distinction between code and data. The Lisp parser is very simple, and most of the semantics of Lisp lies in the definition of the "eval" function. The main idea that I borrowed from Lisp's "eval" is that of having two kinds of evaluation strategies: in

(* (+ 1 2) (+ 3 4))

the "*" is a "normal" function, that receives the results of (+ 1 2) and (+ 3 4) and returns the result of multiplying those two results; but in

(if flag (message "yes") (message "no"))

the "if" is a "special form", that receives its three arguments unevaluated, then evaluates the first one, "flag", to decide if it is going to evaluate the second one or the third one.

2.3. Tcl

In Tcl the main data structure is the string, and Tcl doesn't even have the distinction that Lisp has between atoms and conses - in Tcl numbers, lists, trees and program code are just strings that can be parsed in certain ways. Tcl has an evaluation strategy, given by 11 rules, that describes how to "expand", or "substitute", the parts of the program that are inside ""s, []s, and {}s (plus rules for "$"s for variables, "#"s for comments, and a few other things). The ""-contexts and []-contexts can nest inside one another, and what is between {}s is not expanded, except for a few backslash sequences. In a sense, what is inside []s is "active code", to be evaluated immediately, while what is inside {}s is "passive code", to be evaluated later, if at all.

Here are some examples of Tcl code:

set foo 2+3
set bar [expr 2+3]
puts $foo=$bar                 ;# Prints "2+3=5"

proc square {x} { expr $x*$x }
puts "square 5 = [square 5]"   ;# Prints "square 5 = 25"

2.4. TH

Blogme descends from a "language" for generating HTML that I implemented on top of Tcl in 1999; it was called TH. The crucial feature of Tcl on which TH depended was that in ""-expansions the whitespace is preserved, but []-blocks are evaluated. TH scripts could be as simple as this:

htmlize {Title of the page} {
  [P A paragraph with a [HREF http://foo/bar/ link].]
}

but it wasn't hard to construct slightly longer TH scripts in which a part of the "body of the page" - the second argument to htmlize - would become, say, an ASCII diagram that would be formatted as a <pre>...</pre> block in the HTML output, keeping all the whitespace that it had in the script. That would be a bit hard to do in Lisp; it is only trivial to implement new languages on top of Lisp when the code for programs in those new languages is made of atoms and conses. I wanted something more free-form than that, and I couldn't do it in Lisp because the Lisp parser can't be easily changed; also, sometimes, if a portion of the source script became, say, a cons, I would like to be able to take this cons and discover from which part of the source script that cons came... in Blogme this is trivial to do, as []-blocks in the current Blogme scripts are represented simply by a number - the position in the script just after the "[".

3. The source files

3.1. brackets.lua: the parsers (_A and _B)

3.2. definers.lua: def and DEF (_AA)

3.3. escripts.lua: htmlizelines (_E)

3.4. elisp.lua: makesexphtml (_EHELP, _EBASE, etc)

3.5. dooptions (_O)

(2007apr18: Hey! The rest of this page refers to BlogMe2, that is obsolete... I just finished rewriting it (-> BlogMe3), but I haven't had the time yet to htmlize its docs...)

(2005sep28: I wrote this page in a hurry by htmlizing two of blogme's documentation files, README and INTERNALS, which are not very clean...)

4. Introduction

The "language" that blogme2.lua accepts is extensible and can deal with input having a lot of explicit mark-up, like this,

[HLIST2 Items:
  [HREF http://foo/bar a link]
  [HREF http://another/link]
  [IT Italic text]
  [BF Boldface]
]

and conceivably also with input with a lot of implicit mark-up and with control structures, like these examples (which haven't been implemented yet):

[BLOGME
  Tuesday, February 15, 2005

  I usually write my notes in plain text files using Emacs; in
  these files "["s and "]"s can appear unquoted, urls appear
  anywhere without any special markup (like http://angg.twu.net/)
  and should be recognized and htmlized to links, some lines are
  dates or "anchors" and should be treated in special ways, the
  number of blank lines between paragraphs matter, in text
  paragraphs maybe _this markup_ should mean bold or italic, and
  there may be links to images that should be inlined, etc etc
  etc.
]

[IF LOCAL==true
    [INCLUDE todo-list.blogme]
]

BlogMe also supports executing blocks of Lua code on-the-fly, like this:

[lua:
   -- We can put any block of Lua code here
   -- as long as its "["s and "]"s are balanced.
]

4.1. How the language works

BlogMe's language has only one special syntactical construct, "[...]". There are only four classes of characters: "[", "]", whitespace, and "word"; "[...]" blocks in the text are treated specially, and we use Lua's "%b[]" regexp-ish construct to skip over the body of a "[...]" quickly, skipping over all balanced "[]" pairs inside. The first "word" of such a block (we call it the "head" of the block) determines how to deal with the "rest" of the block.
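A minimal sketch of that skipping step, using string.find with an anchored "%b[]" pattern, is shown below. The function names "skipblock" and "headof" are hypothetical and are not taken from blogme2.lua.

```lua
-- Sketch only: skipblock and headof are hypothetical names.

-- If there is a balanced "[...]" block starting exactly at pos,
-- return the position just after its closing "]"; otherwise return nil.
function skipblock(subj, pos)
  local b, e = string.find(subj, "^%b[]", pos)
  if b then return e + 1 end
  return nil
end

-- Return the head (first word) of the block starting at pos, or nil.
function headof(subj, pos)
  return (string.match(subj, "^%[[ \t\n]*([^ \t\n%[%]]+)", pos))
end

local text = "x [HLIST2 a [IT b] c] y"
print(skipblock(text, 3), headof(text, 3))     -- 22  HLIST2
print(skipblock(text, 13), headof(text, 13))   -- 19  IT
print(skipblock(text, 1))                      -- nil
```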

To "evaluate" an expression like

[HREF http://foo/bar a link]

we only parse its "head" - "HREF" - and then we run the Lua function called HREF. It is up to that function HREF to parse what comes after the head (the "rest"); HREF may evaluate the []-expressions in the rest, or use the rest without evaluations, or even ignore the rest completely. After the execution of HREF the parsing resumes from the point after the associated "]".

4.2. How []-expressions are evaluated

Actually the evaluation process is a bit more subtle than that. In the last example, BlogMe doesn't just execute HREF(); it uses an auxiliary table, _A, and it executes:

HREF(_A["HREF"]())

_A["HREF"] holds a function, vargs2, that uses the rest to produce arguments for HREF. Running vargs2() in that situation returns

"http://foo/bar", "a link"

and HREF is called as HREF("http://foo/bar", "a link"). So, to define HREF as a head all we would need to do ("would" because it's already defined) is:

HREF = function (url, text)
    return "<a href=\""..url.."\">"..text.."</a>"
  end
_A["HREF"] = vargs2

4.3. Defining new words in Lua with def

Defining new heads is so common - and writing out the full Lua code for a new head, as above, is so boring - that there are several tools to help us with that. I will explain only one of them, "def":

def [[ HREF 2 url,text  "<a href=\"$url\">$text</a>" ]]

"def" is a Lua function taking one argument, a string; it splits that string into its first three "words" (delimited by blanks) and a "rest"; here is its definition:

restspecs = {
  ["1"]=vargs1,    ["2"]=vargs2,    ["3"]=vargs3,    ["4"]=vargs4,
  ["1L"]=vargs1_a, ["2L"]=vargs2_a, ["3L"]=vargs3_a, ["4L"]=vargs4_a
}
def = function (str)
    local _, __, name, restspec, arglist, body =
      string.find (str, "^%s*([^%s]+)%s+([^%s]+)%s+([^%s]+)%s(.*)")
    _G[name] = lambda(arglist, undollar(body))
    _A[name] = restspecs[restspec] or _G[restspec]
      or error("Bad restspec: "..name)
  end

The first "word" ("name") is the name of the head that we're defining; the second "word" ("restspec") determines the _GETARGS function for that head, and it may be either a special string (one of the ones registered in the table "restspecs") or the name of a global function.
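The definition of "def" relies on two helpers, "lambda" and "undollar", whose code is not shown on this page. The sketch below is a guess at what they might do, written only to make the example with HREF understandable: here "undollar" rewrites each "$name" inside a Lua string expression into a concatenation with the variable "name", and "lambda" compiles an argument list and a body expression into a function. The real helpers in blogme2.lua may behave differently.

```lua
-- Sketch only: these are assumed implementations of the helpers that
-- "def" calls; the real ones are not reproduced in this page.

-- loadstring exists in Lua 5.1; later versions use load for strings.
local loadcode = loadstring or load

-- Turn  "<a href=\"$url\">"  into  "<a href=\""..url.."\">"
-- (the input is the source text of a Lua string expression).
function undollar(str)
  return (string.gsub(str, "%$([%a_][%w_]*)", "\"..%1..\""))
end

-- Build a function from an argument list and a body expression.
function lambda(arglist, body)
  local code = "return function (" .. arglist .. ") return " .. body .. " end"
  return assert(loadcode(code))()
end

local body = [["<a href=\"$url\">$text</a>"]]
print(undollar(body))   -- "<a href=\""..url.."\">"..text.."</a>"
local href = lambda("url,text", undollar(body))
print(href("http://foo/bar", "a link"))
-- <a href="http://foo/bar">a link</a>
```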

5. The internals of blogme2.lua:

5.1. The main tables used by the program

_G: Lua's table of globals (rmt)
_W: blogme words
_P: low-level parsers
_A: argument-parsing functions for blogme words
_AA: abbreviations for argument-parsing functions (see `def')
_V: blogme variables (see "$" and `withvars')

5.2. Blogme words (the tables _W and _A)

(Source code: the function `run_head', at the end of blogme2-inner.lua.)

Let's examine an example. When blogme processes:

[HREF http://foo bar]

it expands it to:

<a href="http://foo">bar</a>

When the blogme evaluator processes a bracketed expression it first obtains the first "word" of the brexp (called the "head" of the brexp), that in this case is "HREF"; then it parses and evaluates the "arguments" of the brexp, and invokes the function associated to the word "HREF" using those arguments. Different words may have different ways of parsing and evaluating their arguments; this is like the distinction in Lisp between functions and special forms, and like the special words like LIT in Forth. Here are the hairy details: if HREF is defined by

HREF = function (url, str)
    return "<a href=\""..url.."\">"..str.."</a>" end
_W["HREF"] = HREF
_A["HREF"] = vargs2

then the "value" of [HREF http://foo bar] will be the same as the value returned by HREF("http://foo", "bar"), because

_W["HREF"](_A["HREF"]())

will be the same as:

HREF(vargs2())

When vargs2 is run the parser is just after the end of the word "HREF" in the brexp, and running vargs2() there parses the rest of the brexp and returns two strings, "http://foo" and "bar".

See: (info "(elisp)Function Forms")
and: (info "(elisp)Special Forms")

5.3. The blogme parsers (the table _P)

(Corresponding source code: most of blogme2-inner.lua.)

Blogme has a number of low-level parsers, each one identified by a string (a "blogme pattern"); the (informal) "syntax" of those blogme patterns was vaguely inspired by Lua5's syntax for patterns. In the table below "BP" stands for "blogme pattern".

 BP   | Long name/meaning    | Corresponding Lua pattern
 -----+----------------------+--------------------------
 "%s" | space char           | "[ \t\n]"
 "%w" | word char            | "[^%[%]]"
 "%c" | normal char          | "[^ \t\n%[%]]"
 "%B" | bracketed expression | "%b[]"
 "%W" | bigword              | "(%w*%b[]*)*" (but not the empty string!)

The low-level parsing functions of blogme are of two kinds (levels):

Functions in the "parse only" level only succeed or fail. When they succeed they return true and advance the global variable `pos'; when they fail they return nil and leave pos unchanged (*).
Functions in the "parse and process" level are like the functions in the "parse only" level, but with something extra: when they succeed they store in the global variable `val' the "semantic value" of the thing that they parsed. When they fail they are allowed to garble `val', but they won't change `pos'.

See: (info "(bison)Semantic Values")

These low-level parsing functions are stored in the table `_P', with the index being the "blogme patterns". They use the global variables `subj', `pos', `b', `e', and `val'.

An example: running _P["%w+"]() tries to parse a (non-empty) series of word chars starting at pos; running _P["%w+:string"]() does the same, but in case of success the semantic value is stored into `val' as a string -- the comment ":string" in the name of the pattern indicates that this is a "parse and process" function, and tells something about how the semantic value is built.

(*): Blogme patterns containing a semicolon (";") violate the convention that says that patterns that fail do not advance pos. Parsing "A;B" means first parsing "A", not caring if it succeeds or fails, discarding its semantic value (if any), then parsing "B", and returning the result of parsing "B". If "A" succeeds but "B" fails then "A;B" will fail, but pos will have been advanced to the end of "A". "A" is usually "%s*".
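The two levels of parsers, and the behavior of ";", can be sketched as follows. This is an illustration, not the code of blogme2-inner.lua: the table _P and the globals "subj", "pos" and "val" are from the text, the Lua patterns are copied from the table above, and the builder functions "parseonly", "parsestring" and "semicolon" are names invented for this sketch.

```lua
-- Sketch only: parseonly, parsestring and semicolon are assumed names.
subj, pos, val = "", 1, nil
_P = {}

-- "Parse only" level: return true and advance pos, or return nil.
local function parseonly(luapat)
  return function ()
    local b, e = string.find(subj, "^" .. luapat, pos)
    if not b then return nil end
    pos = e + 1
    return true
  end
end

-- "Parse and process" level: also store the matched text in val.
local function parsestring(luapat)
  return function ()
    local b, e = string.find(subj, "^" .. luapat, pos)
    if not b then return nil end
    val = string.sub(subj, b, e)
    pos = e + 1
    return true
  end
end

-- "A;B": run A ignoring its result, then return the result of B.
local function semicolon(A, B)
  return function () A(); return B() end
end

_P["%s*"]            = parseonly("[ \t\n]*")
_P["%c+"]            = parseonly("[^ \t\n%[%]]+")
_P["%c+:string"]     = parsestring("[^ \t\n%[%]]+")
_P["%s*;%c+:string"] = semicolon(_P["%s*"], _P["%c+:string"])

subj, pos = "  foo [", 1
print(_P["%s*;%c+:string"](), val, pos)   -- true  foo  6
subj, pos = "  [x", 1
print(_P["%s*;%c+:string"](), pos)        -- nil  3  (pos moved past "A")
```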

5.4. Files

(To do: write this stuff, organize.)

Files:

Its main directory: blogme/.
Its README, and a description of its INTERNALS.
Its source code:
The BlogMe source for this page.
The BlogMe source for my math page.
The BlogMe source for the navigation bar thing.

There is no .tar.gz yet (coming soon!).

5.5. Help needed

Lua seems to be quite popular in the M$-Windows world, but I haven't used W$ for anything significant since 1994 and I can't help with W$-related questions. If you want to try BlogMe on W$ then please consider writing something about your experience to help the people coming after you.

5.6. Etc

A BlogMe mode for emacs and a way to switch modes quickly (with M-m).

A note on usage (see the corresponding source code):

blogme2.lua -o foo.html -i foo.blogme

This behaves in a way that is a bit unexpected: what gets written to foo.html is not the result of "expanding" the contents of foo.blogme - it's the contents of the variable blogme_output. The function (or "blogme word") htmlize sets this variable. Its source code is here.

History: BlogMe is the result of many years playing with little languages; see this page. BlogMe borrowed many ideas from Forth, Tcl and Lisp.

How to get in touch with the author.

A test (2007apr26):

#*
# (find-localcfile "foo/")
# (find-localcfile "foo/"    "ignored")
# (find-localc     "foo/"    "should be ignored?")
# (find-localc     "foo/bar" "becomes a tag")
# (find-localcw3m  "foo/bar.html#tag" "ignored")
# (find-remotecfile "foo/")
# (find-remotecfile "foo/"    "ignored")
# (find-remotec     "foo/"    "should be ignored?")
# (find-remotec     "foo/bar" "becomes a tag")
# (find-remotecw3m  "foo/bar.html#tag" "ignored")