KIF: BNF

pat hayes phayes at ai.uwf.edu
Thu Jun 28 12:54:49 CDT 2001


KIF Syntax

An ISO- KIF document is encoded as a sequence of Unicode characters ( 
UCS-2, from the Unicode 'base multilingual plane' according to 
ISO/IEC 10646-1:1999). The standard encoding is UTF-8 (ISO 
10646-1:2000 Annex D ) .
Only characters in the US-ASCII subset are reserved for special use 
in ISO-KIF, so that SUO-KIF can be encoded as an ASCII text string if 
required. UCS-2 characters outside that range are represented in 
ASCII text by a character coding sequence of the form \unnnn or 
\unnnnnn where n is a hexadecimal digit character. When transforming 
an ASCII text string to UTF-8, such a sequence should be replaced by 
the corresponding UTF-8 byte encoding.

The rest of this document uses the ASCII rendering.

Lexicalization

The syntax of KIF is designed to be easy to process mechanically. The 
syntax is defined in terms of disjoint blocks of characters called 
lexical items or lexemes . A character stream can be converted into a 
stream of lexemes by a simple recursive process of lexicalisation 
which checks for a small number of interlexical characters, which 
indicate the end of a lexical item (and sometimes the start of the 
next one.) The interlexical characters are:
" 	U+0022
( 	U+0028
)	U+0029
<space>	U+0020
<tab>	U+0009
<line>	U+000A
<page>	U+000C
<return>   U+000D

(Note that the backslash character  \ (reverse solidus U+005C) also 
plays a role in lexicalization, by  cancelling the interlexicality of 
the immediately following character.)

Certain characters are reserved for special use as the first 
character in a lexical item. Most lexical items can be assigned a 
unique syntactic role solely on the basis of their first character, 
called the marking character; this role is then sufficient to enable 
a parser to construct the syntactic form of the KIF expression. The 
marker characters are:
  <             U+003C
" 	U+0022
( 	U+0028
)	U+0029
@	U+0040
# 	U+0023
?	U+0035
%             U+0025
`	U+0060
=	U+003D
/	U+002F
[	U+005B
:               U+003A
(Note. Not all of the marking characters are included in the lexical 
class <marker>)
(Note. The <  (U+003C)  and \ (reverse solidus U+005C) characters are 
reserved for special use.
< is reserved for use in an XML coding of ISO-KIF.
\ plays a special role in enabling "out of code" characters to be 
included in KIF ASCII text. Followed by the lowercase letter u 
(U+0075) and a four- or six-digit hexadecimal code, it is used to 
transcribe non-ASCII UTF-8 characters, as explained above. Any string 
of this form plays the same syntactic role in an ASCII string 
rendering as a single ordinary character. Backslash is also used as a 
cancelling prefix. The special roles of any of the interlexical or 
marking characters, and of the backslash character itself, can be 
cancelled by preceding the character with a backslash. This enables 
marking and interlexical characters to be included in lexical items 
without writing their Unicode equivalents.

In the following BNF, angle brackets <> enclose syntactic class 
names, | indicates alternatives, * indicates any sequence of 
expressions, + indicates any nonempty sequence, square brackets [ ] 
indicate an optional sequence, and curly brackets { } are used as 
grouping characters. To avoid ambiguity, <plus>, <asterisk>, and 
<vertical> are used to indicate the characters +  , * and | .

The BNF also applies to UTF-8 encodings, with the change noted to the 
category <nonascii>.

1.  Character classification

<white>		::= <space>|<tab>|<line>|<page>|<return>
<lexbreak>	::= <white>|(|)|"|\
<alpha>		::= A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z|
			a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
<digit>		::= 0|1|2|3|4|5|6|7|8|9
<marker>	                ::= @|#|?|`|.|=|/|[|:|%
<other>		::= !|$|%|&|<asterisk>|<plus>|,|-|;|]|^|_|{|}|<vertical>|~|>|'

2. Lexical analysis

<hexa>		::= <digit>|A|B|C|D|E|F
<nonascii>	                ::= 
\u<hexa><hexa><hexa><hexa>|\u<hexa><hexa><hexa><hexa><hexa><hexa>

Note.  For UTF-8 strings, <nonascii> is the set of all  non-control 
characters of UTF-8 outside the ASCII range, ie all non-control 
characters in the range U+007F-FFFFFD. The use of the \uXXXX sequence 
in UTF-8 character strings, while not illegal, is DEPRECATED.

<insert>		::= \<lexbreak>|<nonascii>
<wordchar>	::= <alpha>|<digit>|<marker>|<other>|<insert>
<stringchar>	::= <wordchar>|<white>|(|)
<docstring>	::= "<stringchar>*"

Note. A docstring can contain any character, but quotemark "  or 
backslashh can occur only in the combination /" or //. Docstrings are 
the only lexical items that can contain whitespace. They are used 
solely for documentation purposes and are considered to be 'atomic' 
lexical items with no logical interpretation or syntactic structure. 
They can be used to attach any kind of 'external' information to a 
KIF expression.

<lexitem>	 ::= <wordchar>*|(|)|<docstring>

Note. <lexitem> is an "umbrella" category which includes all lexical items.
The job of a lexical analyser is to divide the input character stream 
into a stream of non-overlapping lexical items, separated by 
<lexbreak> characters.

The rest of the syntactic specification is understood to refer only 
to the stream of lexical items, so that any whitespace (not included 
in an <insert>) is ignored. It defines a finer classification of 
lexical items into distinct lexical subclasses.

3. Lexical syntax

<relname>	            ::=   =|/=|
<connective>         ::= and|or|iff|implies|not
<wordbody>          ::= <wordchar>+
<indvar>	            ::= ?<wordbody>
<word>	             ::=  {<alpha>|<other>|<insert>}<wordchar>*
<specialname>         ::= %<wordbody>|documentation
<restriction>         ::= :<word>                //not a typo :-) //
<seqvar>	             ::= @<wordbody>
<importname>       ::= #<wordbody>
<quote>	            ::= `<wordbody>'
<numeral>	            ::= <digit>*

Note. A word is any lexical item that does not start with a marking 
character (though it may include them.)
Numerals should not be used as words, to avoid confusion with their 
use in ontologies which refer to numbers.
<specialname> is a name form reserved for 'private use' by other 
systems which may wish to attach information to KIF expressions in 
other formats (such as timestamping and authorship information).
The KIF core uses only <relname>, <connective>, <indvar> , <word> , 
<reservename> and <formname>; the other lexical forms have special 
meanings in various KIF extensions, so their use in the KIF core is 
PROHIBITED in order to maintain compatibility with the extensions. 
Seqvars are used in the sequence-quantifier extension; importnames 
are used in the namespace extension; quotes are used in the meta 
extension. Note that quotes are distinct from documentation strings.

4. Core expression syntax //read Notes and comments below before exploding.//

Note. All KIF expressions are written in LISP-style 'S-expression 
syntax'; they consist entirely of nested pairs of parentheses 
enclosing the logical subexpressions of the expression in question. 
Hence, KIF expressions can be encoded directly as LISP data 
structures. However, KIF does not require that expressions must be 
encoded in this way.
Note. The expression syntax will be modified in  some of the extensions.

<term>	            ::= <indvar> |<name>|(<term>+)|(term 
<specialname>*)|(<term> <docstring>)
<eqsent>	            ::= (= <term> <term>)
<ineqsent>	            ::= (/= <term> <term>)
<atomsent>           ::= (<term>+)|<eqsent>|<ineqsent>
<boolsent>	            ::= ({and|or} <sent>*) | ({implies|iff} 
<sent><sent>)|(not <sent>)
<quantsent>           ::= ({forall|exists} (<var>*) <sent>)
<sent>	            ::= <atomsent>|<boolsent>|<quantsent> |(<sent> <docstring>)
<docform>	            ::= (documentation [<word>] <docstring>)

-------

Notes and comments

This BNF is rather persnickety, but it has the merit of being 
absolutely watertight in the way it handles whitespace and 
lexicalization (unlike previous KIF BNFs, which all do some 
handwaving.)  Ive tried to organize it logically: all the 
lexicalization and character-streaming issues, and the Unicode/ASCII 
translations, are handled in parts 1 and 2; part 3 classifies lexical 
items, and then all the actual syntax is kept in one place. Only part 
4 should need to be altered in any extension.

I have deliberately kept the form of <words> as  lexically 
unrestricted as possible, to allow people to use things like URIs as 
words. This requires being rather 'generous' with allowing characters 
into words, but that is fine as long as we have some way to put them 
into categories, and since we use ? and @ for variable markers, I 
decided to try to keep this idea going by trying to do it all by the 
choice of the first character.

I used a % prefix rather than the post/prefix dots to indicate 
'external' words with a special non-KIF meaning, but if anyone 
objects to this convention we could replace it with some other 
convention, or omit it altogether from the core (though I think it 
would be pity to omit it as it is quite  harmless and may solve some 
very big headaches for some users.)

Keeping < as a marking character is just a way to stop it being the 
first item in any word, as that would break XML. (In XML one uses 
&lt; to escape the difficulty.)

The expression syntax is about as free as it can possibly be: there 
is no distinction between function, constant or relation symbols, and 
variables can appear anywhere. It would be easy to put those 
classifications back in if people want them there.

The syntax includes a proposal which I think it would be potentially 
useful and relatively harmless to include.  This is what I was 
talking about as 'wrappers' in the telecon.
The last clauses in <term> and <sent> allow comments to be 'wrapped' 
around any subexpression of a KIF expression; the idea is that these 
will  be semantically invisible.  So for example you could write
(forall  (?x ?y)(iff ((R ?x ?y) "why is this called R? -Pat 3/18") (R ?y ?x)))
which would have the exact same meaning as  what you would get by 
stripping out the comment syntax:
(forall  (?x ?y)(iff (R ?x ?y)(R ?y ?x)))

In the same spirit I've included documentation forms (not 
'expressions' since they have no semantics) in the syntax, with or 
without names.  To 'attach' an documentary string to an axiom, just 
write
( <the axiom>
  "documentation: yadahyadah")
which is  logically indistinguishable from writing the axiom by itself.

I have omitted restricted quantifier forms from this syntax, but it 
would be easy to put them back in if people want them. I think they 
are harmless and useful.  Change <quantsent> to:
<quantsent> ::= ({forall|exists} ({<var>[<restriction>]}* <sent>*) <sent>)
We are going to need restrictions in the sorted extension in any case.

---------------------------------------------------------------------
IHMC					(850)434 8903   home
40 South Alcaniz St.			(850)202 4416   office
Pensacola,  FL 32501			(850)202 4440   fax
phayes at ai.uwf.edu 
http://www.coginst.uwf.edu/~phayes
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/enriched
Size: 11320 bytes
Desc: not available
Url : http://philebus.tamu.edu/pipermail/kif/attachments/20010628/efea82ff/attachment.bin


More information about the Kif mailing list