Wednesday, February 22, 2012

Unicode source code

As Frank Salter wrote in a recent comment on case-sensitivity, most scientific fields require several dozen symbols and alphabets that go beyond the Latin characters available in plain old ASCII, a standard that is almost half a century old (1968).
For example, my field - civil and structural engineering - uses almost the entire Greek alphabet in the Eurocodes.
There is a widespread need to go beyond case-sensitivity and ASCII.
At the beginning of our efforts I already stated that Unicode source code was a design choice; shame on me for not having found the time to implement what is clear in our minds.
We shall - at least - write source code in Unicode and lay down some style guide for its usage. 

First of all, we need more liberal infix and prefix operators. By “liberal” I mean allowing, for example, Unicode math characters, such as the many mathematical symbols the Unicode standard defines.
Unicode libraries available as “logiciel libre” (free-as-in-freedom software) - for example GLib - let us check whether a given character is a math symbol (e.g. g_unichar_type (a_char) = G_UNICODE_MATH_SYMBOL; see also this table).
Of course I did not mean to allow code like «class PERSON… feature infix "open"», otherwise we would drive the parser crazy and end up with something Perl-like.
This way we would get rid of the usual rant from people coming from languages with overloading (e.g. C++, Java), who say:
“why do I have to write
my_matrix.multiply (another_matrix).scalar_vector(a_vector)
instead of (my_matrix*another_matrix)*a_vector?”
Because the mathematician would rather have written
"(my_matrix × another_matrix) ^ a_vector"
Scalar and matrix multiplications are not arithmetic multiplication, and in fact they have different symbols in "real" math.
The infix-prefix name rule could therefore be expressed using the Unicode classification of characters.
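To make the idea concrete, here is a minimal sketch in Python rather than Eiffel, using the standard `unicodedata` module in place of GLib. It assumes the rule is "an operator name is a non-empty sequence of characters in the Unicode general category Sm (Symbol, math)", which is the category GLib reports as G_UNICODE_MATH_SYMBOL. The function names `is_math_symbol` and `is_free_operator_name` are hypothetical and only serve the illustration; the rule a real compiler adopts may well differ.

```python
import unicodedata


def is_math_symbol(ch: str) -> bool:
    """Return True if ch is one character in general category Sm (Symbol, math)."""
    return len(ch) == 1 and unicodedata.category(ch) == "Sm"


def is_free_operator_name(name: str) -> bool:
    """Hypothetical naming rule: non-empty and made only of math symbols."""
    return len(name) > 0 and all(is_math_symbol(ch) for ch in name)


if __name__ == "__main__":
    # Print the verdict of the rule for a few candidate operator names.
    for candidate in ["×", "⊗", "+", "*", "open"]:
        print(repr(candidate), is_free_operator_name(candidate))
```

Under this assumed rule "×" and "⊗" are accepted while "open" is rejected, which is the behaviour wanted above. Note that some familiar ASCII operators such as "*" are classified as punctuation rather than math symbols, so a real rule would probably need to list the standard operators explicitly.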
Actually I would like to write LaTeX-like code such as `y := { -b +- sqrt { b^2 - 4 a c} } / {2a}` or `A_x := int_0^1 f_m(x)dx`, in a way similar to what ASCIIMathML does for HTML pages. But this is currently a dream.

3 comments:

  1. Two remarks:

    1- free operators are pretty, but they don't share the standard priorities of standard operators, so one needs a lot of parentheses, or a way to define operator priorities

    2- unicode requires deep changes in the compiler (least of all the parser), it won't be done anytime soon

  2. Having seen the above comment "... since valid Eiffel source code often resembles a natural language ...", I feel a little history might be informative and hopefully interesting.

    Eiffel continues a long line of language descriptions which started with Algol 60. The most important document on Algol 60 was the "Revised Report on the Algorithmic Language Algol 60" (http://www.masswerk.at/algol60/report.htm).

    The first paragraph of the summary states:
    "The report gives a complete defining description of the international algorithmic language Algol 60. This is a language suitable for expressing a large class of numerical processes in a form sufficiently concise for direct automatic translation into the language of programmed automatic computers."

    The final part of the introduction of the revised report deals with and recognises "three different levels of language", "namely a Reference Language, a Publication Language, and several Hardware Representations". The opening sentence of the "Description of the Reference Language" is:
    "As stated in the introduction, the algorithmic language has three different kinds of representations -- reference, hardware, and publication -- and the development described in the sequel is in terms of the reference representation. This means that all objects defined within the language are represented by a given set of symbols -- and it is only in the choice of symbols that the other two representations may differ. Structure and content must be the same for all representations."

    The book "Introduction to ALGOL" was written by R. Baumann, M. Feliciano, F.L. Bauer and K. Samelson (Prentice-Hall, Englewood Cliffs, 1964). The last two co-authors were also co-authors of the revised report. In the first part of the introduction of this book they use the term "assistant" and use it to include calculations carried out by both human and machine: "The computing machine is merely an assistant which has no insight into what it is doing".

    Algol 60 as defined in the revised report contains no input or output instructions. The reference language uses normal mathematical representation for operators. The original text was typewritten (http://www.datamuseum.dk/site_dk/rc/algol/algol60.pdf) with syntactical tokens (reserved words) underlined. Only in printed versions were bold fonts used for these tokens which are now readily represented as begin, end etc.
    I first used Algol in 1965 on the London University Atlas computer and subsequently on the ICL 1900 series machines. Using paper tape input, the programming representation of the language was identical to the reference language with begin and end as brackets with input and output instructions taken from Mercury Autocode. So begin and end (merely identifiers) could be used as user defined variables, arrays, labels, switches or procedures.
    On other computers the restrictions of hardware representation produced the restriction on language representation which is common today. Luckily, however, today's editors provide improved syntactical highlighting to overcome the paucity of modern programming languages' basic symbols.

    I believe that you can see that your observation about resembling natural language is no accident but the result of careful analysis by a group of farsighted individuals 50 years ago. So now that hardware representation can be as good as that of publication, I hope you will succeed in implementing Unicode and add an exception to C. A. R. Hoare's remark: "Here is a language so far ahead of its time that it was not only an improvement on its predecessors but also on nearly all its successors."
    My best wishes in this endeavour.
    PS I note from the preview that the bold forms of begin and end I tried to represent appear to be lost, but hopefully not in the to-be-published form!

    Replies
    1. This comment is a gem!
      It's way better than the article it comments on and I think it should be properly published.
