Cxy Syntax Updates

A complex syntax means a complex parser. A complex parser means more time spent creating and maintaining it. More maintenance may leave bugs alive for longer. More bugs mean more instability, and more instability means less predictability and trustworthiness. The result is a nuisance to the programmer. An undesirable trait.

 

I have chosen to adhere to some standards that I think make the language both easier to read and write. Not to mention that the syntax is very simple to parse. The Cxy standard has some familiar concepts seen in other programming languages as well.

  1. Array/Tuple Notation

    Validation

    It has proven to be quite useful to create arrays or tuples on the fly. This can be observed in several object-oriented languages. An example is C++, where tuple unpacking can swap multiple values at once. We can also create an “initializer list”, which follows a {} convention; the same can be done in Java. This allows one to painlessly create single- or multidimensional arrays.

    Cxy Implementation

    The syntax for arrays in Cxy is not so different: the [] tokens may be used for this purpose. Cxy does add one twist, however. If elements are listed consecutively and separated by commas, this already evaluates to an array without the brackets; the brackets are only necessary when designating a nested array. As an example:

    public x = 1, 3, 4, [9, 0, 4], 3, [0], 0;

    x itself contains only 7 objects. 2 of these objects are arrays; the rest are numbers. The first subarray, at position 3 (remember, indexing starts at 0), contains 3 elements. The array at position 5 contains 1 element. A pair of brackets around this entire assembly would make x an array of size 1, containing another array. This removes the need for the pre-tuple pre-parser that would be necessary if we were to use parentheses as tuple/array notation. In addition, parentheses as tuple notation would clash with mathematical expressions. From now on, parentheses are used for ordering statements and calling functions.

  2. Function Calls

    Validation

    Many languages have a notion of a function, and Cxy is no different. Unfortunately, many languages tend to make the syntax somewhat abhorrent. In some cases that is necessary, but in Cxy it is not. The result of point 1 creates a tiny artifact in the current function system.

    Cxy Implementation

    To resolve the artifact, a trailing comma behind even a single element is valid syntax; it signals that this object is indeed part of a tuple. This is useful for functions because it allows the function call to be expanded later.

    function(1,);

    This will allow the function to take args[0] as the single argument. When later expanding the function to take more arguments:

    function(1,2);

    nothing in the code will need to be changed. If we did not use the trailing comma, then args would be equal to the object 1, which is not an array. Normally, args is expected to be an array containing the different elements. Of course, some functions that will absolutely never use more than one argument may as well accept an object directly. An example of such a function is cos(n).

  3. If Statements

    Validation

    Many languages contain the if-statement. It is a conditional branch statement, vital to all modern languages.

    Cxy Implementation

    There is currently a thought of making the if-statement a binary operator taking 2 arguments. The first argument is an expression that evaluates to a boolean. The second argument is an inline function. If there is no function, a single statement will be taken as the argument.

    if (condition) {++a;}

    This is pretty much exactly the same as C, C++, and Java. This will be parsed as

    condition if; {++a;};

    If the condition is true, the next statement is checked: “Is it a function? Execute it inline. Not a function? Execute the statement.” The subsequent statement is skipped if the condition evaluates to false. This syntax is more liberal than that of Java or C++, as it allows for:

    public a = 3;
    public function = {++a;};
    if (a > 2) function;
    ...
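
    To make the idea concrete, here is a minimal C++ sketch of the if-as-a-binary-operator idea: a helper taking a boolean and a callable. The helper name and its shape are my own, for illustration only, not Cxy's actual engine.

    #include <functional>

    // Hypothetical helper: "if" as an operator taking a condition and an
    // inline body. Returning whether the body ran mirrors the register of
    // successful execution mentioned under While Statements below.
    bool if_op(bool condition, const std::function<void()>& body)
    {
        if (condition)
            body();
        return condition;
    }

    // Usage mirroring the Cxy snippet above:
    //   int a = 3;
    //   auto function = [&] { ++a; };
    //   if_op(a > 2, function);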
    
  4. While Statements

    Validation

    The while statement is basically a conditional jump statement whose target lies in front of the jump itself. Regardless, I wish to enhance the while with some rules that make it not only simpler, but also stronger.

    Cxy Implementation

    It appears convenient to be able to see whether a while has executed at least once. This makes the syntax more liberal than that of Java or C++, as it allows for:

    public a = 30;
    while (a > 1)
    {
     --a;
     f();
    }
    else
    {
      a = 1;
    }
    

    This code guarantees that a will eventually be equal to 1. The while affects the register of successful execution on its first run: if the while has run at least once, the else will not execute. This avoids unnecessary counters that obfuscate the code and hinder the programmer from implementing the logic.
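
    For comparison, the same guarantee could be emulated in today's C++ with an explicit flag. This is only a sketch of the intended semantics, with f() as a stand-in body; it is not how the Cxy engine necessarily implements it.

    #include <iostream>

    void f() { /* stand-in for whatever the loop body does */ }

    int main()
    {
        int a = 30;
        bool ran_once = false;      // the "register" the Cxy while would set

        while (a > 1)
        {
            ran_once = true;
            --a;
            f();
        }
        if (!ran_once)              // Cxy's else: only when the loop never ran
        {
            a = 1;
        }

        std::cout << a << '\n';     // a always ends up as 1
    }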

  5. Do-While Statements

    Validation

    The do-while lets you guarantee at least one execution of a while statement; the condition is not checked for that first pass. However, this is not the general case. What if we would like 2 or more guaranteed iterations?

    Cxy Implementation

    do 1 while (condition) 
    {
      f();
    }
    

    Here, the 1 may be omitted; do alone implies 1 unchecked iteration. If a number n is given, then the piece of code is executed n times before the condition is checked. This also makes it easy to switch the do off entirely by giving 0 as the argument.
    Why bother with C and C++’s form, where the while has to be appended behind the code to be executed? That form also requires an extra semicolon in C/C++/Java, making the code more bug-prone. In addition, the following syntax may be used:

    do x
    {
      f();
    }
    

    as a very simple integer-based loop.
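
    A rough C++ sketch of how “do n while (condition)” could desugar; the helper name and the use of std::function are my own assumptions, not the actual translation:

    #include <functional>

    // Run the body n times unconditionally, then keep running while the
    // condition holds. n == 0 switches the unconditional part off entirely.
    void do_n_while(int n,
                    const std::function<bool()>& condition,
                    const std::function<void()>& body)
    {
        for (int i = 0; i < n; ++i)
            body();
        while (condition())
            body();
    }

    // "do x { f(); }" as a plain integer loop is then just:
    //   do_n_while(x, [] { return false; }, [] { f(); });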

  6. For Statements

    Validation

    Iterating over an object that contains states or items is very useful. In many languages it is possible to define variables scoped to the loop, which keeps the loop very clean. However, the amount of freedom is limited. One example is the inability to declare variables of different types in a C++ for header.

    Cxy Implementation

    for (local x = %list : condition_statement : ++x)
    {
      f();
    }
    

    This for is basically a layer over the for previously defined. It is easier to read and more comfortable to work with. A pre-parser should split the contents of the parentheses into 3 individual functions; together with the body, for itself takes 4 functions as arguments: declarations, condition, iteration, executor.
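
    As a sketch only, that 4-function shape could look like the following in C++; the name cxy_for, std::function, and the exact signatures are my stand-ins, not the engine's real interface.

    #include <functional>
    #include <iostream>

    // for as a plain function of four callables, in the order given above:
    // declarations, condition, iteration, executor.
    void cxy_for(const std::function<void()>& declarations,
                 const std::function<bool()>& condition,
                 const std::function<void()>& iteration,
                 const std::function<void()>& executor)
    {
        declarations();
        while (condition())
        {
            executor();
            iteration();
        }
    }

    int main()
    {
        int x = 0;
        cxy_for([&] { x = 0; },
                [&] { return x < 3; },
                [&] { ++x; },
                [&] { std::cout << x << '\n'; });   // prints 0, 1, 2
    }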

  7. Inline Calls

    Validation

    A function is called inline when its body is executed as if it were part of the current function context. In C and C++ it is necessary to use macros to get truly inline calls of this kind. An inline function execution does not push the stack of reads, only the stack of new data declarations.

    Cxy Implementation

    Because if, while, and for call their functions inline, it ought to be mandatory for an inline keyword to exist. The syntax is as follows.

    inline function_name;
    

    There is no need for an argument list as the argument list from the lowest non-inline function is applied.

A lot more will come over time. For now, this will do. A few new modules will need to be added to the current toolchain for parsing this grammar.

The Infernal Loop Iterator and Cxy’s For

What is the infernal loop iterator?

The infernal loop iterator is a generic piece of code used in a program. It can iterate over all elements contained within a collection. The problem is that it is unsafe. Musing Mortoray brought up a good point about a severe flaw in many languages, a flaw brought to us by the for-each structure.
The flaw surfaces when the collection is modified from within the for-each loop itself.

for (auto &item : collection)
    operation(item);

There is no guarantee that operation does not edit the actual collection object. It may remove or add objects, causing undefined behaviour in C++ as the iterator is invalidated. Normal incrementation now comes with the occasional segfault. A Heisenbug has been born. Good luck debugging it.

Why not use raw elemental integer iteration?

As explained by Mortoray, such iteration is valid at all times, with the exception that it may skip elements or iterate over the same element multiple times (depending on the actions of operation). Could this possibly be solved?
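
The point is easiest to see in C++ with a plain index; operation here is an assumed stand-in that may mutate the vector it receives elements from.

#include <cstddef>
#include <vector>

void operation(int& item);   // assumed stand-in; it may insert into or erase
                             // from the collection the item came from

void iterate(std::vector<int>& collection)
{
    for (std::size_t i = 0; i < collection.size(); ++i)
    {
        // The index itself never dangles, so this is "valid at all times".
        // But if operation erases collection[i], the next element slides
        // into slot i and is skipped by ++i; if it inserts an element at or
        // before position i, the element just processed is visited again.
        operation(collection[i]);
    }
}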

Let’s see what happens if we make the integer a reference to an internally defined object inside collection. Now define functionality such that this integer changes automatically when an object is added or removed. Does this create a new problem?

Yes. Now the problem manifests itself with non-local edits to collection: how are we going to let the iterator notice that a new element was added at the beginning? In addition, the use of integers would not be applicable to (hash-)sets or (hash-)maps.

Desired Behaviour

I can currently think of 2 well-defined behavioural possibilities applicable:

    1. If removed before iteration, do not iterate. If added before iteration element reached, iterate.
    2. Any addition is not iterated. Removed items are iterated.

1. The desired behaviour is well-defined for all edits performed on a collection within a loop. The first idea is to make the collection mutable. If an item is added, it needs to be iterated. If an item is removed before being iterated, it is not iterated. If an item is removed after being iterated, it will not be present post-for. This can be accomplished by having 2 collections: 1 empty, 1 the original. As we iterate, an element is moved from the original collection to the new collection. If an element is added, it is added to the original collection. If an element is removed, it is searched for in the original collection and, if not found there, in the new collection; it is subsequently removed from its respective set.

  • Pros:
    • In-Loop removal of uninteresting elements
    • In-Loop addition and automatic iteration
    • In-Loop removal of already iterated elements
    • Useful for (partially) recursive algorithms
  • Cons:
    • No well-defined order of iteration
    • Possible infinite loops

2. The second idea creates a shallow copy of the collection and performs all edits upon this copy, while its iterator iterates over elements defined inside the original collection object.

  • Pros:
    • Always iterating all items that were in the collection just before the iteration
    • Lower chance of infinite loops due to conditional addition
    • Cleanly defined set, perfect for non-recursive algorithms
  • Cons:
    • No well-defined order of iteration
    • No in-loop changes
    • Algorithmically complex to use with recursive functions

How it may look in Cxy

The syntax will be explained after this section.
For each element iteration in a table:

local col;

col.push(9);
col.push("A string");
col.add("key", "value");

for col item
{
    if (item[1] > 3)
    {
        col.push(2);
    }
}

If we pushed 4 or higher instead, this iteration would never end: a new element would constantly be introduced. This happens with option 1, the size of col approaching ∞.
With option 2, the new element is never iterated. The end result is a col of size 4.

For integer-element iteration in a table:

local col;

col.push(3);
col.push(9);
col.push("A string");

for col item
{
    if (item[1] > 3)
    {
        col.push(-8);
    }
}

The strength of the for keyword is that it may define a specific order of iteration: an order where 0 is the first element and size - 1 the last, incrementing the current key to approach size - 1.

This guarantee can only be granted if all keys are integers and aligned.

Powerfor

Powerfor is a for that grants us total control over the iterations. It is seen in many mainstream programming languages. The for has 4 parameters in the following order:

  1. Declaration of data
  2. Boolean function returning whether to continue running or not
  3. Code run after each iteration (the increment)
  4. The iteration body (the code to execute each round)

for (int i = 0; i < 300; ++i)
{
    std::cout << i << '\n';
}
std::cout << std::flush;

I think it will be useful to have a similar structure in Cxy. For the ease of the parser and tokenizer I think 2 functions can be utilized instead:

for {++ *i; local !*i = 0; i < 300;}
{
    doSomething(i);
}

The RPN parser will evaluate this as:

__func0 __func1 for

Note that i is not assigned 0 every iteration. The first function works on the iterator and manages it. The second function acts using the iterator. for makes sure that the locally defined variables are also accessible inside the second function.

With powerfor in mind, we can re-write the previously defined for as a powerfor:

local col;

col.push(3);
col.push(9);
col.push("A string");

for col item
{
    if (item[1] > 3)
    {
        col.push(-8);
    }
}

Becomes

local col;

col.push(3);
col.push(9);
col.push("A string");

for {++ *i; local !*i = 0; i < col.length;}
{
    if (col[i] > 3)
    {
        col.push(-8);
    }
}

A little syntax:

++ *i;

The unary * is the “existence” qualifier. It continues execution of the current statement only if the operand actually exists. If not, it terminates the current statement immediately, without error. The ++ increments the value of i by 1.

local !*i = 0;

The behaviour of !* is the opposite: it aborts the execution of the current operation if the item has been defined previously. If the item already exists, the current statement is terminated without error.
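
A small C++ analogy of these two qualifiers over a plain symbol table; the table type and the helper names are assumptions of mine, not the engine's actual representation.

#include <map>
#include <string>

using Env = std::map<std::string, long long>;

// "* name": the rest of the statement only runs when the variable exists.
bool exists(const Env& env, const std::string& name)
{
    return env.find(name) != env.end();
}

// "local !*i = 0": define i only when it was not defined before; if it
// already exists, the statement silently does nothing.
void define_if_absent(Env& env, const std::string& name, long long value)
{
    env.emplace(name, value);   // emplace is a no-op for an existing key
}

// The powerfor header above then reads roughly as:
//   ++ *i;          ->  if (exists(env, "i")) ++env["i"];
//   local !*i = 0;  ->  define_if_absent(env, "i", 0);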

for col it
{}

The first argument specifies the collection to iterate over, the second the iterator name used in the function coming afterwards. The function will receive the name in its namespace, as well as all elements visible right before for.

Concluding Ideal Behaviour

Which of the 2 ideal behaviours is the most practical for these types of loops?

When processing text in C++ for Cxy, I end up with a function that adds elements to an array, and this function needs to perform lexical analysis on each element of said array. It is rather tedious to use the C++ syntax to avoid segmentation faults and undefined/unspecified behaviour, especially since I am using iterators. I suppose the first idea is superior. Of course, it’s all open to debate, and it may be changed in the future.

Implementation-wise, the following algorithm differs from the description given first, but I suppose it is both faster and easier to implement. What I have in mind is as follows:

  1. Copy container A into B.
  2. Pop the top from A.
  3. Perform operations using this top as the parameter.
  4. If any removal/addition occurs, apply it to both A and B.
  5. If there are more elements in A, go to 2.

That seems to be about right. Considering that Cxy uses references, a shallow copy is very cheap; it is in fact on par with an std::memcpy call. Regardless, premature optimization equates to stagnation, and is therefore premature slowdown.
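
A minimal sketch of that algorithm in C++; the container choice, the helper name, and the callback shape are my assumptions, while the real engine would work on Tables.

#include <deque>
#include <iostream>

// A is the work list that gets drained; B accumulates what the collection
// looks like once the loop is over. The body receives the current element
// and applies any additions/removals to both containers itself.
template <typename T, typename Body>
std::deque<T> iterate_mutable(std::deque<T> a, Body body)
{
    std::deque<T> b = a;            // 1. copy container A into B
    while (!a.empty())              // 5. more elements in A? go to 2
    {
        T top = a.front();          // 2. pop the top from A
        a.pop_front();
        body(top, a, b);            // 3./4. operate; edits go to both A and B
    }
    return b;
}

int main()
{
    std::deque<int> col {3, 9, 5};
    auto result = iterate_mutable(col,
        [](int top, std::deque<int>& a, std::deque<int>& b)
        {
            if (top > 3)                // additions are themselves iterated,
            {                           // as behaviour 1 demands; the values
                a.push_back(top - 8);   // shrink, so the loop terminates
                b.push_back(top - 8);
            }
        });

    for (int x : result)
        std::cout << x << ' ';          // 3 9 5 1 -3
    std::cout << '\n';
}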

Next up: a nested-tuple pre-parser for the RPN algorithm (SSYA).

Irreducible Operations and Primitive Types

What are irreducible operations?

Cxy is a high-level language. This means that there will be operations defined in the language that cannot be broken down any further inside Cxy itself. Examples are integer addition, multiplication, and so on. These are irreducible.

So which primitive types are there?

Many modern high-level languages define a set of primitive types. We find in the Zend engine – for PHP – that types are defined as such:

typedef union _zvalue_value 
{
    long lval; /* long value */
    double dval; /* double value */
    struct 
    {    
        char *val;
        int len; /* this will always be set for strings */
    } str; /* string (always has length) */
    HashTable *ht; /* an array */
    zend_object_value obj; /* stores an object store handle, and handlers */
} zvalue_value;

So to summarize, Zend has the following primitive types:

  • Long Signed Integer
  • Double Precision Floating Point
  • String (Limited by C’s int)
  • HashTable

Desirable Traits

These types will often give high-performance, time-critical results, as they are defined by the C standard and are native to the CPU and/or FPU.
However, Cxy has no time for premature optimization of any kind, due to the high latency tolerance of its applications. The goal here is to define types that are:

  • Simple
  • Generic
  • Powerful

PHP’s types, but different…

To achieve this, I have sought to provide the same primitive types as PHP grants, the reason being that anything can be conveyed through a string alone. But first, a few clarifications and declarations of necessities:

  1. Booleans will be stored as an integer of 0 or not 0, where 0 is false and not 0 is true
  2. Characters do not exist. Instead, ‘glyphs’ do
  3. Strings do not exist. Instead, ‘text’ does
  4. Glyphs will be stored as single-element text. (unlike char in C/C++/Java)
  5. Integers are text
  6. Floats are text

Cxy: Everything is Text!

Well, yes! This appears perfectly sane, with the exception that operators must be unambiguously defined on the data types.

The operators are defined in the following list:

  1. +
  2. *
  3. /
  4. |
  5. \
  6. %
  7. []
  8. =
  9. :=
  10. ++
  11. ^
  12. <<
  13. >>
  14. <=
  15. ==
  16. >=
  17. !=
  18. &
  19. &&
  20. ||
  21. +=
  22. -=
  23. *=
  24. /=
  25. \=
  26. %=
  27. <<=
  28. >>=
  29. &=
  30. |=
  31. ()

Operator Ambiguity and Tables

Now we can ask ourselves which operations are ambiguous. + on actual text and on integers could be ambiguous; however, we have the concatenation operator &, so + always requires interpreting both operands as numbers. It appears no ambiguous operators are present.
This, however, gives us the slightest bit of trouble within the engine: how can an object contain data?
– This is where the HashTable comes in, or as I have taken to calling it: the “Table”.
Here is the code in the current project’s Table (C++).

    /**
        \brief The Table Container

        This container functions exactly like a map, with the exception being
        that it can be pushed and popped like a normal vector.

        The compatible key type is one that has the following features:
        1. Default construction is the zero-point (null), which is well-defined.
        2. Supports postfix ++.
        3. Supports prefix --.
        4. Supports operator<.
    */
    template <typename Key, typename Value>
    class Table : public std::map<Key, Value>
    {
    public:

        using std::map<Key, Value>::map;


        auto length ( ) const -> const Key &
        {
            return m_current_index;
        }


        auto front ( ) const -> const Value &
        {
            Key null {};
            return this->at(null);
        }


        auto front ( ) -> Value &
        {
            Key null {};
            return (*this)[null];
        }


        auto back ( ) const -> const Value &
        {
            Key copy ( m_current_index );
            --copy;
            return this->at(copy);
        }


        auto back ( ) -> Value &
        {
            Key copy ( m_current_index );
            --copy;
            return (*this)[copy];
        }


        void push_back ( )
        {
            this->insert(std::make_pair(m_current_index++, Value()));
        }


        void push_back ( const Value &register_arg )
        {
            this->insert(std::make_pair(m_current_index++, register_arg));
        }


        void push_back ( Value &&register_arg )
        {
            this->insert(std::make_pair(m_current_index++, std::move(register_arg)));
        }


        void pop_back ( )
        {
            --m_current_index;
            this->erase(m_current_index);
        }


    private:

        Key m_current_index {};

    };
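
A short usage example of this Table, assuming an integral Key; the values are arbitrary:

    #include <iostream>
    #include <string>

    int main()
    {
        Table<long long, std::string> table;

        table.push_back("alpha");                // stored under key 0
        table.push_back("beta");                 // stored under key 1
        table[42] = "gamma";                     // plain std::map access still works

        std::cout << table.length() << '\n';     // 2: only push_back moves the index
        std::cout << table.front() << ' '
                  << table.back()  << '\n';      // alpha beta

        table.pop_back();                        // erases the entry under key 1
    }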

With Tables and Text, can we perform all desired operations?

Yes. However, the current interpreter needs to use a base type which describes a variable. This base type (base as in inheritance) will be used to keep track of whether the variable is a table or text.
We can easily idealize a table as a piece of text. However, this idealization will probably not hold true within the interpreter. The engine’s internals are regarded as unimportant so long as the desired results are achieved.

A complete overview of the data types

  • Text
    • Integer
    • Float
    • Glyph Sequence
  • Table

The nested items under Text are what a “text” can simultaneously be interpreted as: idealized as an actual piece of text, implemented however is most convenient.

Any chance to see Operator Overloading?

Definitely! Any Table can define its own operators. In fact, it must: there are no standard operators defined for Tables!
Interestingly, the current tokenizer would allow one to define completely arbitrary operators using a specific text (e.g. /_\ or |&|), such as

 a |&| b; 

I suppose that feature is not nearly as important as the rest right now, so it’ll be postponed.

Therewith, Hath Some Code!

String and integer cooperation:

public a = 3; // Integer
public b = "4"; // Text, single glyph

public c = a + b; // c becomes "7".
c &= a; // c becomes "73"
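
Inside the engine, + and & on two texts could then be implemented roughly like this; a sketch with assumed helper names, handling only integral text:

#include <iostream>
#include <string>

// Hypothetical engine helpers: + interprets both operands as numbers,
// & concatenates the raw text.
std::string text_add(const std::string& lhs, const std::string& rhs)
{
    long long sum = std::stoll(lhs) + std::stoll(rhs);
    return std::to_string(sum);
}

std::string text_concat(const std::string& lhs, const std::string& rhs)
{
    return lhs + rhs;
}

int main()
{
    std::string a = "3";
    std::string b = "4";

    std::string c = text_add(a, b);      // "7"
    c = text_concat(c, a);               // "73", i.e. the &= above
    std::cout << c << '\n';
}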

Table usage:

public a;
public a.b = 9;
public a.c = 
{
	if (b < 100)
		++b;
	if (d exists)
		d &= T"Added one\n";
	else
		private d;
};

a.c(); // Call the method

--a + ++a, Undefined Order?

In C and C++, the evaluation order of an expression that contains unsequenced modifications of the values involved is undefined.

I wish to remedy some of this undefined behaviour in order to make Cxy safer and more strict.

Currently, --a + ++a will evaluate from left to right. The SSHA (custom Shunting-Yard Algorithm) will evaluate this as:

a __p-- a __p++ +

When the stack-machine is operating, the stack will contain 3 elements after the -- and ++ expressions have been evaluated.

refa refa +

Where refa is a reference to a after the operation. Because of the prefixes, no new element is returned.

What just happened?

Firstly, a was decremented, then incremented. After that, we add a and a, which is the same as if nothing had been done to a at all.

Is this desired behaviour?

Imagine --a * b + ++a;
This is translated to:
a __p-- b * a __p++ +

Here, the decremented a does indeed have an effect, as the expression where b is multiplied with a is evaluated before the increment operator.

Currently I’m undecided as to this being defined, undefined, or unspecified.

Cxy Functions

I’ve been thinking about functions in Cxy. I have a convenient syntax for creating n-tuples using the (n1,n2,n3,…,nm) notation. An example is a 3-tuple of numbers: (1, 9, 4). Every function gets an argument (a tuple), and returns a tuple.

The comma-operator in this regard appends an item to a tuple. (9, 4, 2) will create the tuple (9, 4) and append 2, such that we get (9, 4, 2). When we write ((9, 4), 2), a tuple of 9 and 4 is made first and we get (tuple_1, 2). We can only access the 9 or 4 via this tuple_1 (it won’t actually be named tuple_n, this is just an example). To do this, we write:

public name = ((9, 4), 2);
name[0][1]; // Access to 4.

Such that one can easily create nested tuples of elements (more like arrays).

Anyhow, this is useful for functions in Cxy, which I currently think will be declared by the simple syntax:

{};

There is no need for an argument list; that would complicate the grammar too much. Instead, we have a standard tuple called args. This tuple can be accessed with simple index operations “[n]” in order to get the arguments. In addition, the tuple carries the total number of elements (via a size member). This allows one to easily create variadic-argument functions (such as finding the maximum of a list of elements).

Now, functions themselves are pretty small and simple as well:

f =
 {
    if (args.elements >= 1)
    {
         args[0]++;
         return args[0] ^ 2;
    }
    else
        throw invalid_argument;
 };

print(f(3));

The language is changing quite a lot, so it’s still very plastic. I may change a lot of this in the future.

Shunt-Yarding a.b(c.d());

I’ve recently stumbled across a little artifact in my shunting-yard algorithm. This artifact can be observed by attempting to parse
a.b(c.d());

Let’s see what can be done to parse this correctly.
Firstly, the case where we use standard parsing:

a b . _op c d . _op
Execution -> e f

Where the dot operator accesses a member of the preceding object.
This is wrong: in Reverse Polish Notation, functions are evaluated left-to-right, which means that in our case we would evaluate a.b() first, before its argument c.d() exists… Could we perhaps allow the reading of right-sided function objects? Would this violate RPN?

Well yes, this does violate RPN and gives us a lot of trouble. c d . is a function that has to be evaluated on its own merit. If we create a stack such that we have to check left and right for function parameters, then we’ll get the problem of order of execution. If we evaluate solely right-to-left, then we end up with the same problem as left-to-right!

The main problem is that it is unknown whether a variable name denotes a function (functor) or just a reference to a variable. The nature of a name is only revealed once a ‘(’ character directly follows it. This gave me an idea…

I’ve come up with a simple solution to this problem:
create a buffer area between the input and output which can be pushed onto the operator stack if a ‘(‘ is encountered. Here’s an example of how it works:

name goes into queue.
( encountered, name is put onto stack.
) encountered, everything until ( popped to output.

f().b
is parsed into:
f _op b .

So to summarize, here are the 2 core changes to the shunting-yard algorithm:
1. Create a dot-builder (a queue between input and output).
2. Keep track of the last token’s type: was it a name or a functional token? If it was a name token and we encounter ‘(’, then we have a function call, so we insert _op (which is a function token).
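
Here is a rough sketch of those two changes over a stripped-down token stream. Only names, ‘.’, ‘(’ and ‘)’ are handled; grouping parentheses, other operators, and precedence are left out, and the structure is my own reading of the description above rather than the actual parser code.

#include <iostream>
#include <string>
#include <vector>

// Tokens are already split; the output is RPN. The "dot-builder" collects a
// name/dot chain and is parked on a stack together with _op when a '(' is
// encountered, so that the callee is emitted only after its arguments.
std::vector<std::string> shunt(const std::vector<std::string>& tokens)
{
    std::vector<std::string> output;
    std::vector<std::vector<std::string>> callee_stack;
    std::vector<std::string> builder;        // the dot-builder
    bool pending_dot = false;

    for (const std::string& tok : tokens)
    {
        if (tok == ".")
        {
            pending_dot = true;
        }
        else if (tok == "(")
        {
            builder.push_back("_op");        // the last token was a name: a call
            callee_stack.push_back(builder);
            builder.clear();
        }
        else if (tok == ")")
        {
            // The arguments (whatever the builder holds) go out first,
            // then the parked callee and its _op.
            output.insert(output.end(), builder.begin(), builder.end());
            builder.clear();
            output.insert(output.end(),
                          callee_stack.back().begin(), callee_stack.back().end());
            callee_stack.pop_back();
        }
        else                                 // a name token
        {
            builder.push_back(tok);
            if (pending_dot)
            {
                builder.push_back(".");
                pending_dot = false;
            }
        }
    }
    output.insert(output.end(), builder.begin(), builder.end());
    return output;
}

int main()
{
    // a.b(c.d());  ->  c d . _op a b . _op
    for (const std::string& t : shunt({"a", ".", "b", "(", "c", ".", "d", "(", ")", ")"}))
        std::cout << t << ' ';
    std::cout << '\n';

    // f().b  ->  f _op b .
    for (const std::string& t : shunt({"f", "(", ")", ".", "b"}))
        std::cout << t << ' ';
    std::cout << '\n';
}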

The only problem left is the case where we have

f()(g());

That’ll be solved another time.

Dxy and its Rules

Dxy is the tokenizer in Cxy. It can be considered “Data-XY”. This tokenizer divides a text into a sequence of tokens. The tokens can then be iterated over by the rest of the program. This is practical and easy, but there are some design choices to be made.

The Dxy Tokenization Program currently has the following rules to abide by:

1. Quote-enclosed text
2. A single semicolon, dot, comma, colon, parenthesis, brackets, square brackets
3. Non-whitespace characters in a sequence

It is observed that certain special characters are tokens on their own. The punctuators and groupers are chosen to represent special tokens due to their perceived importance in Cxy.

I think that what is useful to Cxy and what is useful for general data storage differ, and therefore I allow an optional regular-expression argument to the tokenize function. Cxy will internally use a regular expression suited to itself, whereas I think the standard Dxy tokenization programme should be as follows:

1. Quote-enclosed text.
2. Non-whitespace sequences.

Which is very simple. Quote-enclosed text can also contain a quote character by writing “”, which will be evaluated as a single “. This evaluation happens after tokenization, though, when scanning through each token. I think this is the way to go.
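
A sketch of that standard programme with std::regex; the function name is mine, and the unescaping of “” is deliberately left to the later per-token scan:

#include <iostream>
#include <regex>
#include <string>
#include <vector>

// Rule 1: quote-enclosed text, where "" stands for a literal quote.
// Rule 2: any other non-whitespace sequence.
std::vector<std::string> dxy_tokenize(const std::string& input)
{
    static const std::regex token(R"re("(?:""|[^"])*"|\S+)re");
    std::vector<std::string> tokens;
    for (std::sregex_iterator it(input.begin(), input.end(), token), end; it != end; ++it)
        tokens.push_back(it->str());
    return tokens;
}

int main()
{
    for (const std::string& t : dxy_tokenize("push \"a \"\"quoted\"\" text\" 42;"))
        std::cout << '[' << t << "]\n";
    // prints: [push], ["a ""quoted"" text"], [42;]
}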