Cxy Syntax Updates

Having a complex syntax means having a complex parser. A complex parser implies more time spent creating and maintaining. More maintenance may leave bugs alive for a longer period. More bugs means more instability. More instability means less predictability and trustworthiness. This results in a nuisance to the programmer. An undesirable trait.

 

I have chosen to adhere to some standards that I think make the language both easier to read and write. Not to mention that the syntax is very simple to parse. The Cxy standard has some familiar concepts seen in other programming languages as well.

  1. Array/Tuple Notation

    Validation

    It has proven to be quite useful to create arrays or tuples on the fly. This can be observed in some object-oriented languages. An example is C++, where tuple unpacking can swap multiple elements. We can also create an “initializer list” which follows a {} convention. The same can be done in Java. This allows one to painlessly create single or multidimensional arrays.

    Cxy Implementation

    The syntax for arrays in Cxy is not so different. The [] tokens may be used for this purpose. However, Cxy has one remark to make. If elements are consecutively listed and separated using a comma, then this will evaluate to an array without the brackets. The brackets are only necessary when designating a nested array. As an example:

    public x = 1, 3, 4, [9, 0, 4], 3, [0], 0;

    x itself contains only 7 distinct objects. 2 of these objects are arrays. The rest of these objects are numbers. The first subarray at position 3 (remember, indexing starts at 0), contains 3 elements. The array at position 5 contains 1 element. A bracket around this entire assembly would make x an array of size 1, containing another array. This solves the pre-tuple pre-parser which would be necessary if we were to use parenthesis as tuple/array notation. In addition, parenthesis as tuple notation would clash with mathematical expressions. From now on, parenthesis are used for ordering statements and calling functions.

  2. Function Calls

    Validation

    Many languages have a notion of function. Cxy is no different. Unfortunately, many languages tend to make the syntax somewhat abhorring. In some cases it is necessary, but in Cxy it is not. The result of 1. creates a tiny artifact with the current function system.

    Cxy Implementation

    To resolve the artifact, a valid syntax is to have an ending comma behind even a single element to notify that this object is indeed part of a tuple. This is useful for functions because it allows for the function call to be expanded later.

    function(1,);

    This will allow the function to take args[0] as the single argument. When later expanding the function to take more arguments:

    function(1,2);

    nothing in the code will need to be changed. If we did not utilize the ending comma, then args would be equal to the object 1. This would not be an array. Normally, args is expected to be an array containing different elements. Of course, some functions that will absolutely not use more than one argument may as well accept an object directly. An example of such a function is cos(n).

  3. If Statements

    Validation

    Many languages contain the if-statement. It is a conditional branch statement, vital to all modern languages.

    Cxy Implementation

    There is currently a thought about the if-statement being a binary operator for 2 arguments. The first argument is an expression that evaluates to a boolean. The second argument is an inline function. If there is no function, the a single statement will be taken as the argument.

    if (condition) {++a;}

    This is pretty much exactly the same as C, C++, and Java. This will be parsed as

    condition if; {++a;};

    If the condition is true, the next statement is checked: “Is it a function? Execute it inline. Not a function? Execute the statement.”. The subsequent statement is skipped if the statement evaluates to false. This syntax is more liberal than Java or C++, as it allows for:

    public a = 3;
    public function = {++a;};
    if (a > 2) function;
    ...
    
  4. While Statements

    Validation

    The while statement is basically a conditional jump statement executed such that it jumps in front of its own jump. Regardless, I wish to enhance the while with some rules that not only make it simpler, but also stronger.

    Cxy Implementation

    It appears to be convenient to see if a while has executed at least once. This syntax is more liberal than Java or C++, as it allows for:

    public a = 30;
    while (a > 1)
    {
     --a;
     f();
    }
    else
    {
      a = 1;
    }
    

    This code will guarantee that a will eventually be equal to 1. The while will affect the if register of successful execution on the first run. If the while is run at least once, the else will not execute. This avoids the usage of unnecessary counters that obfuscate the code and hinder the programmer from implementing the logic.

  5. Do-While Statements

    Validation

    The do-while allows you to guarantee at least one execution of a while statement. The condition is hereby not checked. However, this is not a general case. What if we would like 2 iterations or more?

    Cxy Implementation

    do 1 while (condition) 
    {
      f();
    }
    

    Here, the 1 may be omitted. The do alone implies 1 unchecked iteration. If a number is given, then the piece of code is executed n times before the condition is checked. This also easily allows the off-switch on the do by giving 0 as argument.
    Why bother with C++ and C’s implementation where the while has to be appended to the end of the code to be executed? In addition, this requires an extra semicolon in C/C++/Java, thus making code more bug-prone. In addition, the following syntax may be used:

    do x
    {
      f();
    }
    

    as a very simple integer based loop.

  6. For Statements

    Validation

    Iterating an object that contains states or items is very useful. In many languages, it is possible to define variables for use within this loop. This makes the loop very clean. However, there appears to be a limited amount of freedom. One such example is the inability to declare different types in C++.

    Cxy Implementation

    for (local x = %list : condition_statement : ++x)
    {
      f();
    }
    

    This for is basically a layer over the for previously defined. It is easier to read and more comfortable to work with. A pre-parser should parse the contents of the parenthesis into 3 individual functions. For itself takes 4 functions as arguments: Declarations, Condition, Iteration, Executor.

  7. Inline Calls

    Validation

    A function is called inline when its body is executed as if it were member code of the current function context. It is necessary to use macros in C and C++ to make truly inline function calls. The inline function execution does not push the stack of reads, only the stack of new data declaration.

    Cxy Implementation

    Because if, while, and for call their functions inline. It ought to be mandatory for an inline keyword to exist. The syntax is as follows.

    inline function_name;
    

    There is no need for an argument list as the argument list from the lowest non-inline function is applied.

A lot more will come over time. For now, this will do. A few new modules will need to be added to the current toolchain for parsing this grammar.

The Infernal Loop Iterator and Cxy’s For

What is the infernal loop iterator?

The infernal loop iterator is a generic piece of code used in a program. It can iterate over all elements contained within a collection. The problem is that it is unsafeMusing Mortoray brought up a good point of a severe flaw in many languages. This flaw is brought to us by the for-each structure.
Said structure has the flaw brought forward by modifying said structure within a for loop.

for (auto &item : collection)
‍	operation(item);

There is no guarantee that operation does not edit the actual collection object. It may remove or add objects, causing undefined behaviour in C++ as the iterator is invalidated. Normal incrementation now comes with the occasional segfault. A Heisenbug has been born. Good luck debugging it.

Why not use raw elemental integer iteration?

As explained by Mortoray, such iteration is valid at all times. The exception being that it may skip or iterate over the same element multiple times (depending on the actions of operation). Could this possibly be solved?

Let’s see what happens if we make the integer a reference to an internally defined object inside collection. Now define functionality such that this integer changes automatically when an object is added or removed. Does this create a new problem?

Yes. Now the problem will manifest itself with non-local edits to collection: How are we going to let the iterator note that a new element was added at the beginning? In addition, the use of integers would not be applicable to (hash-)sets or (hash-)maps.

Desired Behaviour

I can currently think of 2 well-defined behavioural possibilities applicable:

    1. If removed before iteration, do not iterate. If added before iteration element reached, iterate.
    2. Any addition is not iterated. Removed items are iterated.

1. The desired behaviour is well-defined for all edits performed on a collection within a loop. This first idea is make the collection mutable. If an item is added, it needs to be iterated. If an item is removed before being iterated, it is not iterated. If an item is removed after being iterated, it will not be present post-for. This can be accomplished by having 2 collections. 1 empty, 1 original. As we iterate, an element is moved from original collection to the new collection. If an element is added, it is added to the original collection. If an element is removed, it is searched for in the original, and if not found, the new collection. It is subsequently removed from its respective set.

  • Pros:
    • In-Loop removal of uninteresting elements
    • In-Loop addition and automatic iteration
    • In-Loop removal of already iterated elements
    • Useful for (partially) recursive algorithms
  • Cons:
    • ‍‍No well defined order of iteration
    • Possible infinite loops

2. The second idea creates a shallow copy of the collection and performs all edits upon this copy, while its iterator iterates over elements defined inside the original collection object.

  • Pros: ‍‍
    • Always iterating all items that were in the collection just before the iteration
    • Lower chance of infinite loops due to conditional addition
    • Cleanly defined set perfect for non-recursive algorithms
  • Cons: ‍‍
    • No well defined order of iteration
    • ‍‍No in-loop changes
    • Algorithmically complex to use recursive functions
      ‍‍ ‍‍

How it may look like in Cxy

The syntax will be explained after this section.
For each element iteration in a table:

local col;

col.push(9);
col.push("A string");
col.add("key", "value");

for col item
{
    if (item[1] > 3)
    {
        col.push(2);
    }
}

If we pushed 4 or higher, then this iteration would never end. A new element is constantly introduced. This happens with option 1. The col size approaching ∞.
With option 2, the new element is never iterated. The end result is a col of size 4.

For integer-element iteration in a table:

local col;

col.push(3);
col.push(9);
col.push("A string");

for col item
{
    if (item[1] > 3)
    {
        col.push(-8);
    }
}

The strength with the for keyword is that it may define a specific order of iteration. An order where 0 is the first element and size – 1 the last,  incrementing the current element to approach size – 1.

This guarantee can only be granted if all keys are integers and aligned.

Powerfor

Powerfor is a for that grants us total control over the iterations. It is seen in many mainstream programming languages. The for has 4 parameters in the following order:

  1. Declaration of data
  2. Boolean function returning whether to continue running or not
  3. Code after each iteration
  4. Iteration code
for (int i = 0; i < 300; ++i)
{
    std::cout << i << '\n';
}
std::cout << std::flush;

I think it will be useful to have a similar structure in Cxy. For the ease of the parser and tokenizer I think 2 functions can be utilized instead:

for {++ *i; local !*i = 0; i < 300;}
{
    doSomething(i);
}

The RPN parser will evaluate this as:

__func0 __func1 for

Note that i is not assigned 0 every iteration. The first function works on the iterator and manages it. The second function acts using the iterator. for makes sure that the locally defined variables are also accessible inside the second function.

With powerfor in mind, we can re-write the previously defined for as a powerfor:

local col;

col.push(3);
col.push(9);
col.push("A string");

for col item
{
    if (item[1] > 3)
    {
        col.push(-8);
    }
}

Becomes

local col;

col.push(3);
col.push(9);
col.push("A string");

for {++ *i; local !*i = 0; i < col.length;}
{
    if (col[i] > 3)
    {
        col.push(-8);
    }
}

A little syntax:

++ *i;

The unary * is the “existence” qualifier. It continues execution of the current statement if actually exists. If not, it terminates the current statement immediately without error. The ++ increments the value of i by 1.

local !*i = 0;

The behaviour of !* is to abort the execution of the current operation if the item has been defined previously. If the item does exist, current execution is terminated without error.

for col it
{}

The first argument specifies the collection to iterate over, the second the iterator name used in the function coming afterwards. The function will receive the name in its namespace, as well as all elements visible right before for.

Concluding Ideal Behaviour

What is most practical of the 2 ideal behaviours for these types of loops?

When processing text in C++ for Cxy I end up with a function that adds elements to an array. This function needs to perform lexical analysis on each element of said array. It is rather tedious for me to use the C++ syntax to avoid segmentation faults and undefined/unspecified behaviour. Especially since I am using iterators. I suppose the first idea is superiour. Of course, it’s all open to debate, and it may be changed in the future.

Implementation wise, the following algorithm differs from the description first given, but I suppose it is both faster and easier to implement. I have in mind is as follows:

  1. Copy container A into B.
  2. Pop the top from A
  3. Perform operations using the this top as parameter
  4. if any removal/addition occurs, apply them to both A and B.
  5. if there are more elements in A, goto 2.

That seems to be about right. Considering that the Cxy uses references, a shallow copy is very cheap. It is in fact on-par with an std::memcpy call. Regardless, premature optimization equates stagnation, and is therefore premature slowdown.

Next up: a nested-tuple pre-parser for the RPN algorithm (SSYA).

Irreducible Operations and Primitive Types

What are irreducible operations?

Cxy is a high level language. This means that there will be operations defined in the language that can not be broken down any more inside Cxy itself. An example is integer addition, multiplication, etc. These are irreducible.

So which primitive types are there?

Many modern high-level languages define a set of primitive types. We find in the Zend engine – for PHP – that types are defined as such:

typedef union _zvalue_value 
{
    long lval; /* long value */
    double dval; /* double value */
    struct 
    {    
        char *val;
        int len; /* this will always be set for strings */
    } str; /* string (always has length) */
    HashTable *ht; /* an array */
    zend_object_value obj; /* stores an object store handle, and handlers */
} zvalue_value;

So to summarize, Zend has the following primitive types:

  • Long Signed Integer
  • Double Precision Floating Point
  • String (Limited by C’s int)
  • HashTable

Desirable Traits

These types will often give high-performance and time-critical results as they are defined by the C standard and are native to the CPU and/or FPU.
However, Cxy has no time for premature optimization of any kind due to high latency tolerance in its applications. The goal is to here is to define types that are:

  • Simple
  • Generic
  • Powerful

PHP’s types, but different…

To envision this, I have sought to provide the same primitive types as PHP grants. The reason being that anything can be conveyed through a string alone. But first there must be a few clarifications and declarations of necessities:

  1. Booleans will be stored as an integer of 0 or not 0, where 0 is false and not 0 is true
  2. Characters do not exist. Instead, ‘glyphs’ do
  3. Strings do not exist. Instead, ‘text’ does
  4. Glyphs will be stored as single-element text. (unlike char in C/C++/Java)
  5. Integers are text
  6. Floats are text

Cxy: Everything is Text!

Well, yes! This appears perfectly sane, with the exception that operators must be unambiguously defined on the data types.

The operators are defined in the following list:

  1. +
  2. *
  3. /
  4. |
  5. \
  6. %
  7. []
  8. =
  9. :=
  10. ++
  11. ^
  12. <<
  13. >>
  14. <=
  15. ==
  16. >=
  17. !=
  18. &
  19. &&
  20. ||
  21. +=
  22. -=
  23. *=
  24. /=
  25. \=
  26. %=
  27. <<=
  28. >>=
  29. &=
  30. |=
  31. ()

Operator Ambiguity and Tables

Now we can ask ourselves which operations are ambiguous. + on actual text and integers can be ambiguous, however we have the concatenation operator &, so + requires the interpretation of both operands into numbers. It appears no ambiguous operators are present.
This however gives us the slightest bit of trouble within the engine. How can an object contain data?
– This is where the HashTable comes in, or as I have taken to call it: “Table”.
Here is the code in the current project’s Table (C++).

    /**
        \brief The Table Container

        This container functions exactly like a map, with the exception being
        that it can be pushed and popped like a normal vector.

        The compatible key type is one that has the following features:
        1. Default construction is the zero-point (null); which is well-defined.
        2. Supports postfix ++.
        3. Supports prefix --.
        3. Supports operator<
    */
    template <typename Key, typename Value>
    class Table : public std::map<Key, Value>
    {
    public:

        using std::map<Key, Value>::map;


        auto length ( ) const -> const Key &
        {
            return m_current_index;
        }


        auto front ( ) const -> const Value &
        {
            Key null;
            return (*this)[null];
        }


        auto front ( ) -> Value &
        {
            Key null;
            return (*this)[null];
        }


        auto back ( ) const -> const Value &
        {
            Key copy ( m_current_index );
            --copy;
            return (*this)[copy];
        }


        auto back ( ) -> Value &
        {
            Key copy ( m_current_index );
            --copy;
            return (*this)[copy];
        }


        void push_back ( )
        {
            this->insert(std::make_pair(m_current_index++, Value()));
        }


        void push_back ( const Value &register_arg )
        {
            this->insert(std::make_pair(m_current_index++, register_arg));
        }


        void push_back ( Value &&register_arg )
        {
            this->insert(std::make_pair(m_current_index++, std::move(register_arg)));
        }


        void pop_back ( )
        {
            --m_current_index;
            this->erase(m_current_index);
        }


    private:

        Key m_current_index;

    };

With Tables and Text, can we perform all desired operations?

Yes. However, the current interpreter needs to use a base type which describes a variable. This base type (base as in inheritance) will be used to keep track of the variable being a table or text.
We can easily idealize a table as a piece of text. However, this idealization will probably not hold true within the interpreter. The engine’s internals are regarded as unimportant so long the desired results are achieved anyway.

A complete overview over the data types

  • Text
    • Integer
    • Float
    • Glyph Sequence
  • Table

The nested items under Text are the items that “text” can be simultaneously interpreted as. Idealized as an actual piece of text, implemented most conveniently.

Any chance to see Operator Overloading?

Definitely! Any Table can define its own operators. In fact, it must. There are no standard operators defined for Tables!
Interestingly, the current tokenizer would one to allow the definition of completely arbitrary operators using a specific text (e.g. /_\ or |&|) such as

 a |&| b; 

I suppose that feature is not nearly as important as anything right now so it’ll be postponed.

Therewith, Hath Some Code!

String and integer cooperation:

public a = 3; // Integer
public b = "4"; // Text, single glyph

public c = a + b; // c becomes "7".
c &= a; // c becomes "73"

Table usage:

public a;
public a.b = 9;
public a.c = 
{
	if (b < 100)
		++b;
	if (d exists)
		d &= T"Added one\n";
	else
		private d;
};

a.c(); // Call the method

–a + ++a, Undefined Order?

In C and C++, the order of expression evaluation in which there exist manipulations of the actual values is undefined.

I wish to remedy some of this UD in order to make Cxy safer and more strict.

Currently, --a + ++a will evaluate from left to right. The SSHA (custom Shunting-Yard Algorithm) will evaluate this as:

a __p-- a __p++ +

When the stack-machine is operating, the stack will contain 3 elements after the — and ++ expressions have been evaluated.

refa refa +

Where refa is a reference to a after the operation. Because of the prefixes, no new element is returned.

What just happened?

Firstly, a was decremented, then incremented. After that, we add a and a, which would be the same as without doing anything to a at all.

Is this desired behaviour?

Imagine --a * b + ++a;
This is translated to:
a __p-- b * a __p++ +

Here, the decremented a does indeed have an effect, as the expression where b is multiplied with a is evaluated before the increment operator.

Currently I’m undecided as to this being defined, undefined, or unspecified.

Cxy Functions

I’ve been thinking about functions in Cxy. I have a convenient syntax for creating n-tuples using the (n1,n2,n3,…,nm) notation. An example is a 3-tuple of numbers: (1, 9, 4). Every function gets an argument (a tuple), and returns a tuple.

The comma-operator in this regard appents an item to a tuple. (9, 4, 2) will create the tuple (9, 4), and will append 2, such that we get (9, 4, 2). When we write ((9, 4), 2), then a tuple of 9 and 4 is made and we get: (tuple_1, 4). We can only access the 9 or 4 via this tuple_1 (it won’t be named tuple_n, just an example). To do this, we write:

public name = ((9, 4), 2);
name[0][1]; // Access to 4.

Such than one can easily create nested tuples of elements (more like arrays).

Anyhow, this is useful for functions in Cxy, which I currently think will be declared by the simple syntax:

{};

There is no need for an argument list, that complicates the grammar too much. Instead, we have a standard tuple called args. This tuple can be accessed by simple index operations “[n]” in order to get the arguments. In addition, the tuple can contain the total number of elements (using a size member). This allows one to easily create variadic-argument functions (such as finding the maximum of a list of elements).

Now, functions themselves are pretty small and simple as well:

f =
 {
    if (args.elements >= 1)
    {
         args[0]++;
         return args[0] ^ 2;
    }
    else
        throw invalid_argument;
 };

print(f(3));

The language is changing quite a lot, so it’s still very plastic. I may change a lot of this in the future.

Shunt-Yarding a.b(c.d());

I’ve recently stumbled across a little artifact in my shunting-yard algorithm. This artifact can be observed by attempting to parse
a.b(c.d());

Let’s see what can be done to parse this correctly.
Firstly, the case where we use standard parsing:

a b . _op c d . _op
Execution -> e f

Where the dot operator accesses a member from the previous case.
This is wrong, functions are evaluated left-to-right in Reverse Polish Notation. This means that in our case, we evaluate a.b() first… Could we perhaps allow the reading of right-sided function objects? Would this violate RPN?

Well yes, this does violate RPN and gives us a lot of trouble. c d . is a function that has to be evaluated on its own merit. If we create a stack such that we have to check left and right for function parameters, then we’ll get the problem of order of execution. If we evaluate solely right-to-left, then we end up with the same problem as left-to-right!

The main problem is that variable names are unknown to be functions (functors) or just references to variables. The state of a variable is revealed once a ‘(‘ character is directly superseding the name. This gave me an idea…

I’ve come up with a simple solution to this problem:
create a buffer area between the input and output which can be pushed onto the operator stack if a ‘(‘ is encountered. Here’s an example of how it works:

name goes into queue.
( encountered, name is put onto stack.
) encountered, everything until ( popped to output.

f().b
is parsed into:
f _op b .

So to summarize, here are the 2 core changes to the shunting-yard algorithm:
1. Create a dot-builder (a queue between input and output).
2. Keep track of the last token’s type, was it a name or a functional token? If it was a name token and we encounter ‘(‘, then we have a function call, so we insert _op (which is a function token).

The only problem left is the case where we have

f()(g());

That’ll be solved another time.

Dxy and its Rules

Dxy is the tokenizer in Cxy. It can be considered “Data-XY”. This tokenizer divides a text into a sequence of tokens. The tokens can then be iterated over by the rest of the program. This is practical and easy, but there are some design choices to be made.

The Dxy Tokenization Program currently has the following rules to abide by:

1. Quote-enclosed text
2. A single semicolon, dot, comma, colon, parenthesis, brackets, square brackets
3. Non-whitespace characters in a sequence

It is observed that certain special characters are tokens on their own. The punctuators and groupers are chosen to represent special tokens due to their perceived importance in Cxy.

I think that the differences in what is useful to Cxy and general data storage are different, and therefore I allow the optional argument of a regular expression to the tokenize function. Cxy will internally use a regular expression that is useful to itself, whereas I think the standard Dxy tokenization programme should be as follows:

1. Quote-enclosed text.
2. Non-whitespace sequences.

Which is very simple. Quote-enclosed text can also contain quote by writing “”, which will be evaluated as “. This evaluation happens after tokenization when scanning through each token tho. I think this is the way to go.

Cxy’s Status

I’ve been so busy with university work lately. I have currently finished 2 exams, and 2 more to go within a week.

Cxy has been under some heavy development in the meantime. I have extended the entire concept to the so-called Exy: Encapsulated XY.
Exy is a library encapsulating the entirety of Dxy, Cxy, Bxy, and Axy. I have written it with an extremely simple and straightforward interface in mind, partly inspired by the very clean C++ library known as SFML.

The language itself is also changed. From a low-level styled language to a high-level one. Mainly due to inspiration from JavaScript, Scala, C++, and LISP. These are languages I have been experimenting with, and I think some concepts can be used in Cxy as well.

An example of such a concept is the idea of prototype-based inheritance found in JavaScript. Another is the cleanliness of some LISP iterative structures. Of course, Scala reminds me very much of Cxy in the way that it allows the programmer to construct different constructs to program in. Excuse my ignorance on Scala, this was the explanation given to me by a friend of mine who has experimented with the language, so I may very well be wrong here due to my lack of understanding.

Anyhow, I have simply come to love the design of the JavaScript language. There are a few things I dislike with JavaScript as well, but I will not carry those elements into Cxy. The specific thing I abhor is the idea of immutable strings. I think this is a terrible design choice for a hypertext preprocessor like Cxy.

That’s it for now, I’ll probably post a little more often, in shorter posts from now on.

Regular Expressions and Cxy – A Big Overhaul

Regular expressions allow us to define amazingly complex instructions within a simple, human-readable line. Because of the incredibly useful nature of regular expressions; Cxy also needs them.

I have recently added 2 regular expressions:
– repl
– match

Both are taken from boost::regex and currently use Perl regex syntax. repl replaces all matches in the “cntnt” register with a second register. Here’s an example:

push x
cpy x "\\{"
cpy y "{ try {"
repl x y

The \\ is needed because the compiler will change it to a single \, because the compiler has an implementation of escape characters. The regex engine is thus fed \{. { is normally a special character used for numbered matches.
One can see from this code that it is exceptionally easy to replace as much as possible. I will however limit the replacement to the content between ptr and mrk. repl will not change ptr and mrk even tho it may change the size of the contained data.

match in itself is also a useful:

push x
cpy x ".*?#include .*"
match x
if match end
cpy x "#include "
ins x
:end

This piece of code will paste x if it does not exist in the string.

In the case of the Cxy engine, regular replacement regexes are incredibly useful. It avoids the re-definition of regular expressions in tier 1, which also means that regular expressions are very fast. The reasoning behind this is that the regex engine is implemented in tier 0, which means that it is compiled into native code by the C++ compiler. Altho the regex engine has to interpret or compile the regular expression fed into it; it will be much faster than executing a bunch of Cxy bytecode instructions.

Regarding Bytecode

I see now that it may be more beneficial to change from storing strings to storing integers instead. The compiled will become slightly more complex, but Cxy will improve its efficiency.
Currently, data and instructions are stored like this:

typedef std::map<String_t, std::vector> Data_t; // Requirements: A switch from one String_t to a stack of Register_t.
typedef std::vector Instructions_t; // Requirements: A random access iteratable collection of String_t.

However, I wish to implement Data_t as std::map<Sti_t, std::vector> and Instructions_t as std::vector or std::vector.
The reasoning behind this is that normal strings already contain a number that tells us the string size. Therefore, C++ strings do not have a character at their ends.

How will the execution stage change when this is implemented?
Let’s think about this. All instructions could be stored as single characters. Instead of putting raw data inside the instruction list, I could create a new vector containing “Operating Data” data. For example, Operating Data could contain 32 and 54, and Instructions_t could contain 33. 33 is the “if” instruction. It reads from 32 in Data_t and evaluates the boolean expression. The second Operating Data is the jump point. If 32 evaluates to true, the instruction pointer is changed to 54.

The reason behind all this change is that Cxy is currently not interpreted and there is no need to. There will be some problems with these changes, first of all; there will be quite some bugs. In addition, the compiler will need to change where it puts operating data into. I could also say that there can be no more than 256 symbols per Cxy script. This will be quite limiting, but it will combat the use of Sti_t (which is 8 bytes long).

256 Symbols, minus the builtin registers (22), makes for 234 user defined registers. Let’s not forget that a single register can contain an entire array of data; thus the data usage can be pretty much unlimited. I think I will limit it to the 256 unique symbols for looking up arrays. It seems correct and practical. Most algorithms will not even use more than 10 unique symbols. Even if a single algorithm uses up to 234 symbols; then it could simply push each symbol and re-use it. There is thus no need to more than 234 symbols.
I am however afraid that this argument can fall down a slippery slope. Are symbols at all needed? Why not store everything in a single vector?

First of all; such an implementation will be quite unpractical: we’d need to manage the main vector inside our minds and imagine how it looks. Secondly, using multiple names allows us to easily pass arguments to instructions. Using a single vector for all symbols; where each symbol represents an index; will be counter-intuitive as it will result the disuse of push and pop. In fact, I think that the Data_t can be re-defined as a vector of vectors of registers…

Well, I’ve just spent some minutes thinking about this, and I think I’ve come to a conclusion. There shall be instructions within uint8_t limits, the same goes for the operands. An operand is a pointer to a register stack. The register stack is a vector of Register_t.

This limits the engine, for example, there can not be more than 256 unique “names” as symbols, however, it does allow the engine to use the array in a smart way. One could store a number in “index”. This number can be as large as any 64 bit unsigned variable. This variable can then be used to:

get eax index

What get does is that it fetches the Register_t at eax’s index at,.. well index. Then it stores the result inside get. By allowing this, the concept of an array is introduced. Yet this has gotten me thinking; can’t I just allow an infinite number of symbols to exist? It’s only a 3 byte overhead. If there really is code with that many symbols…

I still have to think about this before I refactor pretty much the entire project, since this is a big overhaul.

Edit:

It will probably not be worth the effort to change the entire instruction sequence, as it will require the storage of the variable containing a relocation of operands once we use jumps and conditional jumps. Surely, it can not be that hard, but it will be another place for bugs to manifest themselves. I’m taking no chances. This will be avoided until it is absolutely necessary.

Should Cxy be Compiled? ~ And Other Random Thoughts…

A thought recently crossed my mind as I was writing the parser for Cxy: “Should Cxy be compiled only?”. Saying that it can be “compiled” is slightly misleading, as the only thing the “compiler” does is:

    1. Replace gotos with integer jump points.
    2. Minimize names of variables to integers.
    3. Turn mnemonics into single bytes.

(Maybe it’s only an _assembler_…)

Interpreting Cxy is quite slow. In most cases, compiling and then executing is faster. The main reason for this is that the jump instructions are blazing fast after being assembled.
In addition, a switch-case (a compile-time jump table) can be used instead of linear if-statements for checking whether a string matches.

I do not think the compiling overhead will have much to say in regard to efficiency. If a commonly used piece of cxy needs to work quickly, it can be pre-assembled and stored in a .bxy file. The ‘b’ stands for “bytecode”. The Cxy virtual machine can then skip compilation and run the code directly. All-in-all, it seems like a nice solution. Not to mention that the virtual machine will be smaller and more compact. The current debug build of it is 15MB, which is very large for a 3k line project. It is assumed that it is due to GCC’s inability to efficiently link and work with ~80 source files. I guess I’ll have to build with MSVC to see how it handles the many files in the project. The release version is currently ~850 KB.

Should Strings Be Used To Access Registers?

Since I compile/assemble anyway; wouldn’t it be best to use raw integers to store data in the registers? Let’s keep in mind that std::string’s implementation – on most platforms – uses an internal std::size_t integer to keep track of the string’s length. If I were to use strings to store data, there could potentially be multiple characters to check against each other. The advantage of using a single number for each registers seems to outweigh the loss of possible space. I could use a byte string of a variadic integer, but that would imply some overhead, as I have to parse each sequence of bytes correctly. It may be better to simply store the full representation as an unsigned 64 bit value. 64 bits is quite large tho, I seriously doubt any cxy code will use more than even as few as 2^16 registers. Perhaps a variadic integer is preferred. The problem with using a 64 bit integer is that it does simply not work on some platforms. I’m thinking about storing bytecode as variadic integers and letting the implementation decide what type to parse it as, be it uint8_t, uint16_t, uint32_t or even uint64_t. Maybe some implementations will prefer working with the signed version. Perhaps it is faster than the unsigned version. Who knows.

Adding Instructions

I’m quite wary about adding instructions. The users of Cxy should not have to worry about these too much, as the instructions are tier0 Cxy, which one should not fiddle with. Or should one? I am planning on defining a header called “tier1” that can be included. This header will provide a single, thin layer of abstraction over the current assembly mnemonics. This means I can change the underlying assembly instructions without having to bother rewriting all the code, I’d only need to rewrite or replace some lines in tier1.

I’m thinking of adding instructions for handling a map register, or any other convenient storage structure. The current stack model is quite useful already, but I plan on the user being able to retrieve a position from the stack. The stack will thus become a dynamic arrays instead. The advantage to this is that we can for example store all locations of occurrences of the parsed string “cout”, and retrieve their positions by simply providing an integer of the first + X “cout” we would like to know the position of.

The Problem of Classes

A big problem for me is how to implement what’s known as a “class”. Currently there is 1 POD and 1 class; “unsigned integer” and “string” respectively (They do not necessarily represent unsigned int and std::string). These work together flawlessly because any unsigned integer can point to a location in a string! If I am going to implement floating point values, or negative values, then I risk losing both speed and safety. Is there any reason to implement floats or negatives at all?

Functions and Higher Order Classes

I’m definitely planning on implementing functions in the future. These could be defined by a “function” header. This header could translate all code that contains the keyword “fn” and replace it by appropriate gotos and returns. Higher order classes will be very interesting to me. What if it becomes possible to define a header that translates a “class” keyword (with appropriate data, of course)? I suppose I could go and instantiate a new object of a class, and even call its members. I think it’d be best to start with a lower form of a class. Namely a struct. It will be interesting to observe how I could create a function and pass a variable to it. Then I can go up one level and pass a struct object to it.

Integration with an IDE

One thing that crosses my mind constantly is that it’d be awesome if Cxy were to be integrated with an IDE. All a user has to do is click compile (or use a shortcut), and Cxy is automatically run before the compiler runs. Pragmaticality and practicality are important in a language’s usage, so I’ll definitely look into ways of getting Cxy integrated with my current IDE of choice, which is named “Code::Blocks”.

Oh well, when I was in the middle of writing this, I stabilized and finished the entire parser and compiler. It appears very stable now, I’ve tested it using multiple advanced setups and it produces desired results. The repository is updated, and can be found under the master branch here https://github.com/BourgondAries/Cxy