A draft submission to ACM Software Language Engineering Conference 2017
The purpose of software languages is to help humans communicate with machines. To achieve this, a language has to be regular and sufficiently well defined that both parties understand what the other is saying. At present, however, the computer is programmed in one language, while its error messages follow a different protocol (a mini-language).
Homo sapiens language has been evolving since at least Mitochondrial Eve, who lived possibly one or two hundred thousand years ago. There are already over 6,000 human languages, so it is acceptable to add another one, which happens to also be a general-purpose software language.
The goal of the software language is complete vertical integration, so that if one were to reincarnate into a robot, everything from the lowest to the highest levels could be accomplished using the same language. This is similar to how English can be used to communicate everything from the lowest to the highest levels.
Several implementations of this idea have been made over the last decade. This paper uses Pyash as the word for the pivot language (intermediate representation). The word means language and is the result of data mining world language vocabularies ([vocabulary]).
Natural English is not formal enough to be used directly as a software language; Pyash English, however, is. Pyash also acts as a bridge for high-precision translation, so Pyash English documents could be rapidly and precisely translated to Pyash Hindi, Pyash Spanish, Pyash Swahili, or any other supported human language.
This high precision translation could open the door to software languages for the majority of humanity which is not fluent in English.
This section will review some of the common references and mention how Pyash is different from them.
The main difference is that Pyash is based on the fundamentals of human language, and has a complete and orthogonal vocabulary.
COBOL, originally intended to be a business programming language, was designed by several committees. Some of the committee members were unfamiliar with computer programming and-or linguistics, and the committees also had issues with discontinuity of personnel. This led to a language that was neither very good for computer programming nor very easy for humans to understand, and whose earlier decisions were repealed as personnel changed.
By contrast, Pyash’s design takes into consideration many academic and real-world sources for its grammar ([grammar]), vocabulary ([vocabulary]) and instruction set architecture ([ISA]). Its evolution has also run through several iterations, though all with the same informed personnel to keep it on track.
HyperTalk uses English keywords to replace common programming syntax symbols, so it is largely just a relexification of standard (ALGOL-inspired) programming.
Pyash on the other hand starts with a human grammar base and then adapts it to usage as a software language.
Lojban is a language intended for human use, but based on the structure of programming languages, in particular predicate logic. Because of this, Lojban is more of an API than a human language, which makes it very difficult to gain fluency.
While there have been some cursory motions towards making Lojban a programming language, none have gotten much past the concept stage.
The net result is that Lojban has not proven suitable for human communication or as a software language, though it has been a useful stepping stone and point of inspiration.
Many different approaches have been taken to the creation of software languages. Rather than being based on the Chomsky hierarchy of formal languages and formal grammars, Pyash is based on human grammar.
Linguistic Universals are patterns found systematically across large groups of languages, possibly all languages. In particular all languages have verb phrases and noun phrases, and mark their phrases either with placement, adpositions or affixes. All can also express tense, mood and aspect.
However there is the issue of making the pivot language. Which of the many options should the language use? To the rescue comes the World Atlas of Language Structures (WALS), which allows one to see what are the most common features around the world.
In particular Pyash is Verb-final, or Subject-Object-Verb word-order, similar to Hindi, Japanese and Amharic. Linguistic Universals point toward suffixes and-or postpositions for verb-final languages, so they are used.
But what of the grammar words themselves? A variety of contenders were reviewed, such as Universal Networking Language from the United Nations University, and FrameNet from Berkeley. A more organic solution was chosen, consisting of the list of Glossing Abbreviations used by linguists when transcribing foreign languages.
Contemporary software languages generally lack a root vocabulary. Keywords may have a special meaning, but they are typically of a syntactic or grammatical nature, so they are at most a grammatical vocabulary. Because API names are conventionally arbitrary series of unreserved letters, all unreserved words are effectively proper nouns.
Pyash has a root vocabulary so that documentation, description and discussion can all happen in the same language as computer programming. The encoding requires API names to be words with a proper morphology ([morphology]), and may be restricted to only being official dictionary defined ones, ensuring standardization and ease of translation.
To generate the vocabulary first several word-lists were put together, including WordNet core, Oxford-3000, UNL-core, Special English, FrameNet, New Academic Word List (NAWL), New General Service List (NGSL) and Project Gutenberg Frequency List. After collating them all and taking out the duplicates, the language was left with almost 39 thousand words.
The Google Cloud Translation API was used to translate each word on the list individually into the top 48 languages by number of native speakers, giving an overall coverage of greater than 70% of the world population.
A script was made to sort the vocabulary by the frequency list and filter it for uniqueness. A word was removed if any of the following held:
More than 38% (see footnote 1) of the languages use the English term.
It means multiple things in more than 38% of the languages.
It is a homograph of an already defined word in any of the languages.
This left the language with a fairly orthogonal pool of about eight thousand words.
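The three removal rules can be sketched as follows. This is a minimal illustration rather than the script that was actually used: the function name `filter_vocabulary`, the tuple layout of each entry, the per-language sense counts, and the toy three-language data are all assumptions made for the example.

```python
# Cut-off from footnote 1: 2 - phi, roughly 0.382.
GOLDEN_CUTOFF = 2 - (1 + 5 ** 0.5) / 2

def filter_vocabulary(entries):
    """Keep words that pass all three uniqueness rules.

    entries: list of (english, translations, sense_counts), most frequent first.
      translations: {language: translated word}
      sense_counts: {language: number of meanings} (hypothetical input)
    """
    seen = {}   # language -> translations already claimed by kept words
    kept = []
    for english, translations, sense_counts in entries:
        total = len(translations)
        # Rule 1: too many languages simply reuse the English term.
        borrowed = sum(1 for t in translations.values()
                       if t.lower() == english.lower())
        # Rule 2: the word is ambiguous in too many languages.
        ambiguous = sum(1 for lang in translations
                        if sense_counts.get(lang, 1) > 1)
        # Rule 3: homograph of an already accepted word in any language.
        homograph = any(t in seen.get(lang, set())
                        for lang, t in translations.items())
        if (borrowed > GOLDEN_CUTOFF * total
                or ambiguous > GOLDEN_CUTOFF * total
                or homograph):
            continue
        kept.append(english)
        for lang, t in translations.items():
            seen.setdefault(lang, set()).add(t)
    return kept

# Toy data, three languages only, already sorted by frequency.
TOY = [
    ("water", {"es": "agua", "fr": "eau", "de": "Wasser"}, {}),
    ("taxi", {"es": "taxi", "fr": "taxi", "de": "Taxi"}, {}),
    ("aqua", {"es": "agua", "fr": "aqua", "de": "Aqua"}, {}),
    ("fire", {"es": "fuego", "fr": "feu", "de": "Feuer"}, {"es": 2, "fr": 2}),
    ("stone", {"es": "piedra", "fr": "pierre", "de": "Stein"}, {}),
]
print(filter_vocabulary(TOY))  # ['water', 'stone']
```

Processing the list in frequency order means that when two words collide as homographs, the more frequent one keeps its place.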
The pivot language needs to be easy enough to speak that humans can use it in conversation. This was a particular concern in early prototypes, before it was realized that the pivot language could be used for translating between possibly all human languages. That would remove the need to actually learn the pivot language, since a Pyash controlled natural language would be sufficient.
|ASCII||IPA||description||English example|
|a||ä||central open vowel||arm|
|b||b||voiced bilabial plosive||ball|
|c||ʃ||unvoiced post-alveolar fricative||shout|
|d||d||voiced alveolar plosive||door|
|e||e̞||mid front unrounded vowel||enter|
|f||f||unvoiced labio-dental fricative||fire|
|g||g||voiced velar plosive||great|
|i||i||unrounded closed front vowel||ski|
|j||ʒ||voiced post-alveolar fricative||garage|
|k||k||unvoiced velar plosive||keep|
|o||o̞||mid back rounded vowel||robot|
|p||p||unvoiced bilabial plosive||pan|
|r||r||alveolar trill||(Scottish) curd|
|s||s||unvoiced alveolar fricative||snake|
|t||t||unvoiced alveolar plosive||time|
|u||u||rounded closed back vowel||blue|
|v||v||voiced labio-dental fricative||voice|
|w||w||labio-velar approximant||water|
|x||x||velar fricative||(Scottish) loch|
|z||z||voiced alveolar fricative||zoom|
|6||ə||mid central vowel|
First, an alphabet representing phonemes that are common in human languages was required; PHOIBLE was used for this. Then the WALS chapters on phoneme inventories were used to find a common ratio of consonants to vowels, as well as common numbers of consonants and vowels, and the most popular single phonemes that are reasonably distinct were picked. Two tones were also included to increase the number of words. Two clicks were included for temporary document-specific words, in place of acronyms. An ASCII letter was also selected for each IPA phoneme (Table [table:phonology]) to make sure Pyash is web compatible.
Second, a morphology describing how the phonemes are put together to make words was required. For this, the phonotactics of the sonority scale were used, paired with the WALS chapter on syllable structure.
|pattern||example||IPA||word type|
|CV||ka||/kä/||short grammar word|
|CSVH||kyah||/kjäʰ/||long grammar word|
|HCVC||hkap||/ʰkäp/||short root word|
|CSVC||kyap||/kjäp/||long root word|
C: a consonant.
H: /ʰ/, aspiration, or spectrographically an unvoiced vowel.
S: a consonant of higher sonority than the preceding one.
V: a vowel (highest sonority).
The language was also made easy to parse even if there are no spaces or pauses between words. Each word is either two or four letters long. The two-letter words start with a consonant and end with a vowel, and the four-letter ones start with two consonants and end with a consonant (Table [table:morphology]).
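The following sketch shows why this two-or-four-letter rule is enough to split text that has no spaces. It is an illustration under stated assumptions, not the project's tokenizer: the function name `segment` is hypothetical, the vowel letters are taken from the phonology table, and tone marks and clicks are ignored.

```python
# ASCII vowel letters from the phonology table ("6" stands for the schwa).
VOWELS = set("aeiou6")

def segment(text):
    """Split unspaced Pyash text into two- and four-letter words."""
    words = []
    position = 0
    while position < len(text):
        if position + 1 >= len(text) or text[position] in VOWELS:
            raise ValueError(f"no valid word starts at offset {position}")
        # A vowel in second place means a two-letter (CV) word;
        # a second consonant means a four-letter word.
        length = 2 if text[position + 1] in VOWELS else 4
        word = text[position:position + length]
        if len(word) != length:
            raise ValueError(f"truncated word at offset {position}")
        if length == 4 and (word[2] not in VOWELS or word[3] in VOWELS):
            raise ValueError(f"malformed four-letter word {word!r}")
        words.append(word)
        position += length
    return words

print(segment("minaryopyisyutkakwinli"))
# ['mi', 'na', 'ryop', 'yi', 'syut', 'ka', 'kwin', 'li']
```

Only the second letter of each word has to be inspected to know where the word ends, so no backtracking or dictionary lookup is needed.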
The valid words were generated with several alphabets, and a script assigned words based on the phonemes in the source languages, weighted by their native-speaking populations. The highest-frequency words were assigned to the smaller alphabets, which are easier to pronounce and understand. Rarer words were assigned to the more difficult extended alphabets, for instance those with voice contrast and-or tones.
For complete vertical integration the language has to boil down to machine-level instructions, or an instruction set architecture. The JVM bytecode is an example of a different language which can also be implemented as an instruction set architecture.
Since the future of computing is heading towards parallelism, much research was done into how to make the language as parallel-friendly as possible. In particular, the Heads and Tails ISA was found to be quite inspiring.
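Generating the valid word forms for one alphabet can be sketched as below. This is a simplified illustration: the function name `generate_forms` and the sonority ranks are assumptions chosen for the example, and the population-weighted assignment of meanings to forms is not reproduced.

```python
from itertools import product

VOWELS = "aeiou"
# Hypothetical sonority ranks (higher means more sonorous).
SONORITY = {"p": 1, "t": 1, "k": 1, "s": 2, "f": 2,
            "m": 3, "n": 3, "l": 4, "r": 4, "w": 5, "y": 5}

def generate_forms(pattern, consonants):
    """List every word form matching a pattern such as "CV" or "CSVC"."""
    choices = {"C": consonants, "S": consonants, "V": VOWELS, "H": "h"}
    forms = []
    for letters in product(*(choices[symbol] for symbol in pattern)):
        # An S slot must be more sonorous than the consonant before it.
        rising = all(SONORITY[letters[i]] > SONORITY[letters[i - 1]]
                     for i, symbol in enumerate(pattern) if symbol == "S")
        if rising:
            forms.append("".join(letters))
    return forms

print(generate_forms("CV", "kt"))
# ['ka', 'ke', 'ki', 'ko', 'ku', 'ta', 'te', 'ti', 'to', 'tu']
print(len(generate_forms("CSVC", "kpy")))  # 30
```

A smaller consonant set yields fewer but easier forms, which is consistent with reserving the small alphabets for the most frequent words.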
Each Pyash word fits in sixteen bits (a uint16_t). There are four word types and one quote type which are encoded. The quote type allows for including literals.
|Pyash English||do say the quoted’word’hey world’word’quoted.|
|Pyash||zi.wo.hwacwu.wo.zika hsactu|
|Codelet||0051 291D E928 28BE 245E E948 295E 0000 0000 0000 0000 0000 0000 0000 0000 0000|
|Codelet explained||(0051 index) (291D quoting two words) (E928 28BE hwac wu) (245E ka accusative-case) (E948 hsac say) (295E tu deontic-mood) 0000 0000 0000 0000 0000 0000 0000 0000 0000|
|Output with en_US locale||hey world|
|Output with ru locale||эй мир|
For parallelism, sentences are encoded into codelets, which consist of one or more vectors of sixteen 16-bit values. The first 16-bit value of a vector is the index for the vector, marking the location of grammatical cases and moods (the ends of noun and verb phrases).
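Reading the index of the hello-world codelet given above can be sketched as follows. The bit numbering is an assumption inferred from that single example: bit n of the index, counted from the least significant bit, is taken to mark slot n, and bit 0 is taken to flag the final vector. The function names are hypothetical.

```python
# The sixteen 16-bit values of the hello-world codelet quoted in this paper.
HELLO_WORLD = [
    0x0051, 0x291D, 0xE928, 0x28BE, 0x245E, 0xE948, 0x295E, 0x0000,
    0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
]

def phrase_end_slots(vector):
    """Return the slots that the index in slot 0 marks as phrase ends."""
    index = vector[0]
    return [slot for slot in range(1, len(vector)) if (index >> slot) & 1]

def is_final_vector(vector):
    """Assumed reading: the lowest index bit flags the last vector."""
    return bool(vector[0] & 1)

for slot in phrase_end_slots(HELLO_WORLD):
    print(slot, format(HELLO_WORLD[slot], "04X"))
# 4 245E
# 6 295E
```

Under this reading the index 0051 points at 245E (ka, accusative case) and 295E (tu, deontic mood), which agrees with the explanation of the codelet.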
mina ryopyi syutka kwinli
me NOM robot DAT liberty ACC giving REAL
I be giving the liberty to robot.
This encoding can then be translated to any supported human language (Table [table:translation]). In terms of compiling to a programming language, it compiles to OpenCL C. There is also a design for a code-parallel virtual machine that can process linear code on GPUs using the Pyash ISA.
The encoding could also be used for storage of information, similar to a database, as well as for knowledge management, similar to how human languages are used for storing information.
The parser is probably of some interest due to its refined simplicity. It is a hand coded, single pass type, modeled on how a human would parse text. There are no parse trees or any such complexities.
First the parser checks whether a word is a valid Pyash word. If so, it checks whether the word is a grammatical-case word, a grammatical-mood word or a quote word; if it is none of these, the parser simply adds it to the codelet.
If it is a quote word, the parser acts accordingly on either the literals ahead or the words behind, adding what is necessary to the codelet and moving the codelet and text index pointers to just after the quote.
If it is a grammatical-case word, then in addition to adding the word to the codelet, the parser also marks it on the index.
If it is a grammatical-mood word, the parser does as with the grammatical-case word but also ends the codelet. The exception is the conditional mood, which is treated the same as a grammatical case for encoding.
For reading and writing to the codelet there is a function which manages which vector is being added to. If the addition overruns one vector, then its index is inverted, and the next vector receives the additions. This way, when reading indexes, the end of the codelet is known from the first bit of the index: if it is a one, then it is the final vector.
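The single-pass procedure can be sketched as below. This is a simplified illustration, not the OpenCL C implementation: words are kept as text because the 16-bit word encoding is not given here, quote words, the conditional mood and multi-vector codelets are left out, and the function name and the bit layout of the index (least significant bit first, bit 0 as the final-vector flag) are assumptions.

```python
CASE_WORDS = {"na", "yi", "ka"}  # nominative, dative, accusative (from the gloss)
MOOD_WORDS = {"li", "tu"}        # realis, deontic (from the examples)
VECTOR_LENGTH = 16

def encode_sentence(words):
    """Single pass over one sentence; returns (index, slots)."""
    index = 0
    slots = []
    for word in words:
        slot = len(slots) + 1  # slot 0 is reserved for the index itself
        if slot >= VECTOR_LENGTH:
            raise ValueError("multi-vector codelets are outside this sketch")
        slots.append(word)
        if word in CASE_WORDS or word in MOOD_WORDS:
            index |= 1 << slot  # mark the phrase end on the index
        if word in MOOD_WORDS:
            index |= 1          # assumed final-vector flag
            return index, slots
    raise ValueError("sentence has no grammatical-mood word")

index, slots = encode_sentence("mi na ryop yi syut ka kwin li".split())
print(format(index, "04X"))  # 0155
```

With a placeholder standing in for the quote word, the words hwac wu ka hsac tu of the hello-world sentence give the index 0051 under these assumptions, the same value as in the codelet shown earlier.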
This simple parser/encoder could parse/encode sentences in parallel, and should be adaptable for parsing spoken streams of phonemes. A more complicated version of the parser/encoder will be necessary once support is added for subordinate clauses, since they would have to be broken up into multiple codelets for the encoding.
Several variations of the language have been worked on since 2007. The first implementation was in Haskell and the second was in Java; both were recursive parsers.
The third implementation followed the Jones Forth model, in the hope of bootstrapping something small and scalable. Intel assembly was used for a few years and succeeded in producing a basic interpreter.
The fifth and current attempt was motivated by the realization that something fast, scalable and future-friendly was needed, so a parallelizable ISA ([ISA]) was designed and the implementation was done in OpenCL C. As of this writing (May 2017) it compiles hello world and handles variable assignment and for loops; function declarations are being implemented.
While the main focus of the current implementations has been computer programming languages and related documentation, the language can be used to cover the areas of other software language types as well.
For example, SQL, the database access and creation language, can easily fit as a subset of Pyash with some slight vocabulary changes (Table [table:SQL]). Thanks to the rather fortunate grammatical-case design of SQL, it should be possible to translate from SQL to Pyash and vice versa, whereas for most languages in the placement-based parameter family this is a non-trivial process.
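The practical difference between case-marked and placement-based parameters can be sketched with the gloss example from earlier in the paper. The sketch assumes that case-marked noun phrases may appear in any order before the verb, which the text implies but does not state; the function name and the role labels are illustrative.

```python
CASE_ROLES = {"na": "nominative", "yi": "dative", "ka": "accusative"}
MOODS = {"li": "realis", "tu": "deontic"}

def parse_roles(words):
    """Collect each phrase under the case or mood word that closes it."""
    roles = {}
    phrase = []
    for word in words:
        if word in CASE_ROLES:
            roles[CASE_ROLES[word]] = phrase
            phrase = []
        elif word in MOODS:
            roles["verb"] = phrase
            roles["mood"] = MOODS[word]
            phrase = []
        else:
            phrase.append(word)
    return roles

original = parse_roles("mi na ryop yi syut ka kwin li".split())
reordered = parse_roles("syut ka ryop yi mi na kwin li".split())
print(original == reordered)  # True
```

Because every argument carries its own role marker, a translator does not need to consult an API description to learn which position means what.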
For knowledge representation or ontology languages, the databases could simply be made of Pyash codelets. They could be rapidly queried in parallel on GPU for any particular piece of information. They could be translated to and from human language, for sharing gathered knowledge with humans, or acquiring knowledge from humans.
Even a few people having a conversation, such as at a meeting, could generate programs and-or machine knowledge if they were speaking with enough formality to be Pyash accessible.
Pyash accessibility is currently rather low, since the grammar is rather strict. But with machine learning algorithms to help convert natural language speech into Pyash controlled natural languages, the amount of machine-accessible knowledge that could be harvested from the spoken and written word should dramatically increase.
Considering that Gellish is a modeling language, and that Pyash has a much more developed grammar, it should be fairly straightforward to adapt Pyash to be a universal modeling language.
For visual people, graphics could be generated from Pyash descriptions. So in the hypothetical scenario of some people talking in a meeting, the computer could be projecting the model of what is described on the screen. Or running and showing simulations to see the potential outcomes of various policy or program changes.
The majority of domain-specific languages seem to have placement-based parameters. This means that reading the API is likely necessary to understand how to use any function. Thus, unless the API is written in Pyash or some other machine-accessible format, translating to and particularly from those languages to Pyash is non-trivial.
Translating to those languages is easier, as a human can read the API and make an appropriate Pyash-side function to access it. However, if someone adds a new function to that other language without following something like the Pyash function naming convention, then it will be nearly impossible to translate to Pyash without reading its corresponding API and-or analyzing its code.
Possibly, when machine learning and AI get sufficiently sophisticated, they will be able to do those translations, but that is quite possibly decades away.
For now it makes sense to limit official Pyash programming development to compiling to popular C libraries, and also making native libraries.
There is a wide range of communication protocols, all serving their own niches, for example HTTP, SMTP and IRC.
With the advent of XML there was an increase in protocol creation, for example XMPP, SOAP and XML-RPC. However, since XML doesn’t have a root vocabulary, most of these protocols have different naming conventions and so are not easily interoperable.
XML is also rather bulky, so in certain areas, such as configuration and data storage, more compact alternatives such as JSON, Lua and YAML have gained ground. Like XML, though, they lack a root vocabulary.
Pyash does have a root vocabulary so it is fairly straightforward to use as a communication protocol. Having the root vocabulary could encourage people to extend the language rather than make entirely new protocols. The binary encoding of Pyash, which can store various types including binary data, is both compact and can be decoded into a human readable format in a variety of human languages.
In terms of space usage, Pyash is likely to be bulkier than any of the early terse protocols like HTTP, but will typically use less space than XML, approaching JSON or YAML depending on the length of the names used.
The goal of using Pyash for protocols is to make it easier to collect, consume and process large amounts of data, especially now that many of us have more storage and processing power than we know what to do with. For example, many people have powerful GPUs in their computers, which most of the time sit relatively idle.
encoding:570:text_encoding debug text
from encoding file at num five seven zero line in text encoding ceremony the debug text be emitting.
kfinhfaspwih hfakhsipzrondo lyinlwoh htetkfinsricnwih dyekhtetka mwa7nli
Since the error reporting was mentioned earlier, here is an example (Table [errorMessage]). Though the Pyash versions are longer, they are more portable, and non-English speaking people can help debug the program, as the Pyash could be translated to an approximation of their native language.
Additionally a variety of protocols could be translated into Pyash, not necessarily so they would be faster, but to make it easier for an AI or AGI to understand and communicate using them.
LaTeX, HTML and Markdown are some of the most popular markup languages on the internet today. Of course they are mostly for formatting, and do not include a vocabulary for the content.
However for writing modern documents, it is often important to have chapters, sections and subsections. Spoken speech has an (arguable) analog of bold and italics, via the focus and topic of the sentence — which is already a part of Pyash grammar. However spoken language generally doesn’t have long enough monologues for people to even mark their spoken paragraphs.
The grammar of Pyash could be extended enough to allow for such mark up. An example would be to make a grammar word for paragraph, module (section), and frame (chapter).
Pyash as a markup language would be particularly useful in using Pyash for writing international content, such as stories, news articles or even legislation.
A software language based on the fundamentals of human language that is usable for human communication and computer programming is certainly viable and implementable, as it has been done.
Translating all or most human languages, or at least controlled variants of them, does on the surface appear viable, though further research would be needed to see what level of conjugation native speakers are comfortable with and how long it would take them to adapt to the controlled variants.
Translating everything between software languages is unfortunately not viable due to their much smaller scope, as they can’t be used for human communication, though existing codebases can be used via a foreign function interface.
Complete vertical integration of everything that a computer might need to do seems to be viable, though further work would need to happen to prove it.
This implementation of the language seems to be satisfactory. Language adoption is a major hurdle, which motivates this article. Pyash is being used to write an automated programmer to more quickly write the standard libraries, with a general intelligence operating system to follow.
1. 2 − ϕ ≈ 38%, where ϕ is the golden ratio, approximately 1.618. A golden fraction was felt to be a natural choice.