Table of Contents
Open Table of Contents
How Asemica Operates
Before we begin I’d like to state that the algorithm was formulated by a person called Danne Stayskal and the original implementation (in Perl) is available in their GitHub profile. I cannot recall how I found it initially but it was only until recently that I decided to try and figure out how it works by re-implementing it in a language I am more familiar with.
The reason I waited until recently to start my re-implementation effort was because of needing to learn a bit of Perl to use for scripting on Linux. Perl is a bit famous (infamous?) for being hard to read but after only going through a few chapters of Learning Perl and a healthy amount of comments in the original source for Asemica, I was able to understand how it works enough to write my own version.
Asemica relies on a regular document of text as a key. In order to understand how it works it is necessary to understand how a document is used as a key. The first step of the Asemica algorithm is turning the document key into a table of unique words present in the document. For any given word we keep track of all the words that appear directly after it anywhere in the text and how many times they do. For ease of reference the sequence of words-that-follow will be called doors and exits when the sequence includes the occurrences.
After this mapping of words and their associated information - which we’ll refer to as the transition table - is created, those words with 15 or more different words following it are considered meaningful as they are able to represent a half a byte of information. All others are considered meaningless. This table is used to carry out both encryption and decryption.
Encryption
During encryption the plaintext is converted into a sequence of bytes which are further converted into binary strings then grouped into half-byte numbers (nibbles)1. Then beginning a randomly chosen word from those present in the document key which we will refer to as a token the following series of steps is repeated until the nibble sequence is empty:
- The current token is appended to the ciphertext result string.
- If the token is meaningful then a single value is popped from the sequence of half-byte numbers and its integer value used to index the list of doors. This word becomes the next token.
- If the token is not meaningful we randomly pick one of the doors and use that as the token in the next iteration.
As we can see moving from one token to the next is wholly dependent on the value of the previous token which is exactly the type of relationship that a Markov chain describes. The probability of moving to a state in doors is
{{ “1/n” | formatMath | safe}}
where n is the length of the doors sequence and that of moving to a state not in doors is 0.
Decryption
During decryption, the ciphertext is first split into a list of words. The words list is iterated over until the second last element and the following steps are carried out:
- Check if the current word is meaningful skipping it otherwise since it could not have been used to encode a nibble.
- Try and find the next word’s index within the current word’s doors.
- This index is converted into a binary string and appended to the plaintext result string.
At the end of the loop the plaintext string is just a long binary string and not the intended message. To convert back into whatever was encrypted the binary string is divided into groups of 8 (binary) characters. These characters represent a single byte and are converted back into the ASCII value they represent. The resulting string is the original message.
Concrete Implementation
To implement Asemica I chose the Factor programming language which is described on the homepage as:
a concatenative, stack-based programming language with high-level features including dynamic types, extensible syntax, macros, and garbage collection.
Apart from the features listed above one more that I think is useful is the listener - basically an enhanced REPL - which is helpful for interactive development.
Representing A Transition
The first detail to work out is how to represent the information present in each entry of the transition table. For this we’ll use a Tuple which is similar to data classes in other languages. The syntax is pretty straightforward2:
TUPLE: transition
{ token string } ! raw string as it appeared
{ seen integer } ! how many times token has been seen
{ door vector } ! words after this one in document key
{ doors integer } ! length of door vector
{ exits assoc } ! words after plus occurrence count
meaningful ! true or false depending if doors > 15 ;
The following is a transition value for one word. The key file used was Metamorphosis by Franz
Kafka downloaded from Project Gutenberg as a text file:
{
"token": "eBook",
"seen": 12,
"door": ["of", "is", "or", "Details", "First", "METAMORPHOSIS", "complying", "for"],
"doors": 8,
"exits": [
["of", 2],
["is", 2],
["or", 2],
["Details", 1],
["First", 1],
["METAMORPHOSIS", 2],
["complying", 1],
["for", 1]
],
"meaningful": false
}
Generating the Transition Table
The next step is actually building the transition table. To do this as we saw earlier requires
iterating through all the words in the document key and checking which word follows each and
updating the relevant transition tuple. Rather than using an indexed loop on a single stream of
tokens I instead opted to use the 2each combinator3 which requires two sequences and a
quotation on the stack. It is roughly equivalent to the following Python snippet:
def two_each(seq1, seq2, f):
for a, b in zip(seq1, seq2):
f(a, b)
# example usage
two_each(range(5), range(1, 5), print)
# example output
0 1
1 2
2 3
3 4
We can already guess how to pair up each word with the one that follows it and run a function on it4.
: 2tokens ( str -- a b )
clean-string " " split dup rest ;
:: generate-transitions ( a b -- table )
{ } >alist :> ttable! ! create empty transition table
a b ! tokens
[
:> value
:> key
! get transition tuple if it exists or make new one
key >lower ttable new-or-existing
key >>token
! update with information
value swap update-transition-info
! modify the transition table
key >lower ttable assoc-add ttable!
] 2each ttable set-meaningful ;
The 2tokens word produces the two sequences we need from a string after first cleaning the string
(this is just removing extra spaces and newlines among other things). These two sequences are used
as input to the generate-transitions word. Unlike most other stack-based programming languages,
Factor allows using lexical variables ( ::, :> ) which can be a lot more convenient than
manipulating multiple different stack values.
Encryption and Decryption
The code responsible for encryption is not very complicated if you keep a few of these Factor specific things in mind:
- How lexical variables (including mutable ones) work .
- Arguments to words are taken from the stack.
lengthdoes what you expect andindex array nthisarray[index]in Python.key assoc atis the same asassoc[key]whereassocwould be a dictionary in Python.seq unclipputs the first element and the rest of the sequence on the stack in reverse order.meaningful>>is just how you access the slot namedmeaningfulin a tuple5.
The code for encryption is:
:: encode ( message transitions tokens -- encmessage )
! start with a random token
tokens length safe-random tokens nth :> token!
"" :> encmessage!
! transform input into sequence of nibbles
message unpack nibbles :> nbs!
[ nbs empty? ] [
{ encmessage token " " } concat encmessage!
token >lower transitions at meaningful>>
[
! encrypt a nibble into a word
nbs unclip
token >lower transitions at door>>
nth token!
nbs!
]
[
! add random word to output without encrypting nibble
token >lower transitions at doors>> safe-random
token >lower transitions at door>> nth token!
] if
] until { encmessage token " " } concat ;
For the decryption word other than what we have already mentioned keep in mind:
- Although iteration over structures is preferred over indexed iteration the latter is still
possible by using
<iota>(similar torangein Python) with theeachcombinator. findtries to find a value in a sequence that satisfies a quotation and puts the index and the value on the stack or false for both.whenis like an if statement without an else section.
The decryption code is:
:: decode ( input transitions tokens -- decmessage )
"" :> decoded!
input clean-string " " split :> words
words length 1 - <iota> [
:> i
i words transitions at-nth meaningful>>
[
! word encodes a nibble so try and find its value
i words transitions at-nth door>> [
>lower i 1 + words nth >lower =
] find
:> elt
:> idx
elt [ idx make-nibble decoded swap append decoded! ] when
] when
] each decoded pack ;
One final detail necessary for understanding how encryption and decryption is done is the
implementation of pack and unpack. They are written to mimic the behaviour of pack and
unpack in Perl with b* as the template string. Here’s unpack:
: unpack ( str -- newstr )
[ >bin 8 zero-pad reverse ] { } map-as "" join ;
It converts a string such as “hello” into a binary string such as
“0001011010100110001101100011011011110110”. It does this by converting each character into a binary
string (>bin) then padding the string with zeros until it is at least length 8 (8 zero-pad). It
then reverses the string to turn it into little endian form (least significant bit first). The
result for each character is concatenated together into a single string.
pack is slightly longer:
:: pack ( str -- seq )
0 str length 8 <range> [
:> start
start 8 + :> end
end str length >
[ str length ] [ end ] if
start swap str subseq reverse bin>
] { } map-as but-last >string ;
It converts a string such as “0001011010100110001101100011011011110110” back into “hello”. It does
this by iterating through the string in groups of 8 characters (representing 1 byte / a single ASCII
value). The 8 characters form a string which is reversed (to turn it back into big endian form)
before converting it back into an integer (bin>). The result is a sequence of numbers which can
then be converted into an ASCII string (>string).
After adding a few other bits and pieces like argument parsing typical usage of the program is shown below:
$ echo "Pwagu hupata pwaguzi" | factor-vm -i=cust.img asemica.factor \
-m enc -c metamorphosis.txt -o enc.txt
$ cat enc.txt
them It was ever was ever mentioned as they have expected something to
make use several hours on his many steps and if somebody came to make
use part and with all at seven I'll get enough of luxury For help her
breast His bed perhaps any moment and when her Would go of an inch His
right to just lay flat that gentle voice as clearly asked each other
Whatever he thought It
$ factor-vm -i=cust.img asemica.factor -m dec \
-c metamorphosis.txt -i enc.txt
Pwagu hupata pwaguzi
Appendix
- The entire code is available on GitHub.
- As described in the
READMEof the GitHub project generating an image and passing it as an argument to the factor VM can significantly improve runtimes. - The choice of document key can lead to shorter ciphertexts with a longer text generally being better and some texts being completely unsuitable for use as a key.
- A very good introduction to Factor is available here.
- This article describes some of the reasons why concatenative programming is useful.
- Factor has a lot of libraries so browsing the documentation can usually help you find what you want.
- On Linux
Alt-+increases the font size for the help browser and running"Iosevka Custom" 25 set-listener-fontsets the font and its size for the listener.
Footnotes
-
The bytes represent the value of a particular character in ASCII although Unicode should work as well by changing how decoding is done. ↩
-
The types for tuple slots are entirely optional but are useful for documentation. ↩
-
A combinator is a word (function) that takes code as input usually in the form of a quotation which is an anonymous function. ↩
-
The first sequence is all the tokens and the second tokens is the same sequence without the first element. ↩
-
This would work for any tuple with a slot called
meaningfulnot just ourtransitiontuple. ↩