SOCware Tools

GitHub: A C-based Implementation of the Vigenère Cipher.

Background

To skip the full background and to get straight to the Vigenère Cipher, please go here.

Cryptology, Cryptanalysis and History Primer

Suprisingly or unsurprisingly, cryptology is a field of significant fixture in history, spanning centuries and continuing to evolve in-line with technological capabilities. The earliest-known manifestation of cryptography can be found with egyptian hyroglyphs, an inscription technique from approxmately 3000 BC. This had adopted a codebook of sorts to map plaintext to (fixed) enciphered symbols.

Since the supposed much-contended, first-recorded instance of plaintext recovery in the ninth century (UCL Article), the complexity of enciphering techniques have improved greatly with the tradecraft of civilisations. This brings about the nextmost popular historical cipher, the Caesar Cipher, in 100 BC. This is the first-documented instance of a substitution cipher, wherein the characters of a given plaintext are replaced with alternate characters based upon a dynamically-decided key. In the context of the Caesar Cipher, the key is an integer representing the number of positions in the alphabet in which to ‘shift’ each plaintext character by (all of which are within the latin alphabet). This is decided ahead of tine between confidants. For instance, suppose we decide on the key $3 </mrow>$ - which can be described as a Caesar shift of $3 </mrow>$ or ROT3 (shorthand for rotate by $3 </mrow>$ );

Plaintext:  | a | b | c | d | e | f | [...]
Ciphertext: | d | e | f | g | h | i | [...]

Here, each character in the plaintext - let’s use Hello - is shifted right by three places to yield Khoor. Conversely, to recover the original plaintext, the ciphertext can be simply shifted back by three places;

Ciphertext: | a | b | c | d | e | f | [...]
Plaintext:  | x | y | z | a | b | c | [...]

As you can see, if we break outside of the alpabetic space (beyond Z or before A), this wraps around either from A or Z. Mathematically, encipherment and decipherment can be expressed as follows, supposing that $M </mrow>$ corresponds to a character of the original plaintext, $C </mrow>$ the enciphered character, and $K </mrow>$ the key;

C = (M + K) mod 26

M = (C - K) mod 26

The $\mod 26 </mrow>$ is modular arithmetic that ensures that $C </mrow>$ / $M </mrow>$ remains within the alphabetic space and wraps around, with $26 </mrow>$ being the total length of the alphabet. In practice - at least millenia ago - this effect was achieved through a wheel/disc.

While trivial in modern day, this was once possibly used to conceal strategically and operationally-significant information for historical events including The Gallic War, though more primarily was adopted for private correspondence (Ancient Cybersecurity II). This is monoalphabetic in nature, meaning that all plaintext characters have a fixed replacement structure - that is, with $K = 3 </mrow>$ , A is always guaranteed to correspond to D, B to E and so on until a new key is decided upon. This presents concerns that, should the key be attained by adversaries, this is always guaranteed to decipher the ciphertext back into the original plaintext.

In spite of an unknown key to adversaries, it is also possible to employ other cryptanalytic techniques to completely compromise the cipher. Most popularly, brute force and frequency analysis. The first technique is possible given, in actuality, there only exist 26 possible keys in which to shift the ciphertext characters by - and, given all characters are shifted by the same key, it takes little time to systematically compute all possible plaintexts. Take the previous example Khoor;

K = 1 -> Lipps | ✗
K = 2 -> Ifmmp | ✗
K = 3 -> Hello | ✓

Given how small the keyspace is, this can also be conducted by hand albeit more tedious. Mathematically, the effort or keyspace (that is, the set of all possible keys), can be expressed as $26! </mrow>$ (26 factorial), which simply covers all integers that are less than or equal to 26.

Frequency analyis is also feasible with the Caesar Cipher. Given the insufficient confusion and diffusion, crytographic properties representing the relationship between plaintext-ciphertext, in part or in full the plaintext can be inferred. How this is done is by inspecting ciphertext (assuming this is a ciphertext-only attack) for conventional language patterns. Take our previous ciphertext Khoor. Note the oo here, which corresponds to the plaintext ll. The double-l is one of several double-letter digraphs in the field of orthography alongside double-e and double-o, which we can use to limit the possible characters o can correspond to. We can look for other digraphs, which are combinations of two vowels/consonants (i.e., ‘-ch’). There are a variety of other occurrences which one could check for (namely, bigraphs, diphthongs, so forth), of which modern day tooling can readily analyse masses of text for.

We can also look for frequently-occurring characters, and correlate these to the most prominent letters in the respective alphabet. In the case of the English language, the letter E occurs the most among its counterparts in many texts - this is conveyed in a 2003-2004 sampling of 40,000 words by Cornell University. Subsequently, assumptions can be made, should the ciphertext comprise sufficient enough words/characters, about particular letters and what these actually correspond to. Take the ciphertext “Khoor wkhuh, wklv lv flskhuwhaw surgxfhg iurp wkh Fdhvdu Flskhu”. Courtesy of dcode.fr, the letter ‘h’ has the highest (16.98%) frequency distribution, strongly implying this could be the plaintext letter ‘e’. This is confirmed by the original ciphertext “Hello there, this is ciphertext produced from the Caesar Cipher”.

The Caesar, or ROT, Cipher in practice is highly susceptible to compromise and as such shouldn’t be used in a serious capacity, as hopefully exhibited by the above cryptanalysis techniques. Woeful (past and present) adopters in the wild;

(Most Popularly) Netscape Navigator
Fujifilm Multifunction Printers

Introduction to the Vigenère Cipher

OK - so why is the above relevant to the Vigenère Cipher?

The Vigenère Cipher, the name of which referencing the French-based diplomat Blaise de Vigenère but referencing primitives long-before established in 15th-century literature, is a polyalphabetic substitution cipher. This form of cipher, contrarily to the monoalphabetic kind, applies a seperate shift to each plaintext/ciphertext character giving each an individual replacement structure - as opposed to a single substitution applied to all characters. Therefore, instead of there being a possible 26 keys by which to encipher/decipher by, there exists 26 different keys for each character, which can be expressed as $26^{n}$ where $n$ equals the number of letters. As described in declassified NSA documents and Blaise de Vigenère’s text, a keyword was used as a design decision for memorability. The numeric representation of the number (in the 26-letter English alphabet, A $= 0$ , B $= 1$ , so on) is then used to rotate/shift the plaintext or ciphertext by.

Suppose we have the plaintext “Hello there, this is ciphertext produced from the Vigenere Cipher” and key “key”. Excluding punctuation, the substitution of each character occurs as follows;

Plaintext:  |  H  |  E  |  L  |  L  |  O  |  T  |  H  | [...]
Key:        |  K  |  E  |  Y  |  K  |  E  |  Y  |  K  | [...]
Key Digit:  |  10 |  4  |  24 |  10 |  4  |  24 |  10 | [...]
Ciphertext: |  R  |  I  |  J  |  V  |  S  |  R  |  R  | [...]

Notice that the length of the key is exceeded by the length of the plaintext. In cases like these, the key continues to repeat until this reaches the length of the plaintext - therefore, giving “keykeyk...”. Similarly to the Caesar Cipher, this is mathematically expressed as follows, except where $n$ corresponds to the current letter;

C_{n} = (M_{n} + K_{n}) mod 26

M_{n} = (C_{n} - K_{n}) mod 26

Comparatively to its monoalphabetic counterpart, this cipher offers significant improvements in cryptographic strength, and subsequently resistence to cryptanalytic efforts. The $26^{n}$ keyspace takes exponentially more time to traverse and exhaust - in computational time complexity terms, this is expressed as $O (26^{k})$ to indicate that the practical effort grows subsantially grows even if the length of $K$ is simply incremented. Moreover, the previous frequency analysis techniques explored would prove ineffective given the aforementioned language patterns would be less likely to manifest consistently - even occurrences like double-o could appear as, say, ‘og’ or ‘az’ depending on the key.

So, at first glance it appears that the Vigenère Cipher offers perfect secrecy - where the ciphertext and plaintext relationship is completely unclear - at least theoretically. However, this is not always the case with one glaring susceptibility - the key length (StackExchange).

The Vigenère Cipher remained unbreakable for centuries, used in historically-significant events including the 1861-5 American Civil War. The first documented (notwithstanding the undocumented work of Charles Babbage), contested, successful cryptanalytic technique for the cipher however was ascribed to Friedrich Kasiski in 1863 who, leveraging frequency analysis, could glean the length of the key used in encipherment - this is known as the Kasiski Test.

Now, why is ascertaining the length of the key critical to undermining the cipher’s cryptographic strength? This reduces the ciphertext recovery efforts particularly in brute-forcing, in that the added iterations required for attempting arbitrary key lengths on top of exhausting each key variation of a given length is no longer neccessary. Here, only exhausting keys of a single length is required, thereby reducing computational overheads. This is especially prudent for large, ciphertext-only scenarios.

How the aforementioned Kasiski Test works is by observing the distances between repeated portions of the ciphertext - substrings of, say, a length of 3 to rule out incidental occurrences, such as ABC (MTU). Suppose $K =$ ”KEY”, and $M =$ ”THE DOG WAS THE SUSPECT ALL ALONG” (arbitrary but beneficial for the example in terms of working with the key length). With $C =$ ”DLC NSE GEQ DLC CYQZIAD EJV EJYRE”, we can observe clear repeats for “DLC”, which in a known plaintext scenario we know deciphers to “THE”.

Observing the procedures for the test, we calculate the physical distance between the two “DLC”s as $9$ , which the greatest common divisor (GCD) will provide viable key length candidates for. Given there is one number (GCD requiring two or more numbers), we can simply find the factors - $1$ , $3$ (ah-ha!) and $9$ .

While effective here and in numerous cases, this soon becomes problematic as the ciphertext message and (hopefully) key length becomes substantial - particularly if the key length reaches that of the plaintext. That said, futher analytic techniques have since been realised since to suit a variety of scenarios. Namely and popularly, the Index of Coincidence for poly/monoalphabetic substitution cipher identification (truly testing the tenets of Kerckhoffs’ Principle), Hill-Climbing for resource tradeoff.

Proof-of-Concept Compilation

GNU/Linux using GCC

$ gcc vigenere.c -o vigenere
$ chmod +x vigenere
$ ./vigenere

On NT using the VS Developer Command Prompt

$ cl vigenere.c
$ .\vigenere.exe

Proof-of-Concept Usage

usage: ./vigenere [-h] "message" [-m MODE] [-k "KEY"]

positional arguments:
      message  specifies the message to encrypt/decrypt (A-Z, a-z).
      -m       encrypt/decrypt the subsequent message.
               (0 = encrypt, 1 = decrypt, 0 = default)
      -k       specifies the keyword to use (variable length, ASCII-only).

optional arguments:
      -h       displays help message and usage information.

Disclaimer

As signposted above, this is a classical cryptosystem, utilising a trivial primitive that the application of in serious context(s) is discouraged given its well-documented lack of resistence against modern computational (and even human-driven) analytical techniques. This should not be considered a crutch for ample security hygiene, and instead advice should be sought from appropriate guidelines - such as NIST’s FIP-approved cryptographic algorithms (SP 800-140Cr2 as of writing). As such, please do not expect this tool to meet your commercial needs; this is purely for research purposes.