An Unusual Rail Adventure

Part 1: Custom Encodings in Swift

Alexander Ignatov

Published in Stackademic · 12 min read · Aug 28, 2023

Idea

I used the national railways one day.

Of course, it was not the first, nor was it the last time I relied on their services to travel between towns. In fact, I often do, as the prices are the cheapest (compared to buses and even cars), and they also offer a sleeping wagon on the route to my hometown.

Rail transport in my country is notorious for its delays, which span minutes and oftentimes whole hours. That day was no different: shortly before the scheduled arrival we were informed that the train was running late. Only 20 minutes though, no biggie.

What fascinated me back then was an idea that came to my mind when my girlfriend showed me a website I had no idea existed. It was https://rovr.info, and it displayed in real time (1-second refresh rate) the actual delay of all arriving and departing trains for a given station. Its 90s-style interface made it look like this:

A screenshot of the mobile interface of ROVR.info

The thing that made it interesting for me is that the official site of the national railway company (БДЖ, usually transliterated in English as BDZ) does not have such functionality. But where does ROVR get this info from? It must have been commissioned by BDZ, although BDZ is not mentioned anywhere on the website and the copyright notice in the footer carries only a person’s name (L. Mihailov). (I later learned that station workers use this software, so I guess my hypothesis is somewhat correct.)

Anyway, I was both surprised and not surprised at all by the UI and UX of this site, knowing the condition of all state-owned software. Not to mention that it sometimes loses the ability to display anything. Also, if you don’t block location sharing, it can ask for permission every single second.

It was obvious that it wasn’t made for public use. And that’s where my idea emerged: it looked as if this simplistic HTML wouldn’t be so hard to parse.

And it turned out I was right… and wrong, at the same time.

Reverse engineering

So, the HTML structure itself is indeed quite straightforward: just a <table> with rows <tr> and columns <td>. For each train there are 3 rows with a few columns of usable information:

Selecting a few different <td> elements with train schedule info.

The request that gives us this data is a simple POST request to https://rovr.info with a few fields in the body. Right-clicking on the request in the network section of the dev tools gives us a curl command that we can tinker with:

curl 'https://rovr.info/' \
-X 'POST' \
-H 'Content-Type: application/x-www-form-urlencoded' \
-H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'Accept-Language: en-GB,en;q=0.9' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Sec-Fetch-Mode: navigate' \
-H 'Host: rovr.info' \
-H 'Origin: https://rovr.info' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5.2 Safari/605.1.15' \
-H 'Referer: https://rovr.info/' \
-H 'Content-Length: 59' \
-H 'Connection: keep-alive' \
-H 'Sec-Fetch-Dest: document' \
--data 'orientation=L&mobver=1&active_View=2&station_id=18&scrpos=0'

Among the properties in the body of the request, only station_id catches the eye. Apparently, 18 is the ID of the Sofia central station. But how would I get all of them?

Luckily, there is a <select> with an <option> for every station, binding each one to its respective ID:

The <select> with all supported stations, bound to their IDs.

A simple Find+Replace with regex enabled would be enough for me to extract all of the IDs in the format I need (an enum in the code, etc).
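The same extraction can also be sketched in code instead of an editor's Find+Replace. This is only an illustration: the sample markup, the second station and the extractStations helper are made up for the example, and the real <option> attributes on the site may differ.

```swift
import Foundation

// Hypothetical sample markup; only the ID 18 comes from the article.
let sampleHTML = """
<select name="station_id">
<option value="18">Sofia</option>
<option value="42">Some other station</option>
</select>
"""

/// Returns an (id, name) pair for every <option value="N">Name</option> found.
func extractStations(from html: String) -> [(id: Int, name: String)] {
    let pattern = #"<option\s+value="(\d+)"[^>]*>([^<]*)</option>"#
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let fullRange = NSRange(html.startIndex..., in: html)
    return regex.matches(in: html, range: fullRange).compactMap { match -> (id: Int, name: String)? in
        guard
            let idRange = Range(match.range(at: 1), in: html),
            let nameRange = Range(match.range(at: 2), in: html),
            let id = Int(html[idRange])
        else { return nil }
        let name = html[nameRange].trimmingCharacters(in: .whitespaces)
        return (id: id, name: name)
    }
}

let stations = extractStations(from: sampleHTML)
for station in stations {
    // Emits lines that can be pasted into an enum declaration.
    print("case station\(station.id) = \(station.id) // \(station.name)")
}
```

A regex is good enough here because the list is flat and machine-generated; for anything more nested, a real HTML parser would be the safer choice.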

I tested a bit and replaced 18 in the above request with a different station ID and of course, I got a correct new HTML result for the corresponding station.
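Translated to Swift, the request could look roughly like the sketch below. It assumes the server only cares about the form body and the Content-Type header, which I have not verified; the other headers from the curl command are left out. The makeTimetableRequest name is made up for the example.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking // URLRequest lives in this module on Linux
#endif

/// Builds the POST request for one station's timetable.
/// Only `station_id` varies; the other fields are copied from the captured curl command.
func makeTimetableRequest(stationID: Int) -> URLRequest {
    var request = URLRequest(url: URL(string: "https://rovr.info/")!)
    request.httpMethod = "POST"
    request.setValue(
        "application/x-www-form-urlencoded",
        forHTTPHeaderField: "Content-Type"
    )
    let body = "orientation=L&mobver=1&active_View=2&station_id=\(stationID)&scrpos=0"
    request.httpBody = Data(body.utf8)
    return request
}

let timetableRequest = makeTimetableRequest(stationID: 18) // 18 = Sofia central station
```

The resulting request can then be handed to URLSession; the response body arrives as raw Data, which is where the encoding question comes in.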

The request that gives us the nearest station was a little more “user-unfriendly”, if I can put it that way:

curl 'https://rovr.info/rovrandr.php' \
-X 'POST' \
-H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
-H 'Accept: text/html, */*; q=0.01' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'Accept-Language: en-GB,en;q=0.9' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Host: rovr.info' \
-H 'Origin: https://rovr.info' \
-H 'Content-Length: 84' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5.2 Safari/605.1.15' \
-H 'Referer: https://rovr.info/' \
-H 'Connection: keep-alive' \
-H 'Sec-Fetch-Dest: empty' \
-H 'X-Requested-With: XMLHttpRequest' \
--data 'user_alive=0&nav_data=9999999999_714815963_42.{REDACTED}_23.{REDACTED}_null'

I have no idea what 9999999999_714815963 refers to, but the next two numbers are obviously my geographical location (which I redacted above to hide the decimal places for the sake of my privacy).

The response seems even more enigmatic:

1693124398 18 5568.1310114549 30690 5617.7951004903

The first number (1693124398) looks like a Unix timestamp, and if I convert it to a date I get exactly the moment I received this response.

The second number seems familiar — we discovered a bit earlier that 18 is the ID of the Sofia central station.

As for the other three, I ain’t got the faintest idea. If anyone has, I’d be interested in hearing it! We won’t need them though: the response already gave us what we requested, the nearest station to the location we sent.
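Pulling the two useful fields out of that response takes only a few lines. The sketch below assumes the fields are always separated by spaces, as in the sample above; the NearestStationResponse type and the parseNearestStation function are names made up for the example, and the three unexplained numbers are simply ignored.

```swift
import Foundation

struct NearestStationResponse {
    let timestamp: Date   // first field, interpreted as a Unix timestamp
    let stationID: Int    // second field, the ID of the nearest station
}

/// Parses the space-separated response; returns nil if the first two fields are missing or malformed.
func parseNearestStation(_ raw: String) -> NearestStationResponse? {
    let fields = raw.split(separator: " ")
    guard fields.count >= 2,
          let seconds = TimeInterval(fields[0]),
          let id = Int(fields[1])
    else { return nil }
    return NearestStationResponse(
        timestamp: Date(timeIntervalSince1970: seconds),
        stationID: id
    )
}

let nearest = parseNearestStation("1693124398 18 5568.1310114549 30690 5617.7951004903")
```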

So now we have everything we need to make a nice app out of this, right?

…right?

Encoding

As suggested by both the server response headers and the HTML <head>, the response is not using Unicode:

<meta http-equiv="content-type" content="text/html; charset=windows-1251">
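If the app should not hard-code the encoding, the charset can be read from the Content-Type value (the header and the meta tag carry the same string). A minimal sketch, where the charset(fromContentType:) helper is a made-up name and the parsing is deliberately naive:

```swift
/// Extracts the `charset` parameter from a Content-Type value, lowercased.
/// Returns nil when no charset parameter is present.
func charset(fromContentType value: String) -> String? {
    // Skip the media type itself ("text/html") and look at the parameters.
    for part in value.split(separator: ";").dropFirst() {
        let pair = part.split(separator: "=", maxSplits: 1)
        guard pair.count == 2 else { continue }
        let key = pair[0].filter { $0 != " " }.lowercased()
        if key == "charset" {
            // Drop spaces and optional quotes around the value.
            return pair[1].filter { $0 != " " && $0 != "\"" }.lowercased()
        }
    }
    return nil
}

let detectedCharset = charset(fromContentType: "text/html; charset=windows-1251")
```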

In theory this shouldn’t pose any problem, as Swift’s String initializer that takes Data also takes the encoding the data is in:

String(data: ..., encoding: .windowsCP1251)

However, this returns nil for the HTML document that https://rovr.info gives us.

But… why? This really is the correct encoding; even Visual Studio Code suggests that I switch to it and then renders the text correctly:

Visual Studio Code confirms that the HTML document is using Windows-1251 encoding.

I was left scratching my head.

The creators of the Swift language have decided that the initializer should be failable instead of throwing a descriptive error. I had no clue why it returned nil while the document is clearly using this encoding. Maybe there is a symbol somewhere that couldn’t be decoded? Why wouldn’t the initializer just put the good old replacement character � instead of the faulty character and return what it can?

I could find a suspicious string inside a meta tag of the document but it was clear that without cutting parts of the HTML document this initializer would be impossible to use. And iterating through the raw data bytes and figuring out where to start parsing seemed like a nightmare for me.

I searched for other ways of initializing a String from Data and I found an ‘elder brother’ of that init, String(decoding:as:):

This one accepts some collection of code units (Data is a collection and seems to do the job) and a type that conforms to the _UnicodeEncoding protocol. What would such a type be?

I am unfortunately unable to click on _UnicodeEncoding . That underscore screams ‘private’ in your face, and private it remains. Good job for Apple’s documentation again — how can I use something if I don’t even know what it is?

Searching for it in the docs seems impossible at first because the results are about other types and symbols. One has to really dig into what’s inside the Unicode namespace in order to find an article with the protocol requirements.

I really wished that there would already be a Windows1251 implementation of that protocol. Searching more in both the documentation and String’s Xcode-generated interface, I found that it is typealias-ed to Unicode.Encoding and has only 4 implementations: Unicode.ASCII, Unicode.UTF8, Unicode.UTF16 and Unicode.UTF32.

So, a dead-end again. At least for applying an out-of-the-box solution. Searching the web for custom solutions also wasn’t very helpful.
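For reference, this is how String(decoding:as:) behaves with one of the built-in encodings. The byte values are made up for the illustration; the point is that this initializer is not failable and repairs bad input with the replacement character, which is exactly the behaviour I was missing from the failable init.

```swift
// Well-formed input decodes as expected.
let asciiBytes: [UInt8] = [0x48, 0x69]            // "Hi"
let greeting = String(decoding: asciiBytes, as: UTF8.self)

// 0xFF can never appear in valid UTF-8, so it is replaced with U+FFFD
// and the rest of the input is still decoded.
let brokenBytes: [UInt8] = [0x48, 0xFF, 0x69]
let repaired = String(decoding: brokenBytes, as: UTF8.self)

print(greeting, repaired)
```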

At this point I knew I had to get my hands dirty trying to write my custom Windows1251 encoder/decoder.

Cowabunga it is.

Windows-1251

What even is this encoding?

Tech-savvy people using the Cyrillic alphabet might be familiar. Watching movies as a kid, I remember having to go to the settings of my media player from time to time and change the subtitles encoding to/from Windows-1251 so that they don’t for example look like this:

Ŕâňîěŕňčçčđŕíč Ńúîáůĺíč˙ çŕ Äâčćĺíčĺňî íŕ Âëŕęîâĺňĺ

and instead look like this:

Автоматизирани Съобщения за Движението на Влаковете

The first text uses the Windows-1250 encoding, which I guess was set as some sort of default for the software I used, since those sequences of letters look all too familiar to me.

But wait, 1251, now 1250, just how many of them exist? And what is (or was) their purpose if UTF-8/16/32 exist already?

Well, obviously, those UTF encodings didn’t exist back when multiple different character encodings were created to cover all the various letters, diacritics and symbols used in languages other than English. That is why Unicode later got its name: it presents a unified encoding solution.

All the encodings from the Windows-125x family are 8-bit encodings that extend the good old ASCII table. You see, ASCII is only 7-bit (a total of 128 symbols), which means there is one bit left in the character byte that could add another 128 symbols.

But one cannot pick a universal set of 128 symbols that can be used for every other language, so there are dozens of different 8-bit encodings which differ only by what symbols they have chosen to encode with that last available bit.

For example, here is the Windows-1252 encoding table:

Windows-1252 mapping (taken from Wikipedia)

One can clearly see the ASCII symbols remain from values 0x00 to 0x7F. Bytes 0x80–0xFF are for the new ones added by this encoding. This encoding (also called Code Page 1252) is used for Western European languages that use the Latin alphabet.

A Cyrillic mapping exists for languages like Russian, Bulgarian, Serbian, etc., which is encoded in Code Page 1251 in the following manner:

Windows-1251 mapping (taken from Wikipedia). Yellow bytes highlight the differences with Windows-1252.

Okay, so now we know that the letters of the Bulgarian alphabet are fully contained inside bytes 0xC0–0xFF. All we need in order to encode and decode them is their respective Unicode variants:

The Unicode block containing capital Cyrillic letters (taken from Wikipedia).

Fortunately enough, the mapping should be as easy as performing one simple arithmetic operation, because they appear to be in the very same order at positions 0x0410–0x044F of the Unicode table.
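Spelled out, that arithmetic is a constant offset of 0x0410 - 0xC0 = 0x0350 between a byte and its code point. A tiny sketch to check both ends of the range (the cyrillicOffset constant and the scalar(forCyrillicByte:) helper are names made up for the example):

```swift
/// Distance between a Cyrillic letter's Unicode code point and its Windows-1251 byte.
let cyrillicOffset: UInt32 = 0x0410 - 0xC0   // 0x0350

/// Maps a byte in 0xC0...0xFF to its Unicode scalar; other bytes do not follow this rule.
func scalar(forCyrillicByte byte: UInt8) -> Unicode.Scalar? {
    guard byte >= 0xC0 else { return nil }
    return Unicode.Scalar(UInt32(byte) + cyrillicOffset)
}

let firstLetter = scalar(forCyrillicByte: 0xC0)   // U+0410, capital "А"
let lastLetter = scalar(forCyrillicByte: 0xFF)    // U+044F, small "я"
```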

We are now ready to write our implementation.

Implementing it

We quickly find out that _UnicodeEncoding is not public and the compiler forbids us from using it. But as we already discovered, Unicode.Encoding is a typealias for it:

// Using an `enum` because everything will be `static`
// and this type will resemble a namespace
enum Windows1251: Unicode.Encoding {
    // TODO
}

This is now allowed and we need to conform to this protocol. It requires us to define which data type represents one code unit of the character encoding we are dealing with. We pick UInt8, which is exactly what we need, since Windows-1251 is an 8-bit encoding:

enum Windows1251: Unicode.Encoding {
    typealias CodeUnit = UInt8
}

Another requirement is to supply a forward and a reverse parser, each returning a Unicode.ParseResult that tells whether the next code units of the input form a valid scalar in our character encoding. Since I only needed decoding, I reused the same parser for both and defined it the following way:

enum Windows1251: Unicode.Encoding {
    typealias CodeUnit = UInt8
    typealias EncodedScalar = CollectionOfOne<CodeUnit>
    typealias ForwardParser = Parser
    typealias ReverseParser = Parser
    
    struct Parser: Unicode.Parser {
        typealias Encoding = Windows1251
        
        mutating func parseScalar<I>(
            from input: inout I
        ) -> Unicode.ParseResult<EncodedScalar>
        where I: IteratorProtocol, I.Element == CodeUnit {
            guard let raw = input.next() else {
                return .emptyInput
            }
            
            switch raw {
            case 0x98:
                // this is the only invalid byte in Windows-1251,
                // as can be seen in the mapping table above
                return .error(length: 1)
            default:
                return .valid(.init(raw))
            }
        }
    }
}

What’s left is of course the encoding and decoding methods.

The process of encoding means translating from Unicode scalars to the code units of our encoding (Windows-1251):

static func encode(_ content: Unicode.Scalar) -> EncodedScalar? {
    let uniCode = content.value
    switch uniCode {
    case 0x0410...0x044F:
        // The Cyrillic letters we need are in the range 0xC0...0xFF
        // in the same order as in Unicode 0x0410...0x044F
        // => just shift the value
        return .init(.init(exactly: uniCode - 0x0410 + 0xC0)!)
            
    // TODO: Handle cases for returning 0x80...0xBF
            
    case 0...0x7F:
        // Only ASCII is shared with Unicode. Code points 0x80...0xFF must not
        // be copied over as raw bytes (U+00C0 "À" would come out as byte 0xC0, i.e. "А")
        return .init(.init(uniCode))
        
    default:
        return nil
    }
}

Decoding is the reverse of this (meaning, from Windows-1251 to Unicode):

static func decode(_ content: EncodedScalar) -> Unicode.Scalar {
    let byte = content.first!
    switch byte {
    case 0xC0...0xFF:
        // same logic as encoding, but the other way around
        return .init(UInt32(byte) - 0xC0 + 0x0410)!
            
    // TODO: Convert 0x80...0xBF appropriately
            
    default:
        // The rest matches Unicode
        return .init(byte)
    }
}

The only other requirement we need to fulfill is to supply a static replacement character in case of bad bytes (I decided to just call fatalError() here for the sake of experimentation) and voila! We have our custom implementation of a Windows1251-to-Unicode mapping.

enum Windows1251: Unicode.Encoding {
    typealias CodeUnit = UInt8
    typealias EncodedScalar = CollectionOfOne<CodeUnit>
    typealias ForwardParser = Parser
    typealias ReverseParser = Parser
    
    struct Parser: Unicode.Parser {
        typealias Encoding = Windows1251
        
        mutating func parseScalar<I>(
            from input: inout I
        ) -> Unicode.ParseResult<EncodedScalar>
        where I: IteratorProtocol, I.Element == CodeUnit {
            guard let raw = input.next() else {
                return .emptyInput
            }
            
            switch raw {
            case 0x98:
                // this is the only invalid byte in Windows-1251,
                // as can be seen in the mapping table above
                return .error(length: 1)
            default:
                return .valid(.init(raw))
            }
        }
    }

    // The protocol declares this requirement as `static`
    static var encodedReplacementCharacter: EncodedScalar { fatalError() }

    static func encode(_ content: Unicode.Scalar) -> EncodedScalar? {
        let uniCode = content.value
        switch uniCode {
        case 0x0410...0x044F:
            // The Cyrillic letters we need are in the range 0xC0...0xFF
            // in the same order as in Unicode 0x0410...0x044F
            // => just shift the value
            return .init(.init(exactly: uniCode - 0x0410 + 0xC0)!)
                
        // TODO: Handle cases for returning 0x80...0xBF
                
        case 0...0x7F:
            // Only ASCII is shared with Unicode. Code points 0x80...0xFF must not
            // be copied over as raw bytes (U+00C0 "À" would come out as byte 0xC0, i.e. "А")
            return .init(.init(uniCode))
            
        default:
            return nil
        }
    }

    static func decode(_ content: EncodedScalar) -> Unicode.Scalar {
        let byte = content.first!
        switch byte {
        case 0xC0...0xFF:
            // same logic as encoding, but the other way around
            return .init(UInt32(byte) - 0xC0 + 0x0410)!
                
        // TODO: Convert 0x80...0xBF appropriately
                
        default:
            // The rest matches Unicode
            return .init(byte)
        }
    }
}

This obviously is not a complete implementation but it should do the job for my purposes of decoding and parsing that Bulgarian webpage.

Now I can just say:

String(decoding: ..., as: Windows1251.self)

and hope that it will successfully decode the whole HTML content into a parseable string.
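To try this outside the app, here is a self-contained sketch. It repeats a condensed variant of the type from above so that it compiles on its own, with two deviations that are my assumptions for the demo: the replacement character is "?" instead of fatalError(), and only ASCII is passed through when encoding. The sample bytes are made up.

```swift
enum Windows1251: Unicode.Encoding {
    typealias CodeUnit = UInt8
    typealias EncodedScalar = CollectionOfOne<CodeUnit>
    typealias ForwardParser = Parser
    typealias ReverseParser = Parser

    struct Parser: Unicode.Parser {
        typealias Encoding = Windows1251

        mutating func parseScalar<I>(
            from input: inout I
        ) -> Unicode.ParseResult<EncodedScalar>
        where I: IteratorProtocol, I.Element == CodeUnit {
            guard let raw = input.next() else { return .emptyInput }
            // 0x98 is the only unassigned byte in Windows-1251.
            return raw == 0x98 ? .error(length: 1) : .valid(CollectionOfOne(raw))
        }
    }

    // Assumption for this sketch: "?" stands in for anything unencodable.
    static var encodedReplacementCharacter: EncodedScalar { CollectionOfOne(0x3F) }

    static func encode(_ content: Unicode.Scalar) -> EncodedScalar? {
        switch content.value {
        case 0x0410...0x044F:
            return CollectionOfOne(UInt8(content.value - 0x0410 + 0xC0))
        case 0...0x7F:
            return CollectionOfOne(UInt8(content.value))
        default:
            return nil // 0x80...0xBF and everything else is not handled here
        }
    }

    static func decode(_ content: EncodedScalar) -> Unicode.Scalar {
        let byte = content[content.startIndex]
        if byte >= 0xC0 {
            return Unicode.Scalar(UInt32(byte) - 0xC0 + 0x0410)!
        }
        return Unicode.Scalar(byte) // still wrong for 0x80...0xBF, as noted above
    }
}

// "Влак 1" ("Train 1") in Windows-1251, followed by the invalid byte 0x98.
let sampleBytes: [UInt8] = [0xC2, 0xEB, 0xE0, 0xEA, 0x20, 0x31, 0x98]
let decodedText = String(decoding: sampleBytes, as: Windows1251.self)
print(decodedText)
```

The invalid byte does not abort the whole decoding: the parser reports an error for it and the string gets a U+FFFD in that position, while everything else is decoded normally.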

It finally works!

There is a certain amount of satisfaction in the process of going up the long unexplored pathway to end up on the top of the cliff, to behold and enjoy the view there.

A screenshot of my app that shows real-time train delays info.

The code is open-source and available here: https://github.com/yalishanda42/BDZ-Delays

Everyone is welcome to contribute to the encoder/decoder implementation or whatever they feel like helping with.

The type I created in this article is in BDZDelays/bdz-delays/Sources/CustomEncoding. I intend to put it in a separate repository in the future, along with implementations of other code pages. Please let me know if such a library exists already as I couldn’t find any.

You can also download the app on the App Store, although I have to note that for now it is only in Bulgarian.

Thanks for reading!

Stay tuned for the hypothetical upcoming part 2 where I’d like to preach about the architecture I experimented with for this project.
