Writing a fast(er) youtube downloader

As a millennial, I never managed to embrace the streaming culture that's so prevalent nowadays. Having grown up in an era where being offline was the norm, I developed an unhealthy habit of just-in-case data hoarding that I can't seem to shake off. When youtube-dl suddenly started complaining about an "uploader id" error, I decided to take things into my own hands.

Understanding the streaming process

The recon step consisted of inspecting the HTTP traffic to see what I could find. Sure enough, the network tab showed some interesting requests when filtering by "medias":

Opening one of these in a new tab showed a MIME type error. The browser couldn't play it for some reason, and neither could mpv. I downloaded it and ran the file command on it, but it showed up as a data blob:

$ file videoplayback.mp4
videoplayback.mp4: data

I suspected that there was some encryption going on and that maybe the Youtube video player knew how to decrypt this file. But then I started wondering why there were so many of these HTTP requests to begin with. After some investigation, it turned out that the Youtube player loads the video in small portions and then plays them in succession. It does something similar for audio. This led me to believe that I would need to reproduce this behavior. However, looking into the URL, I noticed a "range" parameter:

I'm not sure why Youtube handles this as a query parameter since HTTP already comes with a Range header, but I'm thinking that it's easier for them to cache partial resources this way. After removing range from the URL, the video loaded completely:
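
Stripping that parameter is simple enough; something along these lines (illustrative only, not the exact code that ended up in the project):

import std.regex : regex, replaceAll;

// Illustrative: remove the "range" query parameter so the server
// returns the whole stream instead of a single chunk
string stripRange(string url)
{
    return url.replaceAll(regex(`&range=\d+-\d+`), "");
}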

A similar trick worked for the audio stream. Now that I had these URLs, I could download them separately, then join them with something like ffmpeg. But that would be tedious to do every time I wanted to download a Youtube video. I snooped through the HTML response of a few Youtube URLs and found a mention of the googlevideo.com domain in some of them:

(After running it through prettier)

The video URL has been available in the HTML all along. All I had to do was extract it and decode the \u0026 escape sequences into ampersands. Since it's embedded in Javascript code and not JSON (notice the missing quotes around the keys), I decided to use regex against my better judgement:


class SimpleYoutubeVideoURLExtractor : YoutubeVideoURLExtractor
{
    this(string html)
    {
        this.html = html;
        parser = createDocument(html);
    }

    // itag identifies the format; 18 is the standard 360p MP4 with audio included
    override string getURL(int itag = 18)
    {
        return html
            .matchOrFail(`"itag":` ~ itag.to!string ~ `,"url":"(.*?)"`)
            .replace(`\u0026`, "&");
    }
}
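
matchOrFail isn't shown above; it boils down to something like the following sketch (the exact error handling is beside the point):

import std.exception : enforce;
import std.regex : matchFirst, regex;

// Sketch of the matchOrFail helper: return the first capture group
// of the pattern, or fail loudly when there's no match
string matchOrFail(string haystack, string pattern)
{
    auto match = haystack.matchFirst(regex(pattern));
    enforce(!match.empty, "Failed to match " ~ pattern);
    return match[1];
}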

This worked for many videos, but there were a few stubborn exceptions. Looking into it, I realized that they were mostly VEVO videos. Hmm... Probably some DRM protection. For these, Youtube adds an s parameter to the URL that has to be reverse-engineered, and the URL itself now lives in a signatureCipher field:

signatureCipher contains three parameters: s, sp and url.

I'll spare you the details of how I worked out what they do, but the gist is that Youtube applies a series of transformations to the s parameter before appending the result to url under the name given by sp. For example: if sp=sig, then the decrypted s value is appended to url as sig=....

As for the transformations, there are three types:

  • Reversing the cipher
  • Swapping the first character of the cipher with the character at the index given by the transformation's second argument
  • Removing the first N characters from the cipher

The problem with these transformations is that they differ for each URL, so I wasn't able to just extract the sequence, translate it into D code, and call it a day. This had to be done at runtime... with regex.
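
To give an idea of what the runtime ends up doing, here's roughly what the three operations boil down to in D (a sketch with made-up helper names; the actual sequence and arguments are whatever gets extracted from base.js for that particular URL):

import std.conv : to;
import std.range : retro;

// Hypothetical helpers illustrating the three cipher operations;
// the real code has to map them onto the obfuscated names in base.js
string reverseCipher(string s)
{
    return s.retro.to!string;
}

string swapCipher(string s, size_t index)
{
    auto chars = s.dup;
    auto tmp = chars[0];
    chars[0] = chars[index % chars.length];
    chars[index % chars.length] = tmp;
    return chars.idup;
}

string spliceCipher(string s, size_t n)
{
    return s[n .. $];
}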

With that, we get a decoded s parameter. All that's left is to append it to the video URL under whatever name comes in the sp parameter:


class AdvancedYoutubeVideoURLExtractor : YoutubeVideoURLExtractor
{
    private string baseJS;

    this(string html, string baseJS)
    {
        this.html = html;
        this.parser = createDocument(html);
        this.baseJS = baseJS;
    }

    override string getURL(int itag = 18)
    {
        // Decipher "s" from the signatureCipher blob, then append it
        // to the URL under the name given by "sp"
        string signatureCipher = findSignatureCipher(itag);
        string[string] params = signatureCipher.parseQueryString();
        auto algorithm = EncryptionAlgorithm(baseJS);
        string sig = algorithm.decrypt(params["s"]);
        return params["url"].decodeComponent() ~
            "&" ~
            params["sp"] ~
            "=" ~
            sig;
    }
    //...
}
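
For completeness, the caller decides which extractor to use based on whether the page contains a signatureCipher field; roughly like this (simplified, the actual selection logic in youtube-d may differ):

import std.algorithm : canFind;

// Simplified selection: only reach for the cipher-aware extractor
// when the page actually serves ciphered URLs
YoutubeVideoURLExtractor makeExtractor(string html, string baseJS)
{
    if(html.canFind("signatureCipher"))
    {
        return new AdvancedYoutubeVideoURLExtractor(html, baseJS);
    }
    return new SimpleYoutubeVideoURLExtractor(html);
}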

Wrapping it in an executable

Now that the parsing logic is handled, all that's left is to download the URL into a local file. Luckily, Youtube didn't have measures against third-party clients, so I simply used D's cURL bindings. A bug I faced early on was that certain URLs go through redirects, which I neglected to handle in the code at first. It took me an embarrassing amount of time to figure out why the video had a Content-Length of 0, but once I realized my mistake, I fixed it by setting the follow_location CurlOption to true. Resuming previously stopped downloads was a matter of setting CurlOption.resume_from to the local file's size in bytes, then opening it in append mode:


    public void download(string destination, string url, string referer)
    {
        auto http = Curl();
        http.initialize();
        if(destination.exists)
        {
            writeln("Resuming from byte ", destination.getSize());
            http.set(CurlOption.resume_from, destination.getSize());
        }


        // Append mode, so a resumed download continues where the previous run stopped
        auto file = File(destination, "ab");
        //...
    }

std.net.curl supports progress reporting by means of a callback that gets periodically executed by the internal code. I took advantage of this and calculated a progress percentage that I then displayed using \r to make it look like an animation:


if(current == 0 || total == 0)
{
    return 0;
}
auto percentage = 100.0 * (cast(float)(current) / total);
writef!"\r[%.2f %%] %.2f / %.2f MB"(percentage, current / 1024.0 / 1024.0, total / 1024.0 / 1024.0);
return 0;
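
For context, this snippet is the body of the delegate that curl invokes periodically; with the low-level Curl struct from the download method above, the wiring looks roughly like this (a sketch; curl reports the download total first, then the current count, followed by the upload counters):

// Sketch: register the progress delegate on the curl handle
http.onProgress = delegate int(size_t total, size_t current, size_t ulTotal, size_t ulNow)
{
    // ... the percentage calculation shown above ...
    return 0;
};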

Making it faster

As was the case for youtube-dl, my downloads were rate-limited down to 60 kilobytes per second. I wasn't exactly sure how to bypass that, but after noticing that the web player downloads the video in parallel chunks, I figured that the API was probably designed to behave that way. Then again, someone did figure out why Youtube throttles download speeds in this Github issue. I looked at a few other youtube-dl Github issues and saw a bunch of slowdown-related tickets, so I figured that this is something that would require plenty of maintenance going forward. In the end, I decided not to deviate too much from the web player's behavior and to download the video parts in parallel. I'm downloading in 4 separate chunks at this point, which gives a maximum speed of 240-ish kilobytes per second, but I'm planning on making it configurable.

Update of 01/01/2024: the downloader has been dethrottled in this PR. The approach I took involved extracting and evaluating (with duktape) a piece of Javascript code from the base.js web player with the aim of deciphering the n parameter of the video URL.

To do that, I first calculated the ranges of each chunk based on the video's Content-Length header. In case the total size isn't divisible by 4, the last chunk ends up being larger than the others, which is no biggie:


    // Returns the inclusive [start, end] byte range of chunk number `index`;
    // the last chunk absorbs the remainder when length isn't divisible by chunks
    private ulong[] calculateOffset(ulong length, int chunks, ulong index)
    {
        ulong start = index * (length / chunks);
        ulong end = chunks == index + 1 ? length : start + (length / chunks);
        if(index > 0)
        {
            // Ranges are inclusive, so skip the byte already covered by the previous chunk
            start++;
        }
        return [start, end];
    }

    unittest
    {
        auto downloader = new ParallelDownloader("", "");
        ulong length = 23;
        assert([0, 5] == downloader.calculateOffset(length, 4, 0));
        assert([5 + 1, 10] == downloader.calculateOffset(length, 4, 1));
        assert([10 + 1, 15] == downloader.calculateOffset(length, 4, 2));
        assert([15 + 1, 23] == downloader.calculateOffset(length, 4, 3));
    }

This part was crucial because otherwise, the video would become corrupted due to missing data in the middle. I was surprised to find out that the mp4 format was fickle like that, and this made it impossible for me to implement things like downloading a working video from a given offset in seconds. But with that out of the way, I then downloaded the four parts in parallel with D's parallel foreach:

public void download(string destination, string url)
{
    ulong length = url.getContentLength();
    writeln("Length = ", length);
    int chunks = 4;
    string[] destinations = new string[chunks];
    // Download each chunk into its own .part file, named so the order survives a crash
    foreach(i, e; iota(0, chunks).parallel)
    {
        ulong[] offsets = calculateOffset(length, chunks, i);
        string partialLink = format!"%s&range=%d-%d"(url, offsets[0], offsets[1]);
        string partialDestination = format!"%s-%s-%d-%d.mp4.part.%d"(
            title, id, offsets[0], offsets[1], i
        ).sanitizePath();
        destinations[i] = partialDestination;
        //...
    }
    //...
}

Nothhw tpe yawrve o oblems

Ah, I'm no stranger to the woes of multithreading.

The first issue I ran into was progress reporting: youtube-d was now reporting a quarter of the real size instead of the full size... because I inadvertently told it to. The second and more pressing issue was that the four progress reporting functions started writing on top of each other. This was fine for the most part, except in situations where a thread was lagging behind the others, which made the progress reporting look like there were occasional regressions.

To fix both problems, I refactored RegularDownloader to take a progress reporting callback as a constructor argument. And then, in the download function of ParallelDownloader, I created four instances of RegularDownloader and gave them a progress reporting callback that calculates the currently downloaded bytes by adding up the sizes of the four partial files:


//...
string[] destinations = new string[chunks];
foreach(i, e; iota(0, chunks).parallel)
{
    ulong[] offsets = calculateOffset(length, chunks, i);
    string partialLink = format!"%s&range=%d-%d"(url, offsets[0], offsets[1]);
    string partialDestination = format!"%s-%s-%d-%d.mp4.part.%d"(
        title, id, offsets[0], offsets[1], i
    ).sanitizePath();
    destinations[i] = partialDestination;

    // Ignore the per-thread counters; progress is the combined size of all parts on disk
    new RegularDownloader((ulong _, ulong __) {
        if(length == 0)
        {
            return 0;
        }
        ulong current = destinations.map!(d => d.exists() ? d.getSize() : 0).sum();
        auto percentage = 100.0 * (cast(float)(current) / length);
        writef!"\r[%.2f %%] %.2f / %.2f MB"(percentage, current / 1024.0 / 1024.0, length / 1024.0 / 1024.0);
        return 0;
    }).download(partialDestination, partialLink, url);
}

With that, the four progress functions report the correct progress. They're still concurrently accessing the standard output, but it doesn't matter to me because they still print the correct result. Except that now, the reporting frequency has increased fourfold. It's not a bug, it's a feature ¯\_(ツ)_/¯

The parallel executions each write to a file whose name is formatted in such a way that the order is embedded within the file name. This allows me to concatenate the chunks later on in the correct order. Why not put the filenames in an ordered container? That would work if everything is downloaded while that data structure is in memory. But if the program crashes for some reason, I wouldn't be able to work out the correct order since the metadata isn't persisted. With this approach, previously broken downloads can resume from where the last execution left off.

Once all four threads have finished running, I compare the sum of the sizes of the four .part files with the expected video length to make sure they match; the parts only get concatenated when the download has finished successfully. Once that's taken care of, the .part files are deleted.
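
The concatenation step itself is straightforward; a stripped-down sketch of the idea (error handling and the exact helper names differ in the real code):

import std.algorithm : map, sum;
import std.file : getSize, remove;
import std.stdio : File;

// Sketch: verify that all parts are complete, stitch them together in
// chunk order, then clean up the intermediate .part files
void concatenate(string destination, string[] destinations, ulong expectedLength)
{
    ulong actualLength = destinations.map!(d => d.getSize()).sum();
    if(actualLength != expectedLength)
    {
        throw new Exception("Download incomplete, run the same command again to resume it");
    }

    auto output = File(destination, "wb");
    foreach(part; destinations) // destinations is already ordered by chunk index
    {
        foreach(chunk; File(part, "rb").byChunk(4096))
        {
            output.rawWrite(chunk);
        }
    }
    output.close();

    foreach(part; destinations)
    {
        part.remove();
    }
}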

Quality of life improvements

With the core functionality in place, I added a few extra features, mainly listing the available formats and getting the video URL without downloading it. Right now I'm working on verbosity levels because youtube-d spits out a lot of debugging info. Once that's stabilized, I'll publish the very first release. In the meantime, you can grab a debug build from the "Build" action artifacts.

Edit: automated releases are now available. An optimized build is downloadable from the releases page.

Comments

  1. Thanks so much for this - it's amazing so far - but how do I specify what directory the video should be downloaded to? I'm on Windows and it's downloading all videos to the top level of my user directory, which is a bit awkward.

  2. Thanks, glad to hear it's working for you on Windows. I don't have access to a Windows machine at this moment to test this, but try adding the youtube-d directory (the one with both youtube-d.exe and libcurl.dll) to your path. This way it can be invoked from any directory you want
