Writing a fast(er) youtube downloader

As a millennial, I never managed to embrace the streaming culture that's so prevalent nowadays. Having grown up in an era where being offline was the norm, I developed an unhealthy habit of just-in-case data hoarding that I can't seem to shake off. When youtube-dl suddenly started complaining about an "uploader id" error, I decided to take things into my own hands.

Understanding the streaming process

The recon step consisted of inspecting the HTTP traffic to see what I could find. Sure enough, the network tab showed some interesting requests when filtering by "medias":

Opening one of these in a new tab showed a MIME type error. The browser couldn't play it for some reason, and neither could mpv. I downloaded it and ran the file command on it, but it showed up as a data blob:

$ file videoplayback.mp4
videoplayback.mp4: data

I suspected that there was some encryption going on and that maybe the Youtube video player knew how to decrypt this file. But then I started wondering why there were so many of these HTTP requests to begin with. After some investigation, it turned out that the Youtube player loads the video in small portions and plays them in succession. It also has a similar process for audio. This led me to believe that I would need to reproduce this behavior. However, looking into the URL, I noticed a "range" parameter:

I'm not sure why Youtube handles this as a query parameter since HTTP already comes with a Range header, but I'm thinking that it's easier for them to cache partial resources this way. After removing range from the URL, the video loaded completely:

A similar trick worked for the audio stream. Now that I had these URLs, I could download them separately, then join them with something like ffmpeg. But that would be tedious to do every time I wanted to download a Youtube video. I snooped through the HTML response of a few youtube URLs and found a mention of the googlevideo.com domain in some of them:

(After running it through prettier)

The video URL has been available in the HTML all along. All I had to do was extract it and decode the \u0026 escape sequences into ampersands. Since it's embedded in Javascript code and not JSON (notice the missing quotes around the keys), I decided to use regex against my better judgement:


class SimpleYoutubeVideoURLExtractor : YoutubeVideoURLExtractor
{
    this(string html)
    {
        this.html = html;
        parser = createDocument(html);
    }

    override string getURL(int itag = 18)
    {
        return html
            .matchOrFail(`"itag":` ~ itag.to!string ~ `,"url":"(.*?)"`)
            .replace(`\u0026`, "&");
    }
}
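
Using it looks something like this (a quick sketch; itag 18 is the classic 360p MP4 stream that carries both video and audio):

// html holds the watch page's response body, fetched beforehand
auto extractor = new SimpleYoutubeVideoURLExtractor(html);
string url = extractor.getURL(18);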

This worked for many videos, but there were a few stubborn exceptions. Looking into it, I realized that it was mostly VEVO videos. Hmm... Probably some DRM protection. The way these are handled is that Youtube adds an s parameter to the URL, which has to be reverse engineered. The URL itself now lives in a signatureCipher field:

signatureCipher contains three parameters: s, sp and url.

I'll spare you the details of how I worked out what they do, but in short, Youtube applies a series of transformations (of three types, listed below) to the s parameter, then appends the result to url as a query parameter whose name is given by sp. For example: if sp=sig, then the decrypted s value is appended to url as sig=....

As for the transformations, there are three types:

  • Reversing the cipher
  • Swapping the first character of the cipher with the character at the index given by the transformation's second argument
  • Removing the first N characters from the cipher

The problem with these transformations is that they differ for each URL, so I couldn't just extract the sequence once, translate it into D code and call it a day. The extraction had to happen at runtime... with regex:
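
To make the three operations themselves concrete, here's a minimal sketch that applies them in a hard-coded order. The order and the numeric arguments are made up for illustration, since the real sequence is extracted from base.js at runtime:

string decipher(string s)
{
    import std.algorithm.mutation : reverse, swap;

    char[] cipher = s.dup;

    // reverse the cipher
    cipher.reverse();

    // swap the first character with the one at a given index (39 is made up)
    swap(cipher[0], cipher[39 % cipher.length]);

    // remove the first N characters (2 is made up)
    cipher = cipher[2 .. $];

    return cipher.idup;
}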

With that, we get a decrypted s parameter. All that's left is to append it to the video URL under whichever name comes in the sp parameter:


class AdvancedYoutubeVideoURLExtractor : YoutubeVideoURLExtractor
{
    private string baseJS;

    this(string html, string baseJS)
    {
        this.html = html;
        this.parser = createDocument(html);
        this.baseJS = baseJS;
    }

    override string getURL(int itag = 18)
    {
        string signatureCipher = findSignatureCipher(itag);
        string[string] params = signatureCipher.parseQueryString();
        auto algorithm = EncryptionAlgorithm(baseJS);
        string sig = algorithm.decrypt(params["s"]);
        return params["url"].decodeComponent() ~
            "&" ~
            params["sp"] ~
            "=" ~
            sig;
    }
    //...
}

Wrapping it in an executable

Now that the parsing logic is handled, all that's left is to download the URL into a local file. Luckily, Youtube didn't have measures against third-party clients, so I simply used D's cURL bindings. A bug I faced early on was that certain URLs go through redirects, which is something I neglected to handle in the code at first. It took me an embarrassing amount of time to figure out why the video had a Content-Length of 0, but once I realized my mistake, I fixed it by setting the followlocation CurlOption to true. Resuming previously stopped downloads was a matter of setting CurlOption.resume_from to the local file's size in bytes, then opening it in append mode:


    public void download(string destination, string url, string referer)
    {
        auto http = Curl();
        http.initialize();
        if(destination.exists)
        {
            writeln("Resuming from byte ", destination.getSize());
            http.set(CurlOption.resume_from, destination.getSize());
        }
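
        // follow redirects, otherwise redirected URLs end up with a Content-Length of 0
        // (the followlocation fix mentioned above)
        http.set(CurlOption.followlocation, 1);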


        auto file = File(destination, "ab");
        //...
    }

std.net.curl supports progress reporting by means of a callback that gets executed periodically during the transfer. I took advantage of this and calculated a progress percentage that I then displayed using \r to make it look like an animation:


// progress callback body: current = bytes downloaded so far, total = expected total size
if(current == 0 || total == 0)
{
    return 0;
}
auto percentage = 100.0 * (cast(float)(current) / total);
writef!"\r[%.2f %%] %.2f / %.2f MB"(percentage, current / 1024.0 / 1024.0, total / 1024.0 / 1024.0);
return 0;

Making it faster

As was the case for youtube-dl, my downloads were rate limited down to 60 kilobytes per second. I wasn't exactly sure how to bypass it, but after noticing that the web player downloads the video in parallel chunks, I figured that the API was probably designed to behave that way. Then again, someone did figure out why Youtube throttles download speeds in this Github issue. I looked at a few other youtube-dl Github issues and saw a bunch of slowdown-related tickets, so I figured that this was something that would require plenty of maintenance going forward. In the end, I decided not to deviate too much from the web player's behavior and to download the video parts in parallel. I'm downloading them in 4 separate chunks at this point, which gives a maximum speed of 240-ish kilobytes per second, but I'm planning on making it configurable.

Update of 01/01/2024: the downloader has been dethrottled in this PR. The approach I took involved extracting and evaluating (with duktaped) a piece of Javascript code from the base.js web player in order to decipher the n parameter of the video URL.

To do that, I first calculated the range of each chunk based on the video's Content-Length header. In case the total size isn't divisible by 4, the last chunk ends up being larger than the others, which is no biggie:


    private ulong[] calculateOffset(ulong length, int chunks, ulong index)
    {
        ulong start = index * (length / chunks);
        ulong end = chunks == index + 1 ? length : start + (length / chunks);
        if(index > 0)
        {
            start++;
        }
        return [start, end];
    }

    unittest
    {
        auto downloader = new ParallelDownloader("", "");
        ulong length = 23;
        assert([0, 5] == downloader.calculateOffset(length, 4, 0));
        assert([5 + 1, 10] == downloader.calculateOffset(length, 4, 1));
        assert([10 + 1, 15] == downloader.calculateOffset(length, 4, 2));
        assert([15 + 1, 23] == downloader.calculateOffset(length, 4, 3));
    }

This part was crucial because otherwise the video would become corrupted due to missing data in the middle. I was surprised to find out that the mp4 format was fickle like that, and this made it impossible for me to implement things like downloading a working video from a given offset in seconds. But with that out of the way, I then downloaded the four parts in parallel with D's parallel foreach:

public void download(string destination, string url)
{
    ulong length = url.getContentLength();
    writeln("Length = ", length);
    int chunks = 4;
    string[] destinations = new string[chunks];
    foreach(i, e; iota(0, chunks).parallel)
    {
        ulong[] offsets = calculateOffset(length, chunks, i);
        string partialLink = format!"%s&range=%d-%d"(url, offsets[0], offsets[1]);
        string partialDestination = format!"%s-%s-%d-%d.mp4.part.%d"(
            title, id, offsets[0], offsets[1], i
        ).sanitizePath();
        destinations[i] = partialDestination;
        //...
    }
    //...
}
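
For reference, here's one way the getContentLength helper could be written, as a minimal sketch built on a HEAD request with std.net.curl (the actual implementation may differ):

ulong getContentLength(string url)
{
    import std.conv : to;
    import std.net.curl : HTTP;

    auto http = HTTP(url);
    http.method = HTTP.Method.head;
    http.perform();
    // std.net.curl lowercases the response header keys
    return http.responseHeaders.get("content-length", "0").to!ulong;
}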

Nothhw tpe yawrve o oblems

Ah, I'm no stranger to the woes of multithreading.

The first issue I ran into was progress reporting: youtube-d was now reporting a quarter of the real size instead of the full size... because I inadvertently told it to. The second and more pressing issue was that the four progress reporting functions started writing on top of each other. This was fine for the most part, except in situations where a thread was lagging behind the others, which made the progress reporting look like there were occasional regressions.

To fix both problems, I refactored RegularDownloader to take a progress reporting callback as a constructor argument. And then, in the download function of ParallelDownloader, I created four instances of RegularDownloader and gave them a progress reporting callback that calculates the currently downloaded bytes by adding up the sizes of the four partial files:


//...
string[] destinations = new string[chunks];
foreach(i, e; iota(0, chunks).parallel)
{
    ulong[] offsets = calculateOffset(length, chunks, i);
    string partialLink = format!"%s&range=%d-%d"(url, offsets[0], offsets[1]);
    string partialDestination = format!"%s-%s-%d-%d.mp4.part.%d"(
        title, id, offsets[0], offsets[1], i
    ).sanitizePath();
    destinations[i] = partialDestination;

    new RegularDownloader((ulong _, ulong __) {
        if(length == 0)
        {
            return 0;
        }
        ulong current = destinations.map!(d => d.exists() ? d.getSize() : 0).sum();
        auto percentage = 100.0 * (cast(float)(current) / length);
        writef!"\r[%.2f %%] %.2f / %.2f MB"(percentage, current / 1024.0 / 1024.0, length / 1024.0 / 1024.0);
        return 0;
    }).download(partialDestination, partialLink, url);
}

With that, the four progress functions report the correct progress. They're still concurrently accessing the standard output, but it doesn't matter to me because they still print the correct result. Except that now, the reporting frequency has increased fourfold. It's not a bug, it's a feature ¯\_(ツ)_/¯

The parallel executions each write to a file whose name is formatted in such a way that the order is embedded within the file name. This allows me to concatenate the chunks later on in the correct order. Why not put the filenames in an ordered container? That would work as long as everything is downloaded while that data structure is in memory, but if the program crashes for some reason, I wouldn't be able to work out the correct order since the metadata isn't persisted anywhere. With this approach, previously broken downloads can resume from where the last execution left off.

Once all four threads have finished running, I compare the sum of the sizes of the four .part files with the expected video length to make sure they match. This ensures that the parts are only concatenated when the download has finished successfully. Once that's taken care of, the .part files are deleted.
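
In sketch form (illustrative names and structure, not the exact code from ParallelDownloader), that final step looks like this:

void assemble(string destination, string[] parts, ulong expectedLength)
{
    import std.algorithm : map, sum;
    import std.file : exists, getSize, remove;
    import std.stdio : File;

    ulong downloaded = parts.map!(p => p.exists ? p.getSize() : 0).sum();
    if(downloaded != expectedLength)
    {
        // something is missing or truncated, keep the .part files for a future resume
        return;
    }

    auto output = File(destination, "wb");
    foreach(part; parts) // parts is already ordered by the chunk index in the file name
    {
        foreach(chunk; File(part, "rb").byChunk(64 * 1024))
        {
            output.rawWrite(chunk);
        }
    }
    output.close();

    foreach(part; parts)
    {
        part.remove();
    }
}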

Quality of life improvements

With the core functionality in place, I added a few additional features, mainly to list the available formats and to get the video URL without downloading it. Right now I'm working on verbosity levels because youtube-d spits out a lot of debugging info. Once that's stabilized, I'll publish the very first release. In the meantime, you can grab a debug build from the "Build" action artifacts.

Edit: automated releases are now available. An optimized build is downloadable from the releases page.

Comments

  1. Thanks so much for this - it's amazing so far - but how do I specify what directory the video should be downloaded to? I'm on Windows and it's downloading all videos to the top level of my user directory, which is a bit awkward.

  2. Thanks, glad to hear it's working for you on Windows. I don't have access to a Windows machine at this moment to test this, but try adding the youtube-d directory (the one with both youtube-d.exe and libcurl.dll) to your path. This way it can be invoked from any directory you want

    1. Thanks for the reply but please re-read my comment. The issue isn't which directory I can invoke the commands from but rather which one the videos are downloaded to.

    2. No I understand, you probably meant something like youtube-d --output-directory /tmp

      The reason I mentioned the path trick is that there's a close equivalent that consists of running something like cd /tmp && youtube-d . This can be used as a temporary measure until the feature is implemented.

    3. Oh right, I didn't think to simply change the directory before executing the commands, and that does do the trick. Sorry, I'm a little rusty with command line tools. Thanks!

  3. I hadn't had any issues with the tool until just now. When I try to run any command on the video with v=Nc7eEXXSyZ0, I get the following error messages:

    Failed to parse encryption steps...
    Retry 1 of 2...
    Handling [URL]
    Cache miss, downloading HTML...
    base.js cache miss, downloading from youtube[dot]com/s/player/652ba3a2/player_ias.vflset/en_US/base.js
    Downloaded video HTML
    [Video Title]
    Key not found: formats

    For some reason, this video is encrypted (has a signatureCipher), but I'm not having this issue with the other encrypted videos I just tried it on or with a similar (but unencrypted) video on the same channel (v=iQiieJGA5Ws). Any idea how and when this can be fixed and whether there's any workaround I could use right now?

  4. Regarding my previous comment about the error messages, I think I figured out what the issue was (and it should be an easy fix): The JSON in the HTML only had an "adaptiveFormats" object but not a "formats" object, and the code seems to expect both (see lines 43 & 62 of parsers.d).

    Will the code also be able to handle cases where one of these objects contains only a single format object rather than an array of them? It's not clear to me that it will.

    Btw, the HTML of the video I was having an issue with strangely just changed to include both objects, so you won't be able to reproduce the issue with it. You might have to manually modify some HTML instead.

    1. Hmm true, I couldn't reproduce it but I see what you mean. I suspect that Google might be phasing out regular formats in favor of adaptiveFormats, though I hope that isn't the case. The latter only includes formats that support either video or audio, but not both at the same time. You're right about the code, I created a ticket for it and will introduce it in the upcoming release: https://github.com/azihassan/youtube-d/issues/68

  5. And now I'm getting timeout errors when the download is rate limited. This happened three times in a row (with both parallel and regular modes), so I just gave up and downloaded the video another way. Is there any way to keep this from ever happening? Timeouts should never happen with downloads IMO - if I think it's taking too long, I can kill the operation myself. Here's the output:

    Failed to solve N parameter, downloads might be rate limited...
    [4.14 %] 5.04 / 121.75 MBTimeout was reached on handle 1C706BCAB60

    1. OK so this one got me stumped if I'm being honest. I haven't run into this issue yet, so I'm not sure how I can reproduce it. Did it start happening just recently? I think the N parameter is a clue here, maybe the algorithm has changed as of late and they're purposely causing timeouts in requests that don't include a correct N parameter.

    2. Ok, the problem is that you've set curl to time out after 3 minutes - see line 56 of downloaders.d - and that's exactly what's happening. If you just remove this line, that should fix it since the default for this setting in curl is 0, which means it never times out. I hadn't ever seen this issue before but that was only because there hadn't ever been an issue with bypassing the rate limiting until now (and I've always used the parallel setting), so the download has always finished in under 3 minutes. But in the tests I just did, the throttling was so bad that it only got 12 MB before timing out each time. You never know how long it'll take (long videos could take hours) so there really shouldn't be any timeout at all IMO.

      FYI, here's a separate potential issue that I should've mentioned earlier: Right above the "Failed to solve N parameter" message, there was a "TypeError: undefined not callable" message.

    3. Oooh nice catch, thanks! I reproduced it on my machine, here's the corresponding ticket: https://github.com/azihassan/youtube-d/issues/69

      I couldn't reproduce the N parameter error using the base.js file mentioned in your other comment (youtube[dot]com/s/player/652ba3a2/player_ias.vflset/en_US/base.js). I did however run into a similar issue with a different base.js file (717a6f94). It was corrected as part of the following PR, which came some time after the 0.0.4 release: https://github.com/azihassan/youtube-d/pull/52

      Since I haven't run into N parameter solving issues yet, I'm going to assume that PR 52 solved it, but I'll keep an eye out for similar problems since Youtube tends to change things around every once in a while.
