Writing a fast(er) youtube downloader
As a millennial, I never managed to embrace the streaming culture that's so prevalent nowadays. Having grown up in an era where being offline was the norm, I developed an unhealthy habit of just-in-case data hoarding that I can't seem to shake off. When youtube-dl suddenly started complaining about an "uploader id" error, I decided to take things into my own hands.
Understanding the streaming process
The recon step consisted of inspecting the HTTP traffic to see what I could find. Sure enough, the network tab showed some interesting requests when filtering by "medias":
Opening one of these in a new tab shows a MIME type error. The browser can't play it for some reason and neither could mpv. I downloaded it and ran the file command on it, but it shows up as a data blob:
$ file videoplayback.mp4
videoplayback.mp4: data
I suspected that there was some encryption going on and that maybe the Youtube video player knows how to decrypt this file. But then I started wondering why there were many of these HTTP requests there to begin with. After some investigation, it turned out that the Youtube player loads the video in small portions and then plays them in succession. It also has a similar process for audio. This led me to believe that I would need to reproduce this behavior. However, looking into the URL, I noticed a "range" parameter:
I'm not sure why Youtube handles this as a query parameter since HTTP already comes with a Range header, but I'm thinking that it's easier for them to cache partial resources this way. After removing range from the URL, the video loaded completely:
A similar trick worked for the audio stream. Now that I had these URLs, I could download them separately then join them with something like ffmpeg. But that would prove to be tedious to do every time I wanted to download a Youtube video. I snooped through the HTML response of a few youtube URLs, and found a mention of the googlevideo.com domain in some of them:
(After running it through prettier)
The video URL has been available in the HTML all along. All I had to do was extract it and decode the \u0026 escape sequences into ampersands. Since it's embedded in Javascript code and not JSON (notice the missing quotes around the keys), I decided to use regex against my better judgement:
class SimpleYoutubeVideoURLExtractor : YoutubeVideoURLExtractor
{
    this(string html)
    {
        this.html = html;
        parser = createDocument(html);
    }

    override string getURL(int itag = 18)
    {
        return html
            .matchOrFail(`"itag":` ~ itag.to!string ~ `,"url":"(.*?)"`)
            .replace(`\u0026`, "&");
    }
}
This worked for many videos, but there were a few stubborn exceptions. Looking into it, I realized that it was mostly VEVO videos. Hmm... Probably some DRM protection. The way these are handled is that Youtube adds an s parameter to the URL, so we have to reverse engineer it. Also, it's now in a signatureCipher field:
signatureCipher contains three parameters: s, sp and url.
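Splitting such a field into its three parameters can be sketched as follows. This is only an illustration under a few assumptions: parseCipher is a hypothetical helper name (not youtube-d's actual parseQueryString), the sample values are made up, and every value is percent-decoded up front, which may differ from how the real code handles it.

```d
import std.algorithm : findSplit;
import std.array : split;
import std.uri : decodeComponent;

/// Splits a query-string-like value ("s=...&sp=...&url=...") into a map.
/// Hypothetical helper: each value is percent-decoded before being stored.
string[string] parseCipher(string signatureCipher)
{
    string[string] params;
    foreach (pair; signatureCipher.split("&"))
    {
        // findSplit yields (before, separator, after) around the first "="
        auto parts = pair.findSplit("=");
        params[parts[0]] = decodeComponent(parts[2]);
    }
    return params;
}
```

Calling parseCipher on a value like s=AB%3DCD&sp=sig&url=https%3A%2F%2Fexample.com%2Fv%3Fid%3D1 would give a map whose "sp" entry is sig and whose "url" entry is the decoded https://example.com/v?id=1.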
I'll spare you the details of how I worked out what they do, but the way it works is that Youtube expects a sequence of transformations to be applied to the s parameter, and the result is then appended to url under the name given by the sp parameter. For example: if sp=sig, then the decrypted s value is appended to url as sig=...
As for the transformations, there are three types:
- Reversing the cipher
- Swapping the first character of the cipher with the character at the index given by the transformation's second argument
- Removing the first N characters from the cipher
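The three transformations are small enough to sketch in a few lines of D. The names below (reverseCipher, swapWithFirst, stripLeft, Step, applySteps) are hypothetical and are not taken from youtube-d. Wrapping the swap index with a modulo is an assumption added here so that an out-of-range index cannot throw.

```d
import std.array : array;
import std.conv : to;
import std.range : retro;

/// Reverses the cipher.
string reverseCipher(string s)
{
    return s.retro.array.to!string;
}

/// Swaps the first character with the one at `index`.
/// Assumption: the index wraps around the cipher length.
string swapWithFirst(string s, size_t index)
{
    if (s.length == 0) return s;
    char[] a = s.dup;
    immutable i = index % a.length;
    immutable first = a[0];
    a[0] = a[i];
    a[i] = first;
    return a.idup;
}

/// Removes the first `n` characters.
string stripLeft(string s, size_t n)
{
    return n >= s.length ? "" : s[n .. $];
}

enum Kind { reverse, swapFirst, stripLeft }

/// One step of the sequence; `arg` is ignored for Kind.reverse.
struct Step
{
    Kind kind;
    size_t arg;
}

/// Applies the steps in order and returns the transformed cipher.
string applySteps(string s, const Step[] steps)
{
    foreach (step; steps)
    {
        final switch (step.kind)
        {
            case Kind.reverse:   s = reverseCipher(s); break;
            case Kind.swapFirst: s = swapWithFirst(s, step.arg); break;
            case Kind.stripLeft: s = stripLeft(s, step.arg); break;
        }
    }
    return s;
}
```

For instance, applying reverse, then a swap with index 2, then a strip of 1 to "abcdef" goes through "fedcba" and "defcba" and ends at "efcba".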
The problem with these transformations is that they differ for each URL, so I wasn't able to extract the sequence then translate it into D code and call it a day. This approach had to be done at runtime... with regex:
- Extract the base.js URL and download its contents
- Extract the s parameter from the signatureCipher field of a given itag
- Extract the obfuscation sequence from base.js between a = a.split(""); and return a.join("")
- Match each obfuscation step to the flip, left strip or first character swap transformations
- Sequentially apply the transformations to the s parameter
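The third step, pulling out whatever sits between the two markers, could look roughly like the sketch below. It assumes the markers appear close to how they are quoted above, allowing for optional whitespace around the assignment. obfuscationBody is a hypothetical name and the sample function in the usage note is invented, so the real base.js may well need a different pattern.

```d
import std.regex : matchFirst, regex;

/// Returns the statements found between `a = a.split("");` and
/// `return a.join("")`, or null when the markers are not found.
string obfuscationBody(string baseJS)
{
    // "s" lets the dot match newlines, in case the body spans several lines
    auto pattern = regex(`a\s*=\s*a\.split\(""\);(.*?)return a\.join\(""\)`, "s");
    auto m = baseJS.matchFirst(pattern);
    return m.empty ? null : m[1];
}
```

Given an invented input such as xy=function(a){a=a.split("");zz.ab(a,3);zz.cd(a,1);return a.join("")}; the function returns zz.ab(a,3);zz.cd(a,1); which can then be split on semicolons and matched against the three transformation types.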
With that, we get a decoded s parameter. All that's left is to append it to the video URL under whichever name the sp parameter specifies:
class AdvancedYoutubeVideoURLExtractor : YoutubeVideoURLExtractor
{
    private string baseJS;

    this(string html, string baseJS)
    {
        this.html = html;
        this.parser = createDocument(html);
        this.baseJS = baseJS;
    }

    override string getURL(int itag = 18)
    {
        string signatureCipher = findSignatureCipher(itag);
        string[string] params = signatureCipher.parseQueryString();
        auto algorithm = EncryptionAlgorithm(baseJS);
        string sig = algorithm.decrypt(params["s"]);
        return params["url"].decodeComponent() ~
            "&" ~
            params["sp"] ~
            "=" ~
            sig;
    }
    //...
}
Wrapping it in an executable
Now that the parsing logic is handled, all that's left is to download the URL into a local file. Luckily Youtube didn't have measures against 3rd party clients so I simply used D's cURL bindings. A bug I faced early on was that certain URLs go through redirects, which is something I neglected to handle in the code at first. It took me an embarrassing amount of time to figure out why the video had a Content-Length of 0 but once I realized my mistake, I fixed it by setting the follow_location CurlOption to true. Resuming previously stopped downloads was a matter of setting CurlOption.resume_from to the local file's size in bytes then opening it in append mode:
public void download(string destination, string url, string referer)
{
    auto http = Curl();
    http.initialize();
    if(destination.exists)
    {
        writeln("Resuming from byte ", destination.getSize());
        http.set(CurlOption.resume_from, destination.getSize());
    }

    auto file = File(destination, "ab");
    //...
}
std.net.curl supports progress reporting by means of a callback that gets periodically executed by the internal code. I took advantage of this and calculated a progress percentage that I then displayed using \r to make it look like an animation:
if(current == 0 || total == 0)
{
    return 0;
}
auto percentage = 100.0 * (cast(float)(current) / total);
writef!"\r[%.2f %%] %.2f / %.2f MB"(percentage, current / 1024.0 / 1024.0, total / 1024.0 / 1024.0);
return 0;
Making it faster
As was the case for youtube-dl, my downloads were rate limited down to 60 kilobytes per second. Not exactly sure how to bypass it, but after noticing that the web player downloads the video in parallel chunks, I figured that the API was probably designed to behave that way. Then again someone did figure out why youtube throttles download speeds in this Github issue. I looked at a few of youtube-dl's other Github issues and saw a bunch of slowdown-related tickets, so I figured that this is something that would require plenty of maintenance going forward. In the end, I decided not to deviate too much from the web player's behavior by downloading the video parts in parallel. I'm downloading it in 4 separate chunks at this point, which gives a maximum speed of 240-ish kilobytes per second (4 times the 60 kilobytes per second limit), but I'm planning on making it configurable.
Update of 01/01/2024: the downloader has been dethrottled in this PR. The approach I took involved extracting and evaluating (with duktaped) a piece of Javascript code from the base.js web player with the aim of deciphering the n parameter of the video URL.
To do that, I first calculated the ranges of each chunk based on the video's Content-Length header. In case the total size isn't divisible by 4, the last chunk ends up being larger than the others which is no biggie:
private ulong[] calculateOffset(ulong length, int chunks, ulong index)
{
    ulong start = index * (length / chunks);
    ulong end = chunks == index + 1 ? length : start + (length / chunks);
    if(index > 0)
    {
        start++;
    }
    return [start, end];
}

unittest
{
    auto downloader = new ParallelDownloader("", "");
    ulong length = 23;
    assert([0, 5] == downloader.calculateOffset(length, 4, 0));
    assert([5 + 1, 10] == downloader.calculateOffset(length, 4, 1));
    assert([10 + 1, 15] == downloader.calculateOffset(length, 4, 2));
    assert([15 + 1, 23] == downloader.calculateOffset(length, 4, 3));
}
This part was crucial because otherwise, the video would become corrupted due to missing data in the middle. I was surprised to find out that the mp4 format was fickle like that, and this made it impossible for me to implement things like downloading a working video from a given offset in seconds. But with that out of the way, I then downloaded the four parts in parallel with D's parallel foreach:
public void download(string destination, string url)
{
    ulong length = url.getContentLength();
    writeln("Length = ", length);
    int chunks = 4;
    string[] destinations = new string[chunks];
    foreach(i, e; iota(0, chunks).parallel)
    {
        ulong[] offsets = calculateOffset(length, chunks, i);
        string partialLink = format!"%s&range=%d-%d"(url, offsets[0], offsets[1]);
        string partialDestination = format!"%s-%s-%d-%d.mp4.part.%d"(
            title, id, offsets[0], offsets[1], i
        ).sanitizePath();
        destinations[i] = partialDestination;
        //...
    }
    //...
}
Nothhw tpe yawrve o oblems
Ah, I'm no stranger to the woes of multithreading.
The first issue I ran into was progress reporting: youtube-d was now reporting a quarter of the real size instead of the full size... because I inadvertently told it to. The second and more pressing issue was that the four progress reporting functions started writing on top of each other. This was fine for the most part, except in situations where a thread is lagging behind the others, making the progress reporting look like there are occasional regressions.
To fix both problems, I refactored RegularDownloader to take a progress reporting callback as a constructor argument. And then, in the download function of ParallelDownloader, I created four instances of RegularDownloader and gave them a progress reporting callback that calculates the currently downloaded bytes by adding up the sizes of the four partial files:
//...
string[] destinations = new string[chunks];
foreach(i, e; iota(0, chunks).parallel)
{
    ulong[] offsets = calculateOffset(length, chunks, i);
    string partialLink = format!"%s&range=%d-%d"(url, offsets[0], offsets[1]);
    string partialDestination = format!"%s-%s-%d-%d.mp4.part.%d"(
        title, id, offsets[0], offsets[1], i
    ).sanitizePath();
    destinations[i] = partialDestination;
    new RegularDownloader((ulong _, ulong __) {
        if(length == 0)
        {
            return 0;
        }
        ulong current = destinations.map!(d => d.exists() ? d.getSize() : 0).sum();
        auto percentage = 100.0 * (cast(float)(current) / length);
        writef!"\r[%.2f %%] %.2f / %.2f MB"(percentage, current / 1024.0 / 1024.0, length / 1024.0 / 1024.0);
        return 0;
    }).download(partialDestination, partialLink, url);
}
With that, the four progress functions report the correct progress. They're still concurrently accessing the standard output, but it doesn't matter to me because they still gave the correct result. Except that now, the reporting frequency increased fourfold. It's not a bug, it's a feature ¯\_(ツ)_/¯
The parallel executions each write to a file whose name is formatted in such a way that the order is embedded within the file name. This allows me to concatenate the chunks later on in the correct order. Why not put the filenames in an ordered container? That would work if everything is downloaded while that data structure is in memory. But if it crashes for some reason, I wouldn't be able to work out the correct order if the metadata isn't persisted. With this approach, previously broken downloads can resume from where the last execution left off.
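Recovering the order from the file names alone might be sketched like this. It assumes the names end in .part.N as in the format string shown earlier. partIndex and inDownloadOrder are hypothetical names, not youtube-d functions.

```d
import std.algorithm : sort;
import std.conv : to;
import std.string : lastIndexOf;

/// Reads the numeric suffix after the last dot, e.g. 2 for "x.mp4.part.2".
size_t partIndex(string fileName)
{
    return fileName[fileName.lastIndexOf('.') + 1 .. $].to!size_t;
}

/// Returns a copy of the names ordered by their embedded chunk index.
string[] inDownloadOrder(string[] partFiles)
{
    auto sorted = partFiles.dup;
    sorted.sort!((a, b) => partIndex(a) < partIndex(b));
    return sorted;
}
```

Comparing the suffix as a number rather than as text matters once the chunk count becomes configurable, since a plain string sort would put .part.10 before .part.2.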
Once all four threads have finished running, I compare the sum of the sizes of the four .part files with the expected video length to make sure they match. This is to make sure that it's only concatenating the parts when the download has finished successfully. Once that's taken care of, the .part files are deleted.
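A minimal sketch of that check-then-concatenate step, using only std.file, is shown below. concatenateParts is a hypothetical name and the actual youtube-d code may be organized differently. It also assumes that parts is already in the right order and that each part fits in memory, since read loads the whole file at once.

```d
import std.algorithm : all, map, sum;
import std.file : append, exists, getSize, read, remove;

/// Joins `parts` into `destination` only when their sizes add up to
/// `expectedLength`. Returns false, leaving the parts untouched, otherwise.
bool concatenateParts(string[] parts, string destination, ulong expectedLength)
{
    if (!parts.all!(p => p.exists))
    {
        return false;
    }
    ulong total = parts.map!(p => p.getSize()).sum();
    if (total != expectedLength)
    {
        return false;
    }
    foreach (part; parts)
    {
        // append creates the destination on the first call
        append(destination, read(part));
    }
    foreach (part; parts)
    {
        remove(part);
    }
    return true;
}
```

Deleting the parts in a second loop, after every append has succeeded, means that a failure halfway through leaves all the .part files on disk for the next attempt.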
Quality of life improvements
With the core functionality in place, I added a few additional features, mainly to list the available formats and to get the video URL without downloading it. Right now I'm working on verbosity levels because youtube-d spits out a lot of debugging info. Once that's stabilized, I'll add the very first release. In the meantime, you can grab a debug build from the "Build" action artifacts.
Edit: automated releases are now available. An optimized build is downloadable from the releases page.
Thanks so much for this - it's amazing so far - but how do I specify what directory the video should be downloaded to? I'm on Windows and it's downloading all videos to the top level of my user directory, which is a bit awkward.
Thanks, glad to hear it's working for you on Windows. I don't have access to a Windows machine at this moment to test this, but try adding the youtube-d directory (the one with both youtube-d.exe and libcurl.dll) to your path. This way it can be invoked from any directory you want
Thanks for the reply but please re-read my comment. The issue isn't which directory I can invoke the commands from but rather which one the videos are downloaded to.
No I understand, you probably meant something like youtube-d --output-directory /tmp
The reason I mentioned the path trick is that there's a close equivalent that consists of running something like cd /tmp && youtube-d . This can be used as a temporary measure until the feature is implemented.
Oh right, I didn't think to simply change the directory before executing the commands, and that does do the trick. Sorry, I'm a little rusty with command line tools. Thanks!
I hadn't had any issues with the tool until just now. When I try to run any command on the video with v=Nc7eEXXSyZ0, I get the following error messages:
Failed to parse encryption steps...
Retry 1 of 2...
Handling [URL]
Cache miss, downloading HTML...
base.js cache miss, downloading from youtube[dot]com/s/player/652ba3a2/player_ias.vflset/en_US/base.js
Downloaded video HTML
[Video Title]
Key not found: formats
For some reason, this video is encrypted (has a signatureCipher), but I'm not having this issue with the other encrypted videos I just tried it on or with a similar (but unencrypted) video on the same channel (v=iQiieJGA5Ws). Any idea how and when this can be fixed and whether there's any workaround I could use right now?
Regarding my previous comment about the error messages, I think I figured out what the issue was (and it should be an easy fix): The JSON in the HTML only had an "adaptiveFormats" object but not a "formats" object, and the code seems to expect both (see lines 43 & 62 of parsers.d).
Will the code also be able to handle cases where one of these objects contains only a single format object rather than an array of them? It's not clear to me that it will.
Btw, the HTML of the video I was having an issue with strangely just changed to include both objects, so you won't be able to reproduce the issue with it. You might have to manually modify some HTML instead.
Hmm true, I couldn't reproduce it but I see what you mean. I suspect that Google might be phasing out regular formats in favor of adaptiveFormats, though I hope that isn't the case. The latter only includes formats that support either video or audio, but not both at the same time. You're right about the code, I created a ticket for it and will introduce it in the upcoming release: https://github.com/azihassan/youtube-d/issues/68
And now I'm getting timeout errors when the download is rate limited. This happened three times in a row (with both parallel and regular modes), so I just gave up and downloaded the video another way. Is there any way to keep this from ever happening? Timeouts should never happen with downloads IMO - if I think it's taking too long, I can kill the operation myself. Here's the output:
Failed to solve N parameter, downloads might be rate limited...
[4.14 %] 5.04 / 121.75 MBTimeout was reached on handle 1C706BCAB60
OK so this one got me stumped if I'm being honest. I haven't run into this issue yet, so I'm not sure how I can reproduce it. Did it start happening just recently? I think the N parameter is a clue here, maybe the algorithm has changed as of late and they're purposely causing timeouts in requests that don't include a correct N parameter.
Ok, the problem is that you've set curl to time out after 3 minutes - see line 56 of downloaders.d - and that's exactly what's happening. If you just remove this line, that should fix it since the default for this setting in curl is 0, which means it never times out. I hadn't ever seen this issue before but that was only because there hadn't ever been an issue with bypassing the rate limiting until now (and I've always used the parallel setting), so the download has always finished in under 3 minutes. But in the tests I just did, the throttling was so bad that it only got 12 MB before timing out each time. You never know how long it'll take (long videos could take hours) so there really shouldn't be any timeout at all IMO.
FYI, here's a separate potential issue that I should've mentioned earlier: Right above the "Failed to solve N parameter" message, there was a "TypeError: undefined not callable" message.
Oooh nice catch, thanks! I reproduced it on my machine, here's the corresponding ticket: https://github.com/azihassan/youtube-d/issues/69
I couldn't reproduce the N parameter error using the base.js file mentioned in your other comment (youtube[dot]com/s/player/652ba3a2/player_ias.vflset/en_US/base.js). I did however run into a similar issue with a different base.js file (717a6f94). It was corrected as part of the following PR, which came some time after the 0.0.4 release: https://github.com/azihassan/youtube-d/pull/52
Since I haven't run into N parameter solving issues yet, I'm going to assume that PR 52 solved it but I'll keep an eye out for similar problems since Youtube tends to change things around every once in a while.
Seems like Youtube has changed the system again, the links produced do not work. I tried adding &range=0-xxxxx to the link and it worked up to a certain size, very strange. I am stuck. What is going on?
Thanks for the heads up! There were some unreleased commits in the main branch, I just published a new release that includes them. Give it a try and let me know if the issue persists: https://github.com/azihassan/youtube-d/releases/tag/v0.0.5
With that being said, I noticed that e2e tests (that consist of downloading a video and checking its md5 checksum) now fail on Github because Youtube is blocking Github IP addresses. yt-dlp has been having similar issues lately from what I see. I'll soon add proxy support in an attempt to get around this, but it looks like Youtube is becoming more and more aggressive about this.
Hi there, the update did not help. Same problem, I could not compile the new version either (v0.0.5), ...\dmd\dmd2\windows\bin\dub build -b release
Error Failed to spawn process "git" (The system cannot find the file specified.) . I compiled previous version just fine. Again, you can download with the range parameter added, but just some of it, maybe some specific start-end numbers are required?
Oh OK I see thanks, I finally reproduced it by using a format other than 18. I usually use 18 for my downloads because it includes both audio and video, so I didn't realize that the other formats were failing. I opened an issue here, I'll be looking into it ASAP: https://github.com/azihassan/youtube-d/issues/83
Can you test yours with 18? I tested youtube-d -f 18 https://youtu.be/dQw4w9WgXcQ on my machine but it only works on 0.0.5: https://asciinema.org/a/40hFNxFdT3qQEgzIzkjgLerQD
Regarding the build issue, 0.0.5 had a change in how it's pulling the duktape library. Instead of pulling it from dub, it now pulls it from a Github repository. Because of that, it looks like it now requires git to be installed.
I'll try to find a way to circumvent it, but in the meantime, try using the prebuilt releases or installing git and rebuilding. Here's what the build process looks like on a Github Windows runner for reference: https://github.com/azihassan/youtube-d/actions/runs/10427867737/job/28883244591
Hi again. Yes, -f 18 works in 0.0.5, not 0.0.4. I tested it. Seems like it only downloads the standard (ratebypass) video/audio file. This file also seems to have a unique n number (&n=xxxxx... parameter) while the others have the same. Any clue there?
Yes the N parameter is a value that youtube uses to detect bots, the speed gets rate limited if the N value isn't calculated correctly.
As to the reason why 18 has a different N param, I think it's because youtube had one N value for the regular formats and another for the "adaptiveFormats" (those with separate audio and video streams). There were 2 regular formats in the past, one for 360p and one for 720p. However it looks like youtube are phasing out these regular formats. Since 18 is the only remaining regular format, I guess it makes sense for it to be different.
The reason why adaptiveFormats fail is as you guessed, it's missing the "range" parameter. I'm currently only doing a rangeless call in one place (for adaptiveFormats), and that is to send a HEAD request in order to retrieve the full size of the video. But since that's not supported any more, I need to get the video size from the HTML contents of the page. I'll test this change and let you know if it works.
Thanks again. The range aspect of this has really puzzled me a lot. I have fiddled around with different numbers (&range=x-y) , but i find no solution, not even if i extract the clen value (&clen=xxxxxx (filesize) parameter) , &range=0-filesize does not work. I can only get around 20 % of the files first part. Good luck finding out !! Cheers.
You're absolutely right, apologies for the misunderstanding. After changing the code to implement the change I mentioned above, I noticed that beyond a certain point, it only works for very small chunks. Like sub-500 bytes per chunk. Otherwise it fails with a 403.
I checked the yt-dlp project and it has similar issues. The web player now uses a proof of origin token (poToken) that can only be generated on a web browser. The current approach consists of using the iOS (or was it safari) player since it doesn't have these limitations, yet. I'm updating the issue I linked above as I find out more info.
Hi again. I tried the < 500 bytes option, no luck there...still, good luck !!