Ezalias / Eza's Tumblr Scrape

/*
The MIT License (MIT)

Copyright (c) 2013 Ezalias

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/

// ==UserScript==
// @name        Eza's Tumblr Scrape
// @namespace   https://inkbunny.net/ezalias
// @description Creates a new page showing just the images from any Tumblr 
// @license     MIT
// @include     http://*?ezastumblrscrape*
// @include     https://*?ezastumblrscrape*
// @include     http://*/ezastumblrscrape*
// @include     http://*.tumblr.com/
// @include     https://*.tumblr.com/
// @include     http://*.tumblr.com/page/*
// @include     https://*.tumblr.com/page/*
// @include     http://*.tumblr.com/tagged/*
// @include     https://*.tumblr.com/tagged/*
// @include     http://*.tumblr.com/archive
// @include     http://*.co.vu/*
// @exclude    *imageshack.us*
// @exclude    *imageshack.com*
// @grant        GM_registerMenuCommand
// @version     4.3
// ==/UserScript==


// Create an imaginary page on the relevant Tumblr domain, mostly to avoid the ridiculous same-origin policy for public HTML pages. Populate page with all images from that Tumblr. Add links to this page on normal pages within the blog. 

// This script also works on off-site Tumblrs, by the way - just add /archive?ezastumblrscrape?scrapewholesite after the ".com" or whatever. Sorry it's not more concise. 

// This script is kept up-to-date on the Greasy Fork network:
// https://greasyfork.org/en/scripts/4801-eza-s-tumblr-scrape



// Make it work, make it fast, make it pretty - in that order. 

// TODO: 
// I'll have to add filtering as some kind of text input... and could potentially do multi-tag filtering, if I can reliably identify posts and/or reliably match tag definitions to images and image sets. 
	// This is a good feature for doing /scrapewholesite to get text links and then paging through them with fancy dynamic presentation nonsense. Also: duplicate elision. 
	// I'd love to do some multi-scrape stuff, e.g. scraping both /tagged/homestuck and /tagged/art, but that requires some communication between divs to avoid constant repetition. 
// I should start handling "after the cut" situations somehow, e.g. http://banavalope.tumblr.com/post/72117644857/roachpatrol-punispompouspornpalace-happy-new
	// Just grab any link to a specific /post. Occasional duplication is fine, we don't care. 
	// Wait, shit. Every theme should link to every page. And my banavalope example doesn't even link to the same domain, so we couldn't get it with raw AJAX. Meh. It's just a rare problem we'll have to ignore. 
	// http://askleijon.tumblr.com/ezastumblrscrape is a good example - lots of posts link to outside images (mostly imgur) 
// I could detect "read more" links if I can identify the text-content portion of posts. links to /post/ pages are universal theme elements, but become special when they're something the user links to intentionally. 
	// for example: narcisso's dream on http://cute-blue.tumblr.com/ only shows the cover because the rest is behind a break. 
	// post-level detection would also be great because it'd let me filter out reblogs. fuck all these people with 1000-page tumblrs, shitty animated gifs in their theme, infinite scrolling, and NO FUCKING TAGS. looking at you, http://neuroticnick.tumblr.com/post/16618331343/oh-gamzee#dnr - you prick. 
	// Look into Tumblr Saviour to see how they handle and filter out text posts. 
// Should non-image links from images be gathered at the top of each 'page' on the image browser? E.g. http://askNSFWcobaltsnow.tumblr.com links to Derpibooru a lot. Should those be listed before the images?
	// I worry it'd pick up a lot of crap, like facebook and the main page. More blacklists / whitelists. Save it for when individual posts are detected. 
// ScrapeWholeSite: 10 pages at once by doing 10 separate xmlhttpwhatever objects, waiting for each to flip some bit in a 10-bool array? Clumsy parallelism. Possibly recursion, if the check for are-we-all-done-yet is in the status==4 callback. (Rough sketch just below.) 
	// I should probably implement a box and button for choosing lastpage, just for noob usability's sake. Maybe it'd only appear if pages==2. 
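	// Rough sketch of that bool-array / counter idea (illustration only - fetch_batch is hypothetical and never called; the real throttling lives in fetch_when_ready further down): 
function fetch_batch( urls, when_all_done ) {
	var remaining = urls.length; 		// The "are we all done yet" check: each status==4 callback decrements this. 
	var results = new Array( urls.length );
	urls.forEach( function( url, index ) {
		var request = new XMLHttpRequest();
		request.onreadystatechange = function() {
			if( request.readyState == 4 ) {
				results[ index ] = request.responseText; 		// Keep responses in page order regardless of arrival order. 
				remaining--;
				if( remaining == 0 ) { when_all_done( results ); } 		// Whichever request finishes last triggers the combined handler. 
			}
		}
		request.open( "GET", url, true ); 		// true = asynchronous, so all the requests run at once. 
		request.send();
	} );
}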
// Add a convenient interface for changing options? "Change browsing options" to unhide a div that lists every ?key=value pair, with text-entry boxes or radio buttons as appropriate, and a button that pushes a new URL into the address bar and re-hides the div. Would need to be separate from thumbnail toggle so long as anything false is suppressed in get_url or whatever. 
// Dropdown menus? Thumbnails yes/no, Pages At Once 1-20. These change the options_map settings immediately, so next/prev links will use them. Link to Apply Changes uses same ?startpage as current. 
	// Could I generalize that the way I've generalized Image Glutton? E.g., grab all links from a Pixiv gallery page, show all images and all manga pages. 
	// Possibly @include any ?scrapeeverythingdammit to grab all links and embed all pictures found on them. single-jump recursive web mirroring. (fucking same-domain policy!) 
// now that I've got key-value mapping, add a link for 'view original posts only (experimental).' er, 'hide reblogs?' difficult to accurately convey. 
	// make it an element of the post-scraping function. then it would also work on scrape-whole-tumblr. 
	// better yet: call it separately, then use the post-scraping function on each post-level chunk of HTML. i.e. call scrape_without_reblogs from scrape_whole_tumblr, split off each post into strings, and call soft_scrape_page( single_post_string ) to get all the same images. 
		// or would it be better to get all images from any post? doing this by-post means we aren't getting theme nonsense (mostly). 
	// maybe just exclude images where a link to another tumblr happens before the next image... no, text posts could screw that up. 
	// general post detection is about recognizing patterns. can we automate it heuristically? bear in mind it'd be done at least once per scrape-page, and possibly once per tumblr-page. 
// Add picturepush.com to whitelist - or just add anything with an image file extension? Once we're filtering duplicates, Facebook buttons won't matter. 
// user b84485 seems to be using the scrape-whole-site option to open image links in tabs, and so is annoyed by the 500/1280 duplicates. maybe a 'remove duplicates' button after the whole site's done?
	// It's a legitimately good idea. Lord knows I prefer opening images in tabs under most circumstances.  
	// Basically I want a "Browse Links" page instead of just "grab everything that isn't nailed down." 
// http://mekacrap.tumblr.com/post/82151443664/oh-my-looks-like-theres-some-pussy-under#dnr - lots of 'read more' stuff, for when that's implemented. 
// eza's tumblr scrape: "read more" might be tumblr standard. 
	// e.g. <p class="read_more_container"><a href="http://ladylovelycocks.tumblr.com/post/66964089115/stupid-comic-continued-under-readmore-more" class="read_more">Read More</a></p> 
	// http://c-enpai.tumblr.com/ - interesting content visible in /archive, but every page is 'themed' to be a blank front page. wtf. 
// "Scrape" link should appear in /archive, for consistency. Damn thing's unclickable on some themes. 
// why am I looking for specific domains to sort to the front? imgur, deviantart, etc. - just do it for any image that's not on *.tumblr.com, fool. 
// chokes on multi-thousand-page tumblrs like actual-vriska, at least when listing all pages. it's just link-heavy text. maybe skip having a div for every page and just append to one div. or skip divs and append to the raw document innerHTML. it could be a memory thing, if ajax elements are never destroyed. 
// multi-thousand-page tumblrs make "find image links from all pages" choke. massive memory use, massive CPU load. ridiculous. it's just text. (alright, it's links and ajax requests, but it's doggedly linear.) 
	// maybe skip individual divs and append the raw pile-of-links hypertext into one div. or skip divs entirely and append it straight to the document innerHTML.
	// could it be a memory leak thing? are ajax elements getting properly released and destroyed when their scope ends? kind of ridiculous either way, considering we're holding just a few kilobytes of text per page. 
	// try re-using the same ajax object. 
// Expand options_url to take an arbitrary list of key,value,key,value pairs. 
// Escape function in JS is encodeURI. We need 'safe' URLs as tag IDs. 
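// Minimal sketch of that 'safe tag ID' idea (safe_element_id is hypothetical and never called - the current code just uses raw URLs as element ids): 
function safe_element_id( url ) {
	return encodeURIComponent( url ).replace( /[^A-Za-z0-9]/g, "_" ); 		// e.g. "http://example.tumblr.com/page/2" -> "http_3A_2F_2Fexample_tumblr_com_2Fpage_2F2" 
}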
/* Assorted notes from another text file
. eza's tumblr scrape - testing open-loop vs. closed-loop updating for large tumblrs. caffeccino has 200-ish pages. from a cached state, and with stuff downloading, getting all 221 the old way takes 8m20s and has noticeable slowdown past 40-ish. new method takes 16m and is honestly not very fast from the outset. the use of a global variable might cause ugly locking. with js, who knows. 
. eza's tumblr fixiv? de-style everything by simply erasing the <style> block. 
. window.location has several sub-properties like pathname, search, and hash that may obviate string fuckery in eza's tumblr scrape (sketch just after this note block). - https://developer.mozilla.org/en-US/docs/Web/API/Location#wikiArticle
. eza's tumblr scrape - test finishing whole page for displaying updates. (maybe only on ?scrapewholesite.) probably not too smart, but an interesting benchmark. only ever one document.body.innerHTML=thing;. 
. eza's tumblr scrape - ideal form of ajax function is 'var onchange = function(){dostuff}; my_ajax_function( url, onchange )'. all that's ever different is what happens when the request state changes.
. eza's tumblr scrape - support text-scraping 100 pages at once. re-use pagesatonce? lastpage? needs a startpage anyway. (might need to come after fixing the options changer, since otherwise the 'scrape whole tumblr' link in the thumbnail view would have weird side-effects.) 
. eza's tumblr scrape: definitely do everything-at-once page write for thumbnail/browse mode. return one list of urls per page, so e.g. ten separate lists. remove duplicates between all lists. then build the page and do a single html write. prior to that, write 'fetching pages...' or something. it should be pretty quick. it's not like it's terribly responsive when loading pages anyway. scrolling doesn't work right. 
	. thinking e.g. http://whatdoesitlumpingmean.tumblr.com/archive?startpage=1?pagesatonce=10?find=/tagged/my-art?ezastumblrscrape?thumbnails which has big blue dots on every post.
	. see also http://herblesbians.tumblr.com/ with its gigantic tall banners  
	. alternate solution: check natural resolution, don't downscale tiny / narrow images. 
. eza's tumblr scrape: why doesn't the thumbnail page pick up mspadventures.com gifs? e.g. http://kitkaloid.tumblr.com/page/26, with tavros's face being 'dusted.' 
*/
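// Rough sketch of the window.location note above (location_parts_demo is illustration only, never called): 
function location_parts_demo() {
	var protocol = window.location.protocol; 		// "http:" or "https:" 
	var hostname = window.location.hostname; 		// "example.tumblr.com" 
	var pathname = window.location.pathname; 		// "/tagged/art", "/archive", etc. 
	var search = window.location.search; 		// everything from the first "?" up to any "#" 
	var hash = window.location.hash; 		// "#dnr" etc. 
	return protocol + "//" + hostname + pathname + search + hash; 		// Roughly reassembles the original href without any string fuckery. 
}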
// Soft_scrape_page should return non-image links from imgur, deviantart, etc., then collect them at the bottom of the div in thumbnail mode. 
	// links to embedded videos? linkdump at top, just below the /page link. looks like https://www.tumblr.com/video_file/119027046245/tumblr_nlt061qtgG1u32sbu/480 - e.g. manyakis.tumblr. 
// Tumblr has a standard mobile version. Fuck me, how long has that been there? example.tumblr.com/mobile, no CSS, bare image links. Shit on the fuck. 
	// Hey hey! This might allow trivial recognition of individual posts and reblogs vs. OC. Via, but no source... weak. Good enough for us, though. 
	// Every post is between a <p> and </p>, but can contain <p></p> blocks inside. Messy.
	// Reblogs say "(via <a href='http://example.tumblr.com/123456'>example</a>)" and original posts don't. (Sketch of that test after this block.) 
	// Dates are noted in <h2> blocks, but they're outside any <p> blocks, so who cares. 
	// Images are linked (all in _500, grr, but we can obviously deal with that) but posts aren't. Shame. That would've been useful. 
	// Shit, consider the basics... do tags work? Pagination is just /mobile/page/2, etc. with tags: example.tumblr.com/tagged/homestuck/page/2/mobile. 
	// Are photosets handled correctly? What about read-more links? Uuugh, photosets just appear as "[video]". Literally that text. No link. Fuck! So close, aaand useless. 
	// I can use /mobile instead of /archive, but there's no point. It breaks favicons and I still have to fetch the fat-ass normal pages. 
	// I can probably use mobile pages to match normal pages, since they... wait, are they guaranteed to have the same post count? yes. alice grove has one post per page. 
		// So to find original posts, I have to fetch both normal and mobile pages, and... shit, and consistently separate posts on normal pages. It has to be identical. 
	// I can also use mobile for page count, since it's guaranteed to have forward / backward links. Ha! We can start testing at 100! 
	// Adding /mobile even works on individual posts. You can get a via link from any stupid theme. Yay. 
		// Add "show via/source links" or just "visit mobile page" as a Greasemonkey action / script command? 
	// Tumblr's own back/forward links are broken in the mobile theme. Goes to: /mobile/tagged/example. Should go to: /tagged/example/mobile. Modify those pages. 
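	// Sketch of that reblog test (mobile_post_is_reblog is hypothetical, never called - assumes the "(via <a" marker really is consistent across mobile pages): 
function mobile_post_is_reblog( post_html ) { 		// post_html = one post's chunk of a /mobile page 
	return post_html.indexOf( "(via <a" ) > -1; 		// Reblogs carry "(via <a href='...'>example</a>)"; original posts don't. 
}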
// 	http://thegirlofthebeach.tumblr.com/archive - it's all still there, but the theme shows nothing but a 'fuck you, bye' message. 
	// Pages still provide an og:image link. Unfortunately, that's a single image, even for photosets. Time to do some reasoning about photoset URLs and their constituent image URLs. (Extraction sketch after this block.) 
	// Oh yeah - mobile. That gives us a page count, at least, but then photosets don't provide even that first image. 
	// Add a tag ?usemobile for using /mobile when scraping or browsing images. 
	// To do when /archive works and provides photosets in addition to /mobile images: http://fotophest.tumblr.com
	// Archive does allow seeking by year/month, e.g. http://fotophest.tumblr.com/archive/2012/4 
	// example.tumblr.com/page/1/mobile always points to example.tumblr.com. Have to do example.tumblr.com/mobile. Ugh. 
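	// Sketch of pulling that og:image URL out of a fetched post page (get_opengraph_image is hypothetical, never called - assumes the usual <meta property="og:image" content="..."> ordering): 
function get_opengraph_image( page_html ) {
	var match = page_html.match( /property=["']og:image["']\s+content=["']([^"']+)["']/ );
	return match ? match[1] : ""; 		// Empty string if the theme skips OpenGraph tags entirely. 
}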
// Archival note: since Tumblr images basically never disappear, it SHOULD suffice to save the full scrape of a blog into a text file. I don't need to temporarily mass-download the whole image set, as I've been doing. 
	// Add "tagged example" to title for ?find=/tagged/example, to make this easier. 
	// Make a browser for these text files. Use the image-browser interface to display ten pages at once (by opening the file via a prompt, since file:// would kvetch about same-origin policy.) Maintain page links. (FileReader sketch below.)
	// Filter duplicates globally, in this mode. No reason not to. 
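	// Sketch of that open-a-saved-scrape idea (browse_saved_scrape is hypothetical, never called - a file input plus FileReader sidesteps the file:// same-origin complaint): 
function browse_saved_scrape() {
	var picker = document.createElement( "input" );
	picker.type = "file";
	picker.onchange = function() {
		var reader = new FileReader();
		reader.onload = function() { console.log( reader.result.length + " characters of saved links loaded" ); }; 		// Placeholder - the real version would parse out URLs and feed them to the image browser. 
		reader.readAsText( picker.files[0] );
	};
	picker.click(); 		// Needs to run from a user gesture in most browsers. 
}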
// Use ?lastpage or ?end to indicate last detected page. (Indicate that it's approximate.) I keep checking ?scrapewholesite because I forget if I'm blowing through twenty pages or two hundred. 
// Given multiple ?key=value definitions on the same URL, the last one takes precedence. I can just tack on whatever multiple options I please. (The URL-generating function remains useful for cleanliness.) 
// Some Tumblrs (e.g. http://arinezez.tumblr.com/) have music players in frames. I mean, wow. Tumblr is dedicated to recreating every single design mistake Geocities allowed. 
	// This wouldn't be a big deal, except it means the address-bar URL doesn't change when you change pages. That's a hassle. 
// Images with e.g. _r1_1280 are revisions? See http://actual-vriska.tumblr.com/post/32788651941/ vs. its source http://cancerousaquarium.tumblr.com/post/32784513645/ - an obvious watermark has been added. 
	// Tested with a few random _r1 images from a large scrape's textfile. Some return XML errors ('no associated stylesheet') and others 404. Mildly interesting at best. 
// Aha - now that we have an 'end page,' there can be a drop down for page selection. Maybe also for pages-at-once? Pages 1-10, 11-20, 21-30, etc. Pages at once: 1, 5, 10, 20/25? 
// Terribly basic, but console.log instead of alert(). 
// Possibly separate /post/ links, since they'll obviously be posts from that page. (Ehh, maybe not - I think e.g. Promstuck links to a "masterpost" in the sidebar.) 
	// Maybe hide links behind a button instead of ignoring them entirely? That way compactness is largely irrelevant. 
	// Stick 'em behind a button? Maybe ID and Class each link, so we can GetElementsByClassName and this.innerText = this.href. 
	// If multi-split works, just include T1 / O1 links in-order with everything else. It beats scrolling up and guessing, even vs page themes that suck. 
		// No multi-split. I'd have to split on one term, then foreach that array and split for another term, then combine all resulting arrays in-order. 
		// Aha: there IS multi-splitting, using regexes as the delimiter. E.g. "Hello awesome, world!".split(/[\s,]+/); for splitting on spaces and commas. 
		// Split e.g. src=|src="|src=', then split space/singlequote/doublequote and take first element? We don't care what the first terminator is; just terminate. 
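		// Tiny sketch of that split-then-terminate approach (src_urls_demo is illustration only, never called): 
function src_urls_demo( html ) {
	var chunks = html.split( /src=["']/ ); 		// Regex delimiter handles src=" and src=' in one pass. 
	chunks.shift(); 		// First chunk is everything before the first src=, so toss it. 
	return chunks.map( function( chunk ) { return chunk.split( /[\s"']/ )[0]; } ); 		// Terminate at the first space, singlequote, or doublequote - whichever comes first. 
}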
// How do YouTube videos count? E.g. http://durma.tumblr.com/post/57768318100/ponponpon%E6%AD%8C%E3%81%A3%E3%81%A6%E3%81%BF%E3%81%9F-verdirk-by-etna102
	// Another example of off-site video: http://pizza-omelette.tumblr.com/post/44128050736/2hr-watch-this-its-very-important-i-still
// Some themes have EVERY post "under the cut," e.g. http://durma.tumblr.com/. Photosets show up. Replies to posts don't. ?usemobile should get some different stuff. 
// Oh hell, offsite images aren't sorted to the front anymore. E.g. imgur gifs on http://aikostable.tumblr.com/archive?startpage=3?pagesatonce=1?ezastumblrscrape?thumbnails
	// Also page/4 - http://aikostable.tumblr.com/post/103909928836/twi-and-shining-clop-animation - links to http://t.umblr.com/redirect?z=https%3A%2F%2Fe621.net%2Fpost%2Fshow%2F564766&t=YjI4NzdhNDVlMmNmODU0NjgwZDMwNjY1YTYxMzgyYjRiM2RhOGVjZSxNQUZObFlrNQ%3D%3D - shows up in ?showlinks, I guess. 
// Could probably combine some fetch code into a single subroutine if I do conditionals against the passed URL (e.g., contains 'photoset') and then pass back an array. 
// I'd have to redo the soft-scrape again in order to split posts. Even as-is, I should probably be splitting on '<a' and '<img'. 
// New soft-scrape function makes http://ket3.tumblr.com/archive?ezastumblrscrape?scrapewholesite?find=/tagged/my-art create an empty dropdown menu. Try split('http') instead? 
// Offsite / linked images are still sorted backwards. E.g. on http://n4ut.tumblr.com/archive?startpage=31?pagesatonce=1?thumbnails?find=/tagged/my-art?ezastumblrscrape
// Thumbnails shouldn't require a reload. foreach img, change size. 
// Brute-force method: ajax every single /post/ link. Thus - post separation, read-more, possibly other benefits. Obvious downside is massive latency increase. 
	// On the other hand, it's not like we're being kind to Tumblr for images. Would they even notice 10x as many hits for mere HTML? 
	// Test on e.g. akitsunsfw.tumblr.com with its many read-more links. 
	// Probably able to identify new posts vs. reblogs, too. Worth pursuing. At the very least, I could consistently exclude posts whose mobile versions include via/source. 
// <!-- GOOGLE CAROUSEL --><script type="application/ld+json">{"@type":"ItemList","url":"http:\/\/crystalpei.com\/page\/252","itemListElement":[{"@type":"ListItem","position":1,"url":"http:\/\/crystalpei.com\/post\/30638270842\/actually-this-is-a-self-portrait"}],"@context":"http:\/\/schema.org"}</script>
	// <link rel="canonical" href="http://crystalpei.com/page/252" />
	// Does this matter? Seems useful, listing /post/ URLs for this page. 
// Trying out single page updates, i.e., appending bulk_update to mass_bulk_update until all pages are loaded. Do we really care about partial scrapes?
	// Possible alternative: make all divs immediately, then fill them asynchronously. 
	// Instead of 'scraping page,' maybe 'pages scraped?' 'pages remaining?' Or list an array, and show 'waiting on pages x, y, z.' Each page removes itself as it finishes.
// Consider shotgunning the find-last-page function. E.g.: pages 1, 10, 100, 1000, 10,000, 100,000. Instantly ballpark how large it is. Maybe do finer enumeration before scrape? 
// 10,000-page tumblrs are failing with ?showlinks enabled. the text gets doubled-up. is there a height limit for firefox rendering a page? does it just overflow? try a dummy function that does no ajax and just prints many links. 
	// There IS a height limit! I printed 100 lines of dummy text for 10,000 pages, and at 'page' 8773, the 27th line is halfway offscreen. I cannot scroll past that. 
	// Saving as text does save all 10,000 pages. So the text is there... Firefox just won't render it. 
	// Reducing text size (ctrl-minus) resulted in -less- text being visible. Now it ends at 8729.  
// http://shegnyanny.tumblr.com/ redirects to https://www.tumblr.com/dashboard/blog/shegnyanny - from every page. But it's all still there! Fuck, that's aggravating! 
// Could have fit-width and fit-height options like Eza's Pixiv Fixiv... but ugh, the body.className shenanigans. CSS is the devil. 
// I need a 'repeatedly click 'show more notes' function' and maybe that should be part of this script. 
// DaveJaders needs that dashboard scrape treatment. 
// I don't think tumblr actually cares about the subdomain for files. Like 66.media.tumblr.com/tumblr_etc.jpg is the same as 67.media.tumblr.com/tumblr_etc.jpg. They're just cdn numbers handed out possibly at random. Even losing the number entirely works, although Tumblr Image Size doesn't recognize that. 

// Made image links open in new tab (target=_blank). Should ?scrapewholesite be the same? 
	// Did this break GIF resizing? E.g. http://herpes-me.tumblr.com/archive?startpage=2?pagesatonce=1?ezastumblrscrape?thumbnails doesn't show the _500 kanaya gif, because it stays _1280. 
// Modified values for finding last page, to improve average-case number of pages tested. 
// Added ranged scraping for breaking hugeass tumblrs into 1000-page blocks in scrapewholesite mode. 
// Fixed video scraping by visiting pages and getting opengraph URL. This will undoubtedly break on themes without opengraph, because Tumblr, but it should soldier on. 


//document.outerHTML = "";		// Debug hack: fighting e.g. http://shegnyanny.tumblr.com/ trying to redirect to https://www.tumblr.com/dashboard/blog/shegnyanny


// ------------------------------------ Global variables ------------------------------------ //







var highest_page = 0; 		// We need this global variable because GreaseMonkey still can't handle a button activating a function with parameters. It's used in scrape_whole_tumblr. 
var options_map = new Object(); 		// Associative array for ?key=value pairs in URL. 

	var url_index_array = new Array;
	var current_pending_connections = 0; 
	var interval_object; 

	// Here's the URL options currently used. Scalars are at their default values; boolean flags are all set to false. 
options_map[ "lastpage" ] = 0; 		// How many pages to scrape for image links when scouring the whole site. Useful for infinite-scrolling themes that can't be counted automatically. 
options_map[ "startpage" ] = 1; 		// Page to start at when browsing images. 
options_map[ "pagesatonce" ] = 10; 		// How many Tumblr pages to browse images from at once. 
options_map[ "thumbnails" ] = false; 		// For browsing mode, 240px-wide images v.s full-size. 
options_map[ "find" ] = ""; 		// What goes after the Tumblr URL. E.g. /tagged/art or /chrono. 

var mass_bulk_string = ""; 		// Trying to update the page just once, in case updating 1000x is more block-y than appending to a massive string 1000x. 







// ------------------------------------ Script start, general setup ------------------------------------ //





// First, determine if we're loading many pages and listing/embedding them, or if we're just adding a convenient button to that functionality. 
if( window.location.href.indexOf( 'ezastumblrscrape' ) > -1 ) {		// If we're scraping pages:
		// Replace Tumblr-standard Archive page with our own custom nonsense
	var title = document.title; 		// Keep original title for after we delete the original <head> 
	document.head.innerHTML = "";		// Delete CSS. We'll start with a blank page. 
	document.title = window.location.hostname + " - " + title;  

	document.body.outerHTML = "<div id='maindiv'><div id='fetchdiv'></div></div><div id='bottom_controls_div'></div>"; 		// This is our page. Top stuff, content, bottom stuff. 
	document.body.innerHTML += "<style>img{width:auto;} .thumbnails img{width:240px;}</style>"; 		// Auto by default, fixed-size if parent class includes 'thumbnails' 
	document.body.style.backgroundColor="#DDDDDD"; 		// Light grey BG to make image boundaries more obvious 
	var mydiv = document.getElementById( "maindiv" ); 		// I apologize for the generic name. This script used to be a lot simpler. 

		// Identify options in URL (in the form of ?key=value pairs) 
	var key_value_array = window.location.href.split( '?' ); 		// Knowing how to do it the hard way is less impressive than knowing how not to do it the hard way. 
	key_value_array.shift(); 		// The first element will be the site URL. Durrrr. 
	for( var dollarsign of key_value_array ) { 		// forEach( key_value_array ), including clumsy homage to $_ 
		var this_pair = dollarsign.split( '=' ); 		// Split key=value into [key,value] (or sometimes just [key])
		if( this_pair.length < 2 ) { this_pair.push( true ); } 		// If there's no value for this key, make its value boolean True 
		if( this_pair[1] == "false" ) { this_pair[1] = false; } 		// If the value is the string "false" then make it False - note fun with 1-ordinal "length" and 0-ordinal array[element]. 
			else if( !isNaN( parseInt( this_pair[1] ) ) ) { this_pair[1] = parseInt( this_pair[1] ); } 		// If the value string looks like a number, make it a number
		options_map[ this_pair[0] ] = this_pair[1]; 		// options_map.key = value 
	}
	if( options_map.find[ options_map.find.length - 1 ] == "/" ) { options_map.find = options_map.find.substring( 0, options_map.find.length - 1 ); } 
			// kludge - prevents example.tumblr.com//page/2 nonsense. 
	if( options_map.thumbnails ) { document.body.className = "thumbnails"; } 		// CSS approach to thumbnail sizing; className="" to toggle back to auto. 

		// Add tags to title, for archival and identification purposes
	document.title += options_map.find.split('/').join(' '); 		// E.g. /tagged/example/chrono -> "tagged example chrono" 

		// Go to image browser or link scraper according to URL options. 
	mydiv.innerHTML = "Not all images are guaranteed to appear.<br>"; 		// Thanks to JS's wacky accomodating nature, mydiv is global despite appearing in an if-else block. 
	if( options_map[ "scrapewholesite" ] ) { 
		scrape_whole_tumblr(); 		// Images from every page, presented as text links
	} else { 
		scrape_tumblr_pages(); 		// Ten pages of embedded images at a time
	}

} else { 		// If it's just a normal Tumblr page, add a link to the appropriate /ezastumblrscrape URL 

	// Add link(s) to the standard "+Follow / Dashboard" nonsense. Before +Follow, I think - to avoid messing with users' muscle memory. 
	// This is currently beyond my ability to dick with JS through a script in a plugin. Let's kludge it for immediate usability. 

	// kludge by Ivan - http://userscripts-mirror.org/scripts/review/65725.html 
	var url = window.location.protocol + "//" + window.location.hostname + "/archive?ezastumblrscrape?scrapewholesite?find=" + window.location.pathname; 		
		// Preserve /tagged/tag/chrono, etc. Also preserve http: vs https: via "location.protocol". 
	if( url.indexOf( "/page/chrono" ) < 0 ) { 		// Basically checking for posts /tagged/page, thanks to Detective-Pony. Don't even ask. 
		if( url.lastIndexOf( "/page/" ) > 0 ) { url = url.substring( 0, url.lastIndexOf( "/page/" ) ); } 		// Don't include e.g. /page/2. We'll add that ourselves. 
	}

	// Don't clean this up. It's not permanent. 
	var eLink = document.createElement("a");
	eLink.setAttribute("id","edit_link");
	eLink.setAttribute("style","position:absolute;top:26px;right:2px;padding:2px 0 0;width:50px;height:18px;display:block;overflow:hidden;-moz-border-radius:3px;background:#777;color:#fff;font-size:8pt;text-decoration:none;font-weight:bold;text-align:center;line-height:12pt;");
	eLink.setAttribute("href", url);
	eLink.appendChild(document.createTextNode("Scrape"));
	var elBody = document.getElementsByTagName("body")[0];
	elBody.appendChild(eLink);

	// Greasemonkey now supports user commands through its add-on menu! Thus: no more manually typing /archive?ezastumblrscrape?scrapewholesite on blogs with uncooperative themes. 
	GM_registerMenuCommand( "Scrape whole Tumblr blog", go_to_scrapewholesite );
}

function go_to_scrapewholesite() { 
	var site = window.location.protocol + "//" + window.location.hostname + "/archive?ezastumblrscrape?scrapewholesite?find=" + window.location.pathname; 
	window.location.href = site; 
}









// ------------------------------------ Whole-site scraper for use with DownThemAll ------------------------------------ //









// Monolithic scrape-whole-site function, recreating the original intent (before I added pages and made it a glorified multipage image browser) 
	// So for archiving, I need some kind of sister Perl script that goes 'foreach filename containing _500, if (regex _1280) exists, delete this _500 file.' 
function scrape_whole_tumblr() {
	var highest_known_page = 0;
	var site = window.location.protocol + "//" + window.location.hostname + options_map.find; 		// http: + // + example.tumblr.com + /tagged/sherlock 

	// Link to image-viewing version, preserving current tags
	mydiv.innerHTML += "<h1><a id='browse' href='" + options_url( "scrapewholesite", false ) + "?thumbnails'>Browse images (10 pages at once)</a><br></h1>"; 
	// Browse images instead (10 pages at once) / (1 page at once) / Show text links without duplicates (WIP) ?

		// Find out how many pages we need to scrape.
	if( isNaN( options_map.lastpage ) ) { options_map.lastpage = 0; } 
	highest_page = options_map.lastpage; 			// kludge. I'm lazy. 
	if( highest_page == 0 ) { 
		// Find upper bound in a small number of fetches. Ideally we'd skip this - some themes list e.g. "Page 1 of 24." I think that requires back-end cooperation. 
		mydiv.innerHTML += "Finding out how many pages are in <b>" + site.substring( site.indexOf( '/' ) + 2 ) + "</b>:<br><br>"; 		// Telling users what's going on. 
		for( var n = 100; n > 0 && n < 100001; n *= 10 ) { 		// 100,000 is an arbitrary upper bound. It used to arbitrarily be lower, and then I found some BIG tumblrs... 
			var siteurl = site + "/page/" + n + "/mobile"; 
			var xmlhttp = new XMLHttpRequest();
			xmlhttp.onreadystatechange=function() { 
				if( xmlhttp.readyState == 4 ) {
					if( xmlhttp.responseText.indexOf( "/page/" + (n+1) ) < 0 ) { 		// Does this page link to the next page? Pages too far will only link backwards.
						mydiv.innerHTML += siteurl + " is too high.<br>";
						highest_page = n;
						n = -1; 		// break for(n) loop 
					} else {
						mydiv.innerHTML += siteurl + " exists.<br>";
						highest_known_page = n; 
					}
				}
			}
			xmlhttp.open("GET", siteurl, false);		// false=synchronous, for linear execution. No point checking if a page is final if we've already sent requests for the next. 
			xmlhttp.send();
		}

		// Binary-search closer to the actual last page
		// 1000+ page examples: http://neuroticnick.tumblr.com/ -  http://teufeldiabolos.co.vu/ - http://actual-vriska.tumblr.com/ - http://cullenfuckers.tumblr.com/ - http://soupery.tumblr.com - some with 10,000 pages or more. 
		while( highest_page > highest_known_page + 10 ) {		// Arbitrary cutoff. We're just minimizing the range. A couple extra pages is reasonable; a hundred is excessive. 
			mydiv.innerHTML +="Narrowing down last page: ";
			var middlepage = parseInt( (highest_page + highest_known_page) / 2 ); 		// integer midpoint between highest-known and too-high pages

			var siteurl = site + "/page/" + middlepage + "/mobile"; 
			var xmlhttp = new XMLHttpRequest();
			xmlhttp.onreadystatechange=function() { 
				if( xmlhttp.readyState == 4 ) {
					if( xmlhttp.responseText.indexOf( "/page/" + (middlepage+1) ) < 0 ) { 		// Test for the presence of a link to the next page.
						mydiv.innerHTML += siteurl + " is high.<br>";
						highest_page = middlepage;
					} else {
						mydiv.innerHTML += siteurl + " exists.<br>";
						highest_known_page = middlepage; 
					}
				}
			}
			xmlhttp.open("GET", siteurl, false);		// false=synchronous, for linear execution. No point checking if a page is final if we've already sent requests for the next dozen. 
			xmlhttp.send();
		}
	}
	options_map.lastpage = highest_page; 
	document.getElementById( 'browse' ).href += "?lastpage=" + highest_page; 		// Add last-page indicator to Browse Images link

	if( options_map.grabrange ) { 		// If we're only grabbing a 1000-page block from a huge-ass tumblr:
		mydiv.innerHTML += "<br>This will grab 1000 pages starting at <b>" + options_map.grabrange + "</b>.<br><br>";
	} else { 		// If we really are describing the last page:
		mydiv.innerHTML += "<br>Last page is <b>" + options_map.lastpage + "</b> or lower.<br><br>";
	}

	if( options_map.lastpage > 1500 && !options_map.grabrange ) { 		// If we need to link to 1000-page blocks, and aren't currently inside one: 
		for( var x = 1; x < options_map.lastpage; x += 1000 ) { 		// For every 1000 pages...
//			var decade_url = window.location.href + "?startpage=" + x + "?lastpage=" + (x+999); 
			var decade_url = window.location.href + "?grabrange=" + x + "?lastpage=" + options_map.lastpage; 
			mydiv.innerHTML += "<a href='" + decade_url + "'>Pages " + x + "-" + (x+999) + "</a><br>"; 		// ... link a range of 1000 pages. 
		}
	}

		// Add button to scrape every page, one after another. 
		// Buttons within GreaseMonkey are a huge pain in the ass. I stole this from stackoverflow.com/questions/6480082/ - thanks, Brock Adams. 
	var button = document.createElement ('div');
	button.innerHTML = '<button id="myButton" type="button">Find image links from all pages</button>'; 
	button.setAttribute ( 'id', 'scrape_button' );		// I'm really not sure why this id and the above HTML id aren't the same property. 
	document.body.appendChild ( button ); 		// Add button (at the end is fine) 
	document.getElementById ("myButton").addEventListener ( "click", scrape_all_pages, false ); 		// Activate button - when clicked, it triggers scrape_all_pages() 
}

function scrape_all_pages() {		// Example code implies that this function /can/ take a parameter via the event listener, but I'm not sure how. 
	var button = document.getElementById( "scrape_button" ); 			// First, remove the button. There's no reason it should be clickable twice. 
	button.parentNode.removeChild( button ); 		// The DOM can only remove elements from a higher level. "Elements can't commit suicide, but infanticide is permitted." 

	if( !options_map.imagesonly ) { 
		options_map.showlinks = true; 		// For scrapewholesite, include page links by default. 
	}

	// We need to find "site" again, because we can't pass it. Activating a Greasemonkey function from a button borders on magic. Adding parameters is outright dark sorcery. 
	// Use a global variable, idiot. It's fine. Just do it. It's basically constant. (A closure-based workaround is sketched after this function, for reference.) 
	var site = window.location.protocol + "//" + window.location.hostname + options_map.find; 		// http: + // + example.tumblr.com + /tagged/sherlock

	mydiv.innerHTML += "Scraping page: <div id='pagecounter'></div><br>";		// This makes it easier to view progress, especially with AJAX preventing scrolling. 

	// Create divs for all pages' content, allowing asynchronous AJAX fetches
	for( var x = 1; x <= highest_page; x++ ) {
		var siteurl = site + "/page/" + x; 
		if( options_map.usemobile ) { siteurl += "/mobile"; } 		// If ?usemobile is flagged, scrape the mobile version.
		var page_tag = siteurl.substring( siteurl.indexOf( '/page/' ) ); 		// Should be e.g. '/page/2' or '/page/2/mobile' 

		var new_div = document.createElement( 'div' );
		new_div.id = '' + x; 
		document.body.appendChild( new_div );
	}


	// Fetch all pages with content on them
	var page_counter_div = document.getElementById( 'pagecounter' ); 		// Probably minor, but over thousands of laggy page updates, I'll take any optimization. 
	url_index_array = new Array;
	current_pending_connections = 0; 
	var begin_page = 1; 
	var end_page = highest_page; 
	if( !isNaN( options_map.grabrange ) ) { 		// If both ?startpage and ?lastpage are defined, grab only that range 
		begin_page = options_map.grabrange;
		end_page = options_map.grabrange + 999; 		// NOT plus 1000. Stop making that mistake. First page + 999 = 1000 total. 
		if( end_page > options_map.lastpage ) { end_page = options_map.lastpage; } 		// Kludge 
		document.title += " " + (parseInt( begin_page / 1000 ) + 1);		// Change page title to indicate which block of pages we're saving
	}
	for( var x = begin_page; x <= end_page; x++ ) {
		var siteurl = site + "/page/" + x; 
		if( options_map.usemobile ) { siteurl += "/mobile"; } 		// If ?usemobile is flagged, scrape the mobile version. No theme shenanigans... but also no photosets. Sigh. 
		//page_counter_div.innerHTML = " " + x; 

		//asynchronous_fetch( siteurl, x ); 			// Sorry for the function spaghetti. Scrape_all_pages exists so a thousand pages aren't loaded in the background, and asynchronous_fetch prevents race conditions.
		url_index_array.push( [siteurl, x] ); 
	}
	interval_object = window.setInterval( fetch_when_ready, 100 );
	//document.getElementById( 'pagecounter' ).innerHTML += "<br>Done. Use DownThemAll (or a similar plugin) to grab all these links.";
}
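
// For the record, a closure can smuggle parameters into a button handler - a sketch of that workaround (add_scrape_button_with_parameter is hypothetical and never called; the script sticks with globals): 
function add_scrape_button_with_parameter( site ) {
	var button = document.createElement( "button" );
	button.textContent = "Find image links from all pages";
	button.addEventListener( "click", function() { scrape_all_pages( site ); }, false ); 		// The anonymous wrapper captures 'site'. (scrape_all_pages currently ignores arguments and reads globals anyway.) 
	document.body.appendChild( button );
}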

function fetch_when_ready() {
	//console.log( 'glub' ); 
	/*
	while( current_pending_connections < 3 && url_index_array.length > 0 ) {
		var sprog = url_index_array.shift(); 
		asynchronous_fetch( sprog[0], sprog[1] ); 
		current_pending_connections++; 
		document.getElementById( 'pagecounter' ).innerHTML = " " + sprog[1];
	} 
	*/
	if( current_pending_connections < 10 && url_index_array.length > 0 ) {
		var sprog = url_index_array.shift(); 
		asynchronous_fetch( sprog[0], sprog[1] ); 
		current_pending_connections++; 
		document.getElementById( 'pagecounter' ).innerHTML = " " + sprog[1];
	}
	if( url_index_array.length == 0 ) {
		window.clearInterval( interval_object ); 
		//console.log( 'ribbit' ); 
		document.getElementById( 'pagecounter' ).innerHTML += "<br>Done. Use DownThemAll (or a similar plugin) to grab all these links.";
	}
}

function asynchronous_fetch( siteurl, page ) {		// separated into another function to prevent race condition (i.e. variables changing while asynchronous request is happening) 
	var xmlhttp = new XMLHttpRequest();		// AJAX object
	xmlhttp.onreadystatechange = function() {		// When the request returns, this anonymous function will trigger (repeatedly, for various stages of the reply)
		if( xmlhttp.readyState == 4 ) {		// Don't do anything until we're done downloading the page.
			var url_array = soft_scrape_page( xmlhttp.responseText );		// turn HTML dump into list of URLs

			// Print URLs so DownThemAll (or similar) can grab them
			var bulk_string = "<br><a href='" + siteurl + "'>" + siteurl + "</a><br>"; 		// Repeatedly adding to innerHTML kills performance, so fill this "digest" and add it all. 
			for( var n = 0; n < url_array.length; n++ ) {
				var image_url = url_array[n][1]; 		// url_array is an array of 2-element arrays. each inner array goes <url, position on page>. 

				// Animated GIFs don't get resized, but still images do, so let's include the original size before altering image_url. 
				if( image_url.indexOf( '.gif' ) > -1 ) { 
					bulk_string += "<a href=" + image_url + ">" + image_url + "</a><br>"; 
				}

				// Some lower-size images are just automatically resized. We'll change the URL to the maximum size just in case, and Tumblr will provide the highest resolution. 
				image_url = image_url.replace( "_540.", "_1280." );  
				image_url = image_url.replace( "_500.", "_1280." );  
				image_url = image_url.replace( "_400.", "_1280." );  
				image_url = image_url.replace( "_250.", "_1280." );  
				image_url = image_url.replace( "_100.", "_1280." );  

				if( options_map.plaintext ) { 
					bulk_string += image_url + '<br>'; 		// Hopefully this reduces strain on Firefox. It leaks and gets weird past about 10,000 pages with links enabled. 
				} else { 
					bulk_string += "<a href=" + image_url + ">" + image_url + "</a><br>";		// "These URLs don't need to be links, but why not?" 13K-page Tumblrs is why not.
				}
				
			}
			var page_div = document.getElementById( '' + page );
			page_div.innerHTML = bulk_string; 

			current_pending_connections--;

//			bulk_string = ""; 		// Debug-ish - garbage collection doesn't seem reliable, RAM use is bloated for 10K+ pages. (No difference.) 
//			url_array = ""; 
		}
	}
	// Every tenth page, synchronous? Wonky timing trick. 
	if( page % 10 == 0 ) { 
		xmlhttp.open("GET", siteurl, false);		// This should be "true" for asynchronous at some point, but naively, it spams hundreds of GETs per second. 
	} else { 
		xmlhttp.open("GET", siteurl, true);
	}
	xmlhttp.send();
}

// Fetch function to replace all fetch functions: take string, return list 
// If indexof('photoset') then treat as a photoset and return image urls. If indexof('video') then treat as a video and return video url. Etc. 
function universal_fetch( siteurl, asynchronous ) { 
	var xmlhttp = new XMLHttpRequest();		// AJAX object
	xmlhttp.onreadystatechange = function() {		// When the request returns, this anonymous function will trigger (repeatedly, for various stages of the reply)
		if( xmlhttp.readyState == 4 ) {		// Don't do anything until we're done downloading the page.
			var urls = new Array; 
			if( siteurl.indexOf( '/photoset' ) > 0 ) { 
				/*
				// Fetch photoset iframes and put their constituent images in url_array
				if( image_url.indexOf( '/photoset_iframe/' ) > -1 ) { 
					var photoset_xml = new XMLHttpRequest();
					photoset_xml.onreadystatechange = function() {
						if( photoset_xml.readyState == 4 ) { 		// When loaded
							var photo_link_array = photoset_xml.responseText.split( 'href="' ); 		// Doublequotes are sitewide-standard for photosets 
							for( var n = 1; n < photo_link_array.length; n++ ) { 
								var photo_link = photo_link_array[n].substring( 0, photo_link_array[n].indexOf( '"' ) ) + "#photoset"; 		// Isolate link with doublequote terminator, tag as a photoset
								if( n == 1 ) { photo_link += "#" + image_url; } 		// Tag first image in set with photoset URL so browse mode can link to it 
								var sort_order = parseFloat( (0.01 * n) + x ); 
								url_array.push( [ sort_order, photo_link ] );  		// "x + 1 - 1/n" for order on page. E.g. 8.5, 8.333, 8.25, shit they'll sort backwards goddammit. 
							}
						}
					}
					photoset_xml.open("GET", image_url, false);
					photoset_xml.send();

					image_url = ""; 		// Prevent any further action using this URL
				}
				*/
			} /// else if
		}
	}
	xmlhttp.open("GET", siteurl, synchronous);		// This should probably be "true" for asynchronous at some point, but naively, it spams hundreds of GETs per second. This spider script shouldn't act like a DDOS.
	xmlhttp.send();
}









// ------------------------------------ Multi-page scraper with embedded images ------------------------------------ //









function scrape_tumblr_pages() { 		// Create a page where many images are displayed as densely as seems sensible 
		// Figure out which site we're scraping
	var site = window.location.protocol + "//" + window.location.hostname + options_map.find; 		// http: + // + example.tumblr.com + /tagged/sherlock

	var next_link = options_url( "startpage", options_map.startpage + options_map.pagesatonce ); 
	var prev_link = options_url( "startpage", options_map.startpage - options_map.pagesatonce ); 
	options_url( "startpage", 1000 ); 		// debug - I think I'm getting side-effects from copy_map 

	if( !isNaN( parseInt( options_map.startpage ) ) && options_map.startpage <= 1 ) {
		options_map.startpage = 1; 		// Reset in case it's screwy. Negative numbers work, but all return page 1 anyway. 
		var prev_next_controls = "<br><a href='" + next_link + "'>Next >>></a><br><br>"; 
	} else {
		var prev_next_controls = "<br><a href='" + prev_link + "'><<< Previous</a> - <a href='" + next_link + "'>Next >>></a><br><br>"; 
	}
	mydiv.innerHTML += prev_next_controls; 
	document.getElementById("bottom_controls_div").innerHTML += prev_next_controls;

		// Link to the thumbnail page or full-size-image page as appropriate
	if( options_map.thumbnails ) { mydiv.innerHTML += "<a href='"+ options_url( "thumbnails", false ) + "'>Switch to full-size images</a>"; }
		else { mydiv.innerHTML += "<a href='"+ options_url( "thumbnails", true ) + "'>Switch to thumbnails</a>"; }

		// Toggle thumbnails via CSS, hopefully alter options_map accordingly
	mydiv.innerHTML += " - <a href='javascript: void(0);' onclick=\"(function(o){ \
		if( document.body.className == '' ) { \
			document.body.className = 'thumbnails'; } \
		else { \
			document.body.className = ''; \
		} })(this)\">Toggle image size</a>"; 

	if( options_map.pagesatonce == 1 ) { mydiv.innerHTML += " - <a href='"+ options_url( "pagesatonce", 10 ) + "'>Show ten pages at once</a>"; }
		else { mydiv.innerHTML += " - <a href='"+ options_url( "pagesatonce", 1 ) + "'>Show one page at once</a>"; } 
	mydiv.innerHTML += " - <a href='"+ options_url( "scrapewholesite", true ) + "'>Scrape whole Tumblr</a><br>";

		// Grab several pages and extract/embed images. 
	start_page = parseInt( options_map.startpage ); 		// debug-ish. I'll use these more directly soon enough. 
	number_of_pages_at_once = parseInt( options_map.pagesatonce ); 
	for( x = start_page; x < start_page + number_of_pages_at_once; x++ ) {
		var siteurl = site + "/page/" + x; 
		if( options_map.usemobile ) { siteurl += "/mobile"; } 		// If ?usemobile is flagged, scrape mobile version. No theme shenanigans... but also no photosets. Sigh. 
		mydiv.innerHTML += "<hr><b>Page " + x + " fetched</b><br><div id='" + siteurl + "'></div>";		// TODO: Sanitize the URL here and in fetch_page. It's just a unique ID. 
		
		fetch_page( siteurl, mydiv );		// I'd rather do this right here, but unless the AJAX mess is inside its own function, matching a responseText to its siteurl is intractable. 
	}
}

function fetch_page( siteurl, mydiv ) {		// Grab a page, scrape its image URLs, and embed them for easy browsing
	var xmlhttp = new XMLHttpRequest();		// AJAX object
	xmlhttp.onreadystatechange = function() {		// When the request returns, this anonymous function will trigger (repeatedly, for various stages of the reply)
		if( xmlhttp.readyState == 4 ) {		// Don't do anything until we're done downloading the page.
			var thisdiv = document.getElementById( siteurl );		// identify the div we printed for this page 		// TODO: Sanitize, as above. Code execution through this niche script is unlikely, but why keep it possible? 
			thisdiv.innerHTML += "<a href='" + siteurl + "'>" + siteurl + "</a><br>";		// link to page, in case you want to see something in-situ (e.g. for proper sourcing) 
			var div_digest = ""; 		// Instead of updating each div's HTML for every image, we'll lump it into one string and update the page once per div. (Twice, counting the page link immediately above this.) 
			var video_array = new Array;
			var outlink_array = new Array;
			var inlink_array = new Array;

			var url_array = soft_scrape_page( xmlhttp.responseText );		// turn HTML dump into list of URLs
			url_array.push( [0, 'this is a kludge'] ); 		// Insert fake final item so url_array[n] doesn't shit itself when the last item is a video/offsite/local link 

			// Separate links
			for( var n = url_array.length-1; n >=0; n-- ) { 
				if( url_array[n][1].indexOf( '#video' ) > -1 ) { video_array.push( url_array[n][1] ); url_array.splice( n, 1 ); } 
				if( url_array[n][1].indexOf( '#offsite' ) > -1 ) { outlink_array.push( url_array[n][1] ); url_array.splice( n, 1 ); } 
				if( url_array[n][1].indexOf( '#local' ) > -1 ) { inlink_array.push( url_array[n][1] ); url_array.splice( n, 1 ); } 
			}
			url_array.pop(); 		// Get rid of fake final item

			// Display video links, if there are any
			for( var n = 0; n < video_array.length; n++ ) {
				div_digest += "Video: <a href='" + video_array[n] + "'>" + video_array[n] + "</a><br>  "; 		// Link the video. 
			}

			// Display page links, if the ?showlinks flag is enabled 
			outlink_array.sort( function(a,b) { return a[0] - b[0]; } ); 		// sort array of [counter, url] sub-arrays by counter value 
			inlink_array.sort( function(a,b) { return a[0] - b[0]; } ); 
			if( options_map.showlinks ) { 
				div_digest += "Outgoing links: ";
				for( var n = 0; n < outlink_array.length; n++ ) { div_digest += "<a href='" + outlink_array[n].replace('#offsite#link', '') + "'>O" + (n+1) + "</a>  "; }
				div_digest += "<br>" + "Same-Tumblr links: ";
				for( var n = 0; n < inlink_array.length; n++ ) { div_digest += "<a href='" + inlink_array[n].replace('#local#link', '') + "'>T" + (n+1) + "</a>  "; } 
				div_digest += "<br>";
			}

			// Embed high-res images to be seen, clicked, and saved
			for( var n = 0; n < url_array.length; n++ ) {
				var image_url = url_array[n][1]; 		// Ease-of-coding hack. 

				// For images which might have been automatically resized, assume the highest resolution exists, and change the URL accordingly.
				image_url = image_url.replace( "_540.", "_1280." ); 		// No need to check for indexOf _540, because replace fails politely. 
				image_url = image_url.replace( "_500.", "_1280." );  
				image_url = image_url.replace( "_400.", "_1280." );  
				image_url = image_url.replace( "_250.", "_1280." );  
				image_url = image_url.replace( "_100.", "_1280." );  

					// This clunky <img onError> function looks for a lower-res image if the high-res version doesn't exist. 
				var on_error = 'if(this.src.indexOf("_1280")>0){this.src=this.src.replace("_1280","_500");}';		// Swap 1280 for 500
				on_error += 'else if(this.src.indexOf("_500")>0){this.src=this.src.replace("_500","_400");}';		// Or swap 500 for 400
				on_error += 'else if(this.src.indexOf("_400")>0){this.src=this.src.replace("_400","_250");}';		// Or swap 400 for 250
				on_error += 'else{this.src=this.src.replace("_250","_100");this.onerror=null;}';							// Or swap 250 for 100, then give up
				on_error += 'document.getElementById("' + image_url + '").href=this.src;'; 		// Link the image to itself, regardless of size

					// Embed images (linked to themselves) and link to photosets
				if( image_url.indexOf( "#" ) > 0 ) { 		// for photosets, print the photoset link.
					var photoset_url = image_url.substring( image_url.lastIndexOf( "#" ) + 1 ); 		
						// separate everything past the last hash - it's like http://tumblr.com/image#photoset#http://tumblr.com/photoset_iframe
					if( photoset_url.substring(0, 4) == "http" ) { div_digest += " <a href='" + photoset_url + "'>Set:</a>"; } 		
						// if the #photoset tag is followed by an #http URL, link the URL 
				} 
				div_digest += "<a id='" + image_url + "' target='_blank' href='" + image_url + "'><img alt='(Waiting for image)' onerror='" + on_error + "' src='" + image_url + "'></a>  "; 
			}
			div_digest += "<br><a href='" + siteurl + "'>(End of " + siteurl + ")</a>";		// Another link to the page, because I'm tired of scrolling back up. 
			thisdiv.innerHTML += div_digest; 
		}
	}
	xmlhttp.open("GET", siteurl, true);		// True = asynchronous. Finally got the damn thing to work! It's a right bitch to do in an inline function. JS scopes are screwy as hell. 
	xmlhttp.send();
}









// ------------------------------------ Universal page-scraping function (and other helper functions) ------------------------------------ //










function soft_scrape_page_redux( html_copy ) { 		
	var url_array = new Array(); 

		// Aha: there IS multi-splitting, using regexes as the delimiter. E.g. "Hello, awesome world!".split(/[\s,]+/); for splitting on spaces and commas. 
	//var http_array = html_copy.split( 'http' ); 
	var http_array = html_copy.split( /['"]http/ ); 
	for( var x in http_array ) { 
		//console.log( http_array[n].substring( 0, http_array[n].indexOf( /['"]/ ) ) );
		//var url = http_array[n].substring( 0, http_array[n].indexOf( /['"]/ ) ); 		// Regexes don't work in indexOf because fuck you. 
		var delimiter = http_array[x].indexOf( '"' ); 
		var delimiter2 = http_array[x].indexOf( "'" ); 
		if( delimiter2 != -1 && delimiter2 < delimiter ) { delimiter = delimiter2; } 
		var url = "http" + http_array[x].substring( 0, delimiter ); 
//		console.log( url ); 
//		http_array[x] = url; 

		// Fetch photoset iframes and put their constituent images in url_array
		// Error console keeps nagging me about synchronicity. Can I fob this off to a function (fetch_photoset(url)) and update the page later? Maybe just store each url_array in some self-updating fashion? 
		if( url.indexOf( '/photoset_iframe/' ) > -1 ) { 
			var photoset_xml = new XMLHttpRequest();
			photoset_xml.onreadystatechange = function() {

				if( photoset_xml.readyState == 4 ) { 		// When loaded
					var photo_link_array = photoset_xml.responseText.split( 'href="' ); 		// Doublequotes are sitewide-standard for photosets 
					for( var n = 1; n < photo_link_array.length; n++ ) { 
						var photo_link = photo_link_array[n].substring( 0, photo_link_array[n].indexOf( '"' ) ) + "#photoset"; 		// Isolate link with doublequote terminator, tag photoset
						if( n == 1 ) { photo_link += "#" + url; } 		// Tag first image in set with photoset URL so browse mode can link to it 
						var sort_order = parseFloat( (0.01 * n) + x ); 		// This is completely fucked. 
						sort_order = x; 
						url_array.push( [ sort_order, photo_link ] );  		// "x + 1 - 1/n" for order on page. E.g. 8.5, 8.333, 8.25, shit they'll sort backwards goddammit. 
					}
				}

			}
			photoset_xml.open("GET", url, false); 
			photoset_xml.send();

			//console.log( url ); 
			url = ""; 		// Prevent any further action using this URL
		}

		// Fetch video iframes and put their (modified) video file addresses in url_array
		if( url.indexOf( ".tumblr.com/video/" ) > -1 ) { 
			var subdomain = url.split( '/' ); 		// E.g. https://www.tumblr.com/video/examplename/123456/500/ -> https,,www.tumblr.com,video,examplename,123456,500
			var video_iframe = window.location.protocol + "//" + subdomain[4] + ".tumblr.com/video/" + subdomain[4] + "/" + subdomain[5] + "/" + subdomain[6]; 		
				// e.g. http://examplename.tumblr.com/video/examplename/123456/500/ 
				// Offsite tumblrs probably fail at this. I need to figure out crossorigin="anonymous" or whatever. CORS is a pain in my ass. 

			var video_xml = new XMLHttpRequest(); 		// Fetch video iframe, get actual video-file address
			video_xml.onreadystatechange = function() {	
				if( video_xml.readyState == 4 ) {		// When loaded
					var video_pointer = video_xml.responseText.indexOf( '<video' ); 		// Jump to <video> tag
					video_pointer = video_xml.responseText.indexOf( 'src="', video_pointer ) + 5; 		// Jump to src for <video> tag
					var video_url = video_xml.responseText.substring( video_pointer, video_xml.responseText.indexOf( '"', video_pointer ) ); 	// Isolate URL to doublequote terminator
					video_url += ".mp4" + "#video"; 		// They're all MP4, right? Also tag it so browse-mode can handle videos separately. 
					url_array.push( [0, video_url] ); 		// 0 as debug. Display order isn't terribly important. 
				}
			}
			video_xml.open("GET", video_iframe, false);
			video_xml.send();

			url = ""; 		// Prevent any further action using this URL
		}



		// Blacklist filter 
		if( url.substring( 0,1 ) == "/" ) { url = ""; } 		// Exclude links within the same blog 
		if( url.indexOf( "/reblog/" ) > 0 ) { url = ""; }
		if( url.indexOf( "/tagged/" ) > 0 ) { url = ""; }    
		if( url.indexOf( ".tumblr.com/avatar_" ) > 0 ) { url = ""; }    
		if( url.indexOf( ".tumblr.com/image/" ) > 0 ) { url = ""; }  
		if( url.indexOf( ".tumblr.com/rss" ) > 0 ) { url = ""; }  
		if( url.indexOf( "assets.tumblr.com" ) > 0 ) { url = ""; }  
		if( url.indexOf( "-app://" ) > 0 ) { url = ""; }  

		// Ditch anything that isn't an image file 
		var image_link = false; 
		if( url.indexOf( ".gif" ) > 0 ) { image_link = true; }
		if( url.indexOf( ".jpg" ) > 0 ) { image_link = true; }
		if( url.indexOf( ".jpeg" ) > 0 ) { image_link = true; }
		if( url.indexOf( ".png" ) > 0 ) { image_link = true; }
		if( image_link == false ) { url = ""; } 

		// Whitelist offsite images from certain domains
		if( url.indexOf( ".tumblr.com" ) < 0 ) { 		// note that this test is different from the others - we blank image_url if the search term is not found, instead of blanking if it is found
			var whitelist = false; 
			if( url.indexOf( "deviantart.net" ) > 0 ) { whitelist = true; } 		// this is a sloppy whitelist of non-tumblr domains 
			if( url.indexOf( "imgur.com" ) > 0 ) { whitelist = true; } 
			if( url.indexOf( "imageshack.com" ) > 0 ) { whitelist = true; } 
			if( url.indexOf( "imageshack.us" ) > 0 ) { whitelist = true; } 
			if( url.indexOf( "tinypic.com" ) > 0 ) { whitelist = true; } 
			if( url.indexOf( "gifninja.com" ) > 0 ) { whitelist = true; } 
			if( url.indexOf( "photobucket.com" ) > 0 ) { whitelist = true; } 
			if( url.indexOf( "dropbox.com" ) > 0 ) { whitelist = true; } 
			if( url.indexOf( "flickr.com" ) > 0 ) { whitelist = true; } 
			if( whitelist == false ) { url = ""; } 
		} 

		if( url != "" ) { url_array.push( [ x, url ] ); } 
	}
		
	// Remove duplicate URLs which appear more than once in the same page
//	if( !options_map.scrapewholesite && !options_map.leavedupes) { 		// If we're browsing images, instead of scraping the whole site
	if( !options_map.leavedupes ) { 		// Unless the user asked to keep duplicates 
		url_array.sort( function(a,b) { return a[1] < b[1] ? -1 : ( a[1] > b[1] ? 1 : 0 ); } ); 		// Sort our list of [string_counter, image_url] elements by image_url - the comparator must return a number, not a boolean 
		for( x = url_array.length-1; x >= 1; x-- ) { 		// Foreach url_array, backwards
			if( url_array[x][1] === url_array[x-1][1] ) { url_array.splice( x, 1 ); }  		// If two URLs in a row match, remove one. Doesn't matter which.
		}
	}

	url_array.sort( function(a,b) { return a[0] - b[0]; } ); 		// given two [string_counter, image_url] elements, sort ascending by string_counter (a comes first when a[0] - b[0] < 0) 

	if( options_map.newscrape ) { return url_array; } else { return soft_scrape_page_2( html_copy ); } 


}
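


// The synchronous photoset fetch above blocks the page and triggers the console warnings mentioned in the TODO inside that loop. 
// Below is a minimal sketch of one possible asynchronous alternative - the fetch_photoset name comes from that TODO, but the 
// (position, on_done) signature and callback shape are assumptions for illustration; nothing calls this yet. 
function fetch_photoset( photoset_url, position, on_done ) { 
	var request = new XMLHttpRequest(); 
	request.onreadystatechange = function() { 
		if( request.readyState == 4 && request.status == 200 ) { 		// When loaded successfully
			var links = request.responseText.split( 'href="' ); 		// Doublequotes are sitewide-standard for photosets 
			var results = []; 
			for( var n = 1; n < links.length; n++ ) { 
				var photo_link = links[n].substring( 0, links[n].indexOf( '"' ) ) + "#photoset"; 		// Isolate link, tag as photoset 
				if( n == 1 ) { photo_link += "#" + photoset_url; } 		// Tag first image with the photoset URL, as above 
				results.push( [ position + (0.01 * n), photo_link ] ); 		// Keep page order within the set 
			} 
			on_done( results ); 		// Hand the [sort_order, url] pairs back; the caller would append them to url_array and refresh the page 
		} 
	} 
	request.open( "GET", photoset_url, true ); 		// true = asynchronous, so the UI thread isn't blocked 
	request.send(); 
} 
// Hypothetical usage: fetch_photoset( url, x, function( pairs ) { url_array = url_array.concat( pairs ); } ); 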




// This scrapes all embedded images, iframe photosets, iframe videos, linked image files, and (optionally) interesting links into an array. Never guaranteed exhaustive.
function soft_scrape_page( html_copy ) { 		
	var url_array = new Array(); 
	var image_array = new Array(); 		// Bare images specifically, to be filtered with a blacklist and whitelist 

	// Grab all <img src> and <iframe src> URLs
	var src_array = html_copy.split( 'src=' ); 		// Catches both src="..." and src='...' - the first character after the split tells us which terminator to expect. 
	for( var x = 1; x < src_array.length; x++ ) {
		var terminator = src_array[x].substring( 0,1 ); 		// Quote or doublequote
		var image_url = src_array[x].substring( 1, src_array[x].indexOf( terminator, 1 ) ); 		// "url">etc --> url 
		
		// Fetch photoset iframes and put their constituent images in url_array
		if( image_url.indexOf( '/photoset_iframe/' ) > -1 ) { 
			var photoset_xml = new XMLHttpRequest();
			photoset_xml.onreadystatechange = function() {
				if( photoset_xml.readyState == 4 ) { 		// When loaded
					var photo_link_array = photoset_xml.responseText.split( 'href="' ); 		// Doublequotes are sitewide-standard for photosets 
					for( var n = 1; n < photo_link_array.length; n++ ) { 
						var photo_link = photo_link_array[n].substring( 0, photo_link_array[n].indexOf( '"' ) ) + "#photoset"; 		// Isolate link with doublequote terminator, tag as a photoset
						if( n == 1 ) { photo_link += "#" + image_url; } 		// Tag first image in set with photoset URL so browse mode can link to it 
						var sort_order = parseFloat( (0.01 * n) + x ); 
						url_array.push( [ sort_order, photo_link ] );  		// x + 0.01n keeps set members in page order, just after position x (e.g. 8.01, 8.02, 8.03) 
					}
				}
			}
			photoset_xml.open("GET", image_url, false);
			photoset_xml.send();

			image_url = ""; 		// Prevent any further action using this URL
		}

		// Fetch video iframes and put their (modified) video file addresses in url_array
		if( image_url.indexOf( ".tumblr.com/video/" ) > -1 && !options_map.imagesonly ) {
			var subdomain = image_url.split( '/' ); 		// E.g. https://www.tumblr.com/video/examplename/123456/500/ -> https,,www.tumblr.com,video,examplename,123456,500
			var video_post = window.location.protocol + "//" + subdomain[4] + ".tumblr.com/post/" + subdomain[5] + "/"; 
				// e.g. http://examplename.tumblr.com/post/123456/ - note window.location.protocol vs. subdomain[0], maintaining http/https locally 

			var video_xml = new XMLHttpRequest(); 		// Fetch video post page, get actual video-file address via preview filename
			video_xml.onreadystatechange = function() {	
				if( video_xml.readyState == 4 ) {		// When loaded
					var og_pointer = video_xml.responseText.indexOf( 'http://media.tumblr.com' ); 		// Get OpenGraph preview image URL
					var video_name = video_xml.responseText.substring( og_pointer, video_xml.responseText.indexOf( '"', og_pointer ) ); 		// Terminate at doublequote
					video_name = video_name.substring( video_name.lastIndexOf( '/' ) + 1 ); 

					// E.g. tumblr_abcdef12345_frame1.jpg -> tumblr_abcdef12345.mp4 (a standalone sketch of this derivation follows the function)
					video_name = video_name.substring( 0, video_name.indexOf( '_', 8 ) ) + ".mp4#video"; 		// Skip the underscore in 'tumblr_', truncate at the next one (before '_frame') 
					video_name = "https://vt.tumblr.com/" + video_name; 
					url_array.push( [0, video_name] ); 		// Should be e.g. https://vt.tumblr.com/tumblr_abcdef12345.mp4 
				}
			}
			video_xml.open("GET", video_post, false);
			video_xml.send();

			image_url = ""; 		// Prevent any further action using this URL
		}

		// Anything else must be a bare image. 
		image_array.push( [x, image_url] ); 
	}

	// Grab all <a href> URLs 
	// Problem: http://pinterest.com/pin/create/button/?url=http%3A%2F%2Ftmblr.co%2FZC24zx14XkIft&media=http%3A%2F%2F41.media.tumblr.com%2F6cc3f9c2842ae2cac84fb6514a416200%2Ftumblr_mzj17rx5gg1qfv2fjo1_100.png 
	// Blacklist above is not applied to apparent 'images' here. Shit. 
	var href_array = html_copy.split( 'href=' ); 
	for( var x = 1; x < href_array.length; x++ ) { 
		var link_url = href_array[x];

		//var terminator = link_url.substring( 0,1 ); 		// Singlequote or doublequote 
		//link_url = link_url.substring( 1, link_url.indexOf( terminator, 1 ) ); 

			// Test new code on Soupery.tumblr.com, but with ?lastpage=10 (or lower). It got weird. (Erg. Can't replicate weirdness for testing.) 
			// Odd empty spaces from abbydraws.tumblr.com - same number of image links. 
			// Wonky stuff on nevernoahh.tumblr.com, because some theme link goes 'href= "http[etc]'. 
		link_url = link_url.substring( 1 ); 		// Remove first character, since it ought to be singlequote or doublequote
		var string_terminators = ['"', "'", ">", "<", "\n"]; 
		for( var y = 0; y < string_terminators.length; y++ ) {

			if( link_url.indexOf( string_terminators[y] ) > 0 ) { 		// If a terminator is found, 
				link_url = link_url.substring( 0, link_url.indexOf( string_terminators[y] ) ); 		// then terminate the string. 
			}
		}



		var local_link = false;
		var image_link = false; 

		// Blacklist filter 
		if( link_url.substring( 0,1 ) == "/" ) { link_url = ""; } 		// Exclude links within the same blog 
		if( link_url.indexOf( "/reblog/" ) > 0 ) { link_url = ""; }
		if( link_url.indexOf( "/tagged/" ) > 0 ) { link_url = ""; }    
		if( link_url.indexOf( ".tumblr.com/avatar_" ) > 0 ) { link_url = ""; }    
		if( link_url.indexOf( ".tumblr.com/image/" ) > 0 ) { link_url = ""; }  
		if( link_url.indexOf( ".tumblr.com/rss" ) > 0 ) { link_url = ""; }  
		if( link_url.indexOf( "assets.tumblr.com" ) > 0 ) { link_url = ""; }  
		if( link_url.indexOf( "-app://" ) > 0 ) { link_url = ""; }  

		// Treat intra-blog links differently 
		if( link_url.indexOf( window.location.host ) > 0 ) { local_link = true; } 

		// Treat direct image links differently 
		if( link_url.indexOf( ".gif" ) > 0 ) { image_link = true; }
		if( link_url.indexOf( ".jpg" ) > 0 ) { image_link = true; }
		if( link_url.indexOf( ".jpeg" ) > 0 ) { image_link = true; }
		if( link_url.indexOf( ".png" ) > 0 ) { image_link = true; }

		var sort_order = 0.01 * x; 		// Small fractional offset, just for sortability (only stays below 1 while there are fewer than 100 links) 

		if( link_url != "" ) { 
			if( image_link ) { image_array.push( [ -1.0 - sort_order, link_url ] ); } 		// Regular image, funny sorting order. Bin it so it can be filtered by the next step. 
				else if( local_link && options_map.showlinks ) { url_array.push( [ -2.0 - sort_order, link_url + "#local#link" ] ); } 		// Local link (if flag is set) 
				else if ( options_map.showlinks ) { url_array.push( [ -3.0 - sort_order, link_url + "#offsite#link" ] ); } 		// Offsite link (if flag is set)
		}
	}

	// Blacklist Tumblr URLs, whitelist offsite URLs 
	for( var x = 0; x < image_array.length; x++ ) {
		image_url = image_array[x][1]; 		// Ease of coding hack
		
			// Exclude a bunch of useless nonsense with a blacklist
		if( image_url.indexOf( "assets.tumblr.com" ) > -1 ) { image_url = ""; } 		// let's ignore avatar icons and Tumblr stuff.
		if( image_url.indexOf( "static.tumblr.com" ) > -1 ) { image_url = ""; } 		// Not actually filtering? See http://durma.tumblr.com/archive?lastpage=194?startpage=53?pagesatonce=1?thumbnails?ezastumblrscrape?showlinks - getting CSS guff as "outgoing links." 
		if( image_url.indexOf( "srvcs.tumblr.com" ) > -1 ) { image_url = ""; } 
		if( image_url.indexOf( "www.tumblr.com" ) > -1 ) { image_url = ""; } 
		if( image_url.indexOf( "/avatar_" ) > 0 ) { image_url = ""; } 
		if( image_url.indexOf( "%3A%2F%2F" ) > -1 ) { image_url = ""; } 		// Maybe remove this 

			// Include potentially interesting nonsense with a whitelist
			// General offsite whitelist would include crap like Facebook buttons, Twitter icons, etc. 
		if( image_url.indexOf( ".tumblr.com" ) < 0 ) { 		// note that this test is different from the others - we blank image_url if the search term is not found, instead of blanking if it is found
			var original_image_url = image_url; 
			image_url = ""; 

			if( original_image_url.indexOf( "deviantart.net" ) > 0 ) { image_url = original_image_url; } 		// this is a sloppy whitelist of non-tumblr domains 
			if( original_image_url.indexOf( "imgur.com" ) > 0 ) { image_url = original_image_url; } 
			if( original_image_url.indexOf( "imageshack.com" ) > 0 ) { image_url = original_image_url; } 
			if( original_image_url.indexOf( "imageshack.us" ) > 0 ) { image_url = original_image_url; } 
			if( original_image_url.indexOf( "tinypic.com" ) > 0 ) { image_url = original_image_url; }			// this originally read "tinypic.com1", but I assume I was drunk. 
			if( original_image_url.indexOf( "gifninja.com" ) > 0 ) { image_url = original_image_url; } 
			if( original_image_url.indexOf( "photobucket.com" ) > 0 ) { image_url = original_image_url; } 
			if( original_image_url.indexOf( "dropbox.com" ) > 0 ) { image_url = original_image_url; } 
			if( original_image_url.indexOf( "flickr.com" ) > 0 ) { image_url = original_image_url; } 
		} 

		if( image_url != "" ) { url_array.push( image_array[x] ); } 
	}

	// Remove duplicate URLs which appear more than once in the same page
	if( !options_map.scrapewholesite && !options_map.leavedupes) { 		// If we're browsing images, instead of scraping the whole site
		url_array.sort( function(a,b) { return a[1] < b[1] ? -1 : ( a[1] > b[1] ? 1 : 0 ); } ); 		// Sort our list of [string_counter, image_url] elements by image_url - the comparator must return a number, not a boolean 
		for( x = url_array.length-1; x >= 1; x-- ) { 		// Foreach url_array, backwards
			if( url_array[x][1] === url_array[x-1][1] ) { url_array.splice( x, 1 ); }  		// If two URLs in a row match, remove one. Doesn't matter which.
		}
	}

//	html_copy = ""; 		// Debug-ish - garbage collection seems unreliable.

	url_array.sort( function(a,b) { return a[0] - b[0]; } ); 		// given two [string_counter, image_url] elements, sort ascending by string_counter (a comes first when a[0] - b[0] < 0) 
	return url_array; 
}
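

// The video handling above derives an MP4 address from the post's OpenGraph preview-frame filename. 
// A standalone sketch of that derivation, assuming the preview really is named tumblr_<id>_frame<n>.jpg 
// and that vt.tumblr.com serves the matching MP4 - both assumptions the code above already makes. Not wired in anywhere. 
function video_url_from_preview( preview_url ) { 
	var filename = preview_url.substring( preview_url.lastIndexOf( '/' ) + 1 ); 		// e.g. tumblr_abcdef12345_frame1.jpg 
	var base = filename.substring( 0, filename.indexOf( '_', 8 ) ); 		// Skip the underscore in 'tumblr_', cut at the next one -> tumblr_abcdef12345 
	return "https://vt.tumblr.com/" + base + ".mp4#video"; 		// Tagged so browse mode can treat videos separately 
} 
// e.g. video_url_from_preview( "http://media.tumblr.com/xyz/tumblr_abcdef12345_frame1.jpg" ) 
//   -> "https://vt.tumblr.com/tumblr_abcdef12345.mp4#video" 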






// Returns a URL with all the options_map options in ?key=value format - optionally allowing one key-value pair to change in that URL (worked example after the function) 
function options_url( key, value ) {
	var copy_map = new Object(); 
	for( var i in options_map ) { copy_map[ i ] = options_map[ i ]; } 		// In any sensible language, this would read "copy_map = options_map." Javascript assignment only copies the reference, so we copy key by key. Fuck's sake. 
	if( key ) { 		// the parameters are optional. just calling options_url() will return e.g. example.tumblr.com/archive?ezastumblrscrape?startpage=1
		if( !value ) { value = false; } 		// if there's no value then use false
		copy_map[ key ] = value; 		// change this key, so we can e.g. link to example.tumblr.com/archive?ezastumblrscrape?startpage=2
	}

	// Construct URL from options
	var site = window.location.href.substring( 0, window.location.href.indexOf( "?" ) ); 		// should include /archive, but if not, it still works on most pages
	for( var k in copy_map ) { 		// JS associative maps are weird. We're actually setting attributes of a generic object. So options_map[ "thumbnails" ] is the same as options_map.thumbnails.
		if( copy_map[ k ] ) { 		// Unless the value is False, print a ?key=value pair.
			site += "?" + k; 
			if( copy_map[ k ] !== true ) { site += "=" + copy_map[ k ]; }  		// If the value is boolean true, the bare ?key acts as a flag; otherwise append =value. 
		}
	}
	return site; 
}
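
// Worked example of the format options_url() produces, using made-up values for options_map (key order may vary by engine): 
// given options_map = { ezastumblrscrape: true, startpage: 2, pagesatonce: 10, thumbnails: true, leavedupes: false } 
// on http://example.tumblr.com/archive?ezastumblrscrape?startpage=2, 
// options_url( "startpage", 3 ) would return something like 
//   http://example.tumblr.com/archive?ezastumblrscrape?startpage=3?pagesatonce=10?thumbnails 
// - false values are dropped, boolean true prints as a bare ?flag, everything else prints as ?key=value. 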