Yahoo Groups metadata processing
Sep. 29th, 2020 11:12 pm
I've been working on processing Yahoo Groups metadata for ages (something like seven months now), and probably few of you actually care what all that has involved, but I decided to take a break from it this evening to type up the process for any data processing geeks out there.
Relevant Vocabulary:
ArchiveTeam = Often abbreviated to AT, a loose group of individuals who work to save data from websites before it is deleted and store it on the Internet Archive. They are NOT actually affiliated with the Internet Archive, though!
GMD = Get My Data, zip files Yahoo provided of group messages, files, and links (but almost never photos) to anyone who was joined to a group.
checker = A Python script written by an ArchiveTeam volunteer to grab a set of metadata from Yahoo for a group. I was soon given (by another AT volunteer) a short Perl script that would take a txt file of group names, feed them to the checker one by one, and add each resulting line of info to a tsv file. The data grabbed by the checker included the HTTP status (200/401/404 = active/private/dead), the membership setting (OPEN, RESTRICTED, CLOSED), and a handful of other bits of info such as whether the messages were moderated or not.
descriptions = My shorthand for the group metadata grabbed with the Python script that the ArchiveTeam (and a few of us fandom project members) used to pull data from Yahoo before Dec. 15, when Yahoo shut down user access to archives, files, and photos. After that date, the only part of the script that still worked was the flag that pulled the metadata. The metadata came as two json files (about.json and statistics.json), and the script also produced an archive.log recording the timestamp of the grab alongside the record of what was actually pulled. All of this was dumped into a folder named after the group, exactly as it was requested from Yahoo.
PurpleSym json = A massive file (or two, really) put together by an ArchiveTeam volunteer that contains tons of metadata about various groups, collected over varying dates that range somewhere between 2014 and late 2019.
Important Info to Remember:
I run Linux as my primary OS. Linux is case-sensitive. A folder called MyFolder and one called myfolder can happily coexist in the same umbrella folder and are treated as distinct folders.
Yahoo is NOT case-sensitive. Requesting information (from my Linux system) on a group called MyGroup and on one called mygroup would retrieve the same information (because it's only one group), duplicated in two folders each with their respective title. (The only real difference was a randomly generated string that was prefixed to the description field in the individual json files that came with each group's metadata. This made it easy to distinguish groups that were completely duplicated on the spreadsheet from ones that had the same group ID but were slightly different in terms of capitalization.)
Background Info:
The Yahoo Groups Fandom Rescue Project saved over 300,000 fandom groups. The ArchiveTeam saved multiple hundreds of thousands more groups, some of which were also fandom. I was one of the few fandom project volunteers who ran Linux and could handle the scripts easily, and I was also the only member of the fandom project who was on IRC all the time. (Yuletide might have moved to Discord but I still love IRC - and that happens to be where the ArchiveTeam coordinates their projects from.) This meant I was able to pick up a lot of useful terminal commands and scripts such as the checker (built for me specifically) from ArchiveTeam volunteers. One other project member (onceamy) did end up using some of those tools and building scripts to help me (taking some of the work off my shoulders), but many tasks had to stay with me, as you will see.
The vast majority of fandom project target groups were first run through the checker and dumped onto spreadsheets, from which there was some filtering. In the very early stage, most group recommendations came from fandom in general and from individuals browsing the directories. By the middle of the project, however, we were getting the next set of groups to join from searches of the PurpleSym json (for various fandom keywords), plus groups mentioned in the links sections of groups whose data we had just gotten back. Since many groups had been invaded by porn spambots, those links were frequently cluttered with Adult groups, only a few of which were valid fandom ones. (I was fortunately not the project member who had to go through those groups to identify ones worth joining and saving, lol.) Any NON-spam links were created by actual human beings - who did not always remember which capitalization was the valid one for their group. (I don't blame them, in some cases - you would boggle at the creative caps variations on some groups' names, and wonder how anyone could remember those!) As a result, the spreadsheets created by running this information through the checker had a good number of groups with incorrect capitalization. Since Yahoo didn't care about case, this didn't affect grabbing the correct metadata (which included a field with the actual group name, caps and all) - but the spreadsheet entries didn't match the way the groups had actually been created.
As February came along, and with it the deadline for requesting data from Yahoo, the ArchiveTeam volunteers began to provide us with lists of groups from their GMDs (as we did the same for them) in an effort to avoid too much duplication. (Had we worked together more closely from the beginning, we might have been more efficient, but hindsight is 20/20, as they say.) At the same time, I moved all of the group descriptions I had for the fandom groups up to this point into a single folder and then began grabbing the descriptions for all of the AT GMD groups as well. (I made a point of never trying to go into that folder with a GUI program - it would have keeled over and crashed, lol. By the end I had nearly a million folders in that single folder!) When we finally got the last lists of groups, I constructed a very simple script (with some help, it was my first script ever, lol) to move all the groups into folders by two-letter pairs. So Aa, Ab, Ac, Ad, etc. were all in their own subfolders. (Groups beginning with _ and - and any numbers went under a single umbrella folder with subfolders for each character, and I had separate subfolders for letter+punctuation and letter+number for each of the 26 letters, such as A_- and A0-9.) Some letter pairs hardly needed it - I believe Ux or some similarly unusual pair had only about 12 or 13 group description folders - but the largest, Th, had ~30,000 group description folders in it. (Keep in mind that this makes for 740 different folders to organize the group descriptions in. So every time I say I had to do something for each two-letter pair… I had to do that task 740 times!)
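(For the data processing geeks: I'm not including the actual script here, but a minimal sketch of the sorting idea - written in Python for illustration, with made-up folder names - might look roughly like this:)

```python
# Sketch only, NOT the original script: sort group-description folders
# into two-character buckets like Aa, Ab, ..., A0-9, A_-.
# The "all_groups"/"sorted_groups"/"misc" names are made up for illustration.
import os
import shutil

SRC = "all_groups"       # hypothetical flat folder holding every group folder
DST = "sorted_groups"    # hypothetical destination root

def bucket_for(name: str) -> str:
    """Return the bucket (subfolder path) a group name belongs in."""
    first = name[0]
    if not first.isalpha():
        # _ , - and digits go under one umbrella folder,
        # with a subfolder per leading character
        return os.path.join("misc", first)
    second = name[1] if len(name) > 1 else ""
    if second.isalpha():
        return first.upper() + second.lower()    # e.g. "Aa", "Th"
    if second.isdigit():
        return first.upper() + "0-9"             # e.g. "A0-9"
    return first.upper() + "_-"                  # letter + punctuation

for folder in os.listdir(SRC):
    dest = os.path.join(DST, bucket_for(folder))
    os.makedirs(dest, exist_ok=True)
    shutil.move(os.path.join(SRC, folder), os.path.join(dest, folder))
```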
And then the real metadata processing could begin, lol. Since no one else had bothered to grab descriptions for so many groups like this (the AT focused on getting GMDs and basically abandoned the script once they couldn't use it to grab files and photos), and so much was going to require pulling data out of the actual group jsons, most of the work couldn't be outsourced to anyone else, since I was the only one with the folders on my hard drive. You may wonder why I didn't try to pack them all up and send them somewhere else - but some of the stages below, as you will see, required me to edit within those folders somewhat, so there really was no other choice.
Metadata Processing Steps:
I sat down and planned out all the stages of processing the data, logically thinking through the order they would need to happen. Here's what I had to do:
Before I started, I separated all of the spreadsheet groups by letter, splitting some to keep them a reasonable size. Most were fine as a single spreadsheet, but A, C, M, P, and T needed two, and S needed three. The punctuation groups made for one very tiny spreadsheet - groups beginning with punctuation all dated from before 2000, probably grandfathered in from a previous email list service such as Egroups or Onelist - and the numbers were also relatively small. The end result is 35 different spreadsheets of varying sizes, from 260 KB to 62.6 MB (1.1 GB total).
Stage 1: If you'll recall, I had tons and tons of group descriptions that weren't on the spreadsheet at all, because I had gotten them from the AT GMD lists and never tossed them through the checker. The solution was to generate lists from both the spreadsheet and the folders for each two-letter pair, compare those lists to make a list of groups that I had descriptions for but that were NOT on the spreadsheet, and then run the resulting list of groups through the checker to add them to the spreadsheet. Running lists through the checker took some time, since if Yahoo hiccuped, it would return 500 or 503 for some groups, and one would have to grab those few and run them through separately to a different tsv and then merge once they came back as 200/401/404. Another part of this was running a command to generate the full group URL for the group (as Yahoo provided the URL prefix, to which you just had to add neo/groups/, insert the group name, and then add /info). This was another command given to me by an AT volunteer (yay regular expressions). I outsourced both the checker and URL parts but did the list comparisons myself. This stage took something like a month. (I'd have to go back to the mod convos to check dates I started/finished to be sure.)
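(To make the stage 1 work concrete, here's a rough sketch of the list comparison and the URL construction - not the actual commands; the file names and the exact URL prefix value are placeholders:)

```python
# Sketch of the stage 1 idea, not the real commands.
# Assumes one group name per line in each text file; file names and the
# URL prefix value below are placeholders, not the real ones.

def read_names(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

on_spreadsheet = read_names("Aa_spreadsheet.txt")   # hypothetical export from the spreadsheet
have_folders   = read_names("Aa_folders.txt")       # e.g. made with: ls Aa/ > Aa_folders.txt

# Groups we have descriptions for but that never went through the checker
missing = sorted(have_folders - on_spreadsheet)

# Build the full group URL the way described above:
# prefix + "neo/groups/" + group name + "/info"
PREFIX = "https://groups.yahoo.com/"    # assumed prefix; the real one came from the checker output
for name in missing:
    print(f"{name}\t{PREFIX}neo/groups/{name}/info")
```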
Stage 2: The checker only pulled half the data we wanted. We needed the actual group description, language code, whether it was flagged as adult or not, the category path, number of members, etc. These had to be pulled from the group description jsons. So my next step was going into each two-letter pair folder, creating a list of the groups on the spreadsheet for that pair, then running a command to pull all that data from each group's jsons and dump it into a tsv I could copy/paste into the spreadsheets. This also took a bit of time, probably another month.
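(A hedged sketch of that extraction, in Python rather than the commands actually used - note that the JSON key names here are my guesses, not necessarily the real about.json/statistics.json fields:)

```python
# Sketch of pulling stage 2 fields out of each group's jsons into a tsv.
# The JSON key names below are guesses; the real about.json/statistics.json
# layouts would need to be checked against an actual grab.
import csv, json, os, sys

pair_dir = sys.argv[1]    # e.g. "Aa" - one two-letter-pair folder
out = csv.writer(sys.stdout, delimiter="\t")

for group in sorted(os.listdir(pair_dir)):
    try:
        with open(os.path.join(pair_dir, group, "about.json"), encoding="utf-8") as f:
            about = json.load(f)
        with open(os.path.join(pair_dir, group, "statistics.json"), encoding="utf-8") as f:
            stats = json.load(f)
    except FileNotFoundError:
        continue    # folder without a complete grab
    out.writerow([
        group,
        about.get("name", ""),           # key names are assumptions
        about.get("description", ""),
        about.get("language", ""),
        about.get("adult", ""),
        about.get("categoryPath", ""),
        stats.get("memberCount", ""),
    ])
```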
Stage 3: A decent number of groups went dead (or private) between when I grabbed the descriptions and when they were run through the checker in stage 1. These groups had the second half of the data (pulled in stage 2) but were missing the first half. Fortunately all of that data was in the jsons as well, and I was able to use autofilter on the spreadsheets in a major way to identify them, pull the data (broken up into two-letter blocks, but only for the two-letter pairs that were represented, which was only some of them), and then merge it back onto the spreadsheets. In a few cases I had the opposite problem, having the first half of the data from the checker but having never grabbed descriptions for the groups in question, so I grabbed those then, as far as possible (some were dead by that point), and added their data in.
Stage 4: Remember the trouble with capitalization that I mentioned before? I knew full well that I had duplicates on the spreadsheets, but fortunately I could look at the name field (pulled from the json in stage 2) to identify which was the correct one. In this stage I had to dump lists of names into a spreadsheet just for comparison purposes, sort it in such a way that it would filter all the dupes to the top, and go through them one by one to find them on the spreadsheet, figure out if they were complete dupes (I had some that were fully duplicated, caps and all, on the spreadsheet two or even three or four times) or dupes except for the caps, and in the latter case identify the correct capitalization. Once the correct capitalization had been identified for the ones with different caps, I would go to the group description folders, keep the earliest grab, and make sure the folder carried the correctly capitalized name. This often involved copying a folder name, deleting the folder, and overwriting the name on the other folder. The worst part of this stage was that it was tedious enough that I wanted to listen to a talk of some kind, but the only thing I could have running in the background was familiar music, because I had to pay such close attention to what I was doing with each duplicate. This also took probably around a month or so.
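(A quick illustration of the dedup idea - grouping names case-insensitively so exact dupes and caps-only dupes land together. The input file name is made up, and the actual work was done by sorting in the spreadsheet rather than with a script:)

```python
# Illustration only: group names by lowercase form so that exact dupes
# and caps-only dupes end up together. "all_names.txt" is a made-up
# file name for a one-name-per-line export of the spreadsheet.
from collections import defaultdict

variants_by_key = defaultdict(list)
with open("all_names.txt", encoding="utf-8") as f:
    for line in f:
        name = line.strip()
        if name:
            variants_by_key[name.lower()].append(name)

for key, variants in sorted(variants_by_key.items()):
    if len(variants) > 1:
        # could be full duplicates or caps-only variants of one group
        print(key, "->", ", ".join(variants))
```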
Stage 4.5/5: As I went through the duplicates in stage 4 - a very time-consuming and tedious process in which I ended up scrutinizing the spreadsheets more closely than usual - I realized the data was still very, very messy. I had planned on stage 5 being a few simple tasks: converting the date created from Unix time to a normal human-readable date, adding a column with true/false for whether we had a group photo (only available for some of the groups grabbed before Dec. 15), and deleting the useless URL prefix field (now that we had the full URL). However, I realized there were four intermediate stages that needed to happen (which I helpfully called 4.5a, b, c, and d, lol), and ultimately merged the stage 5 tasks into those. This is where I'm at now, having completed everything through G, and being partway through H and I (J onward is still waiting for me to work through these tasks).
- 4.5a: In some cases we had two grabs of group descriptions but the group only appeared on the spreadsheet once, so it wasn't caught with the de-duping task in stage 4. Fixing this involved generating lists from only the folders (which is much easier than from the spreadsheet, since with the folders I can do a simple command to generate the list: ls Aa/ > Aa.txt, for instance). I outsourced the actual duplicate-hunting, but have to take the results and go find the groups in question both on the spreadsheet and in the folders, and do much as I did with the more complicated dupes in stage 4 - identify the correct caps, keep oldest grab but make sure it has the right name. Fortunately not every two-letter pair has groups affected by this and some have only one or two groups affected.
- 4.5b: Remember that we got a bunch of groups from scrounging the links of other groups for new groups to join? Some of the commands used to process that data generated "groups" that never existed (with http: stuck at the end, apostrophes or commas in them, etc.). Also, one stage of the spreadsheet work ended up with a certain number of groups getting a duplicate version added to the spreadsheet with _dupe after the name. So for this stage I send the spreadsheets to my assistant, who runs a script against them to find groups with punctuation in them or _dupe at the end. A very, very tiny number of very old (grandfathered from who knows which list service) groups do legitimately have periods in their names, but in most cases groups with periods never existed either. This process is fairly quick for each letter but varies greatly in what has to be done, as sometimes group folders are affected (and with some punctuation marks, Yahoo simply ignored everything from that mark onward and treated the letters before it as the group name).
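(I haven't included my assistant's script, but a minimal sketch of that screening pass could be as simple as the following - the character set and the labels are my own guesses at what counts as suspicious:)

```python
# Sketch of the 4.5b screening pass, not the assistant's actual script.
# Reads group names from stdin, one per line, and flags suspicious ones.
import re
import sys

SUSPECT = re.compile(r"[^A-Za-z0-9_.-]")   # anything outside a plausible name character set

for line in sys.stdin:
    name = line.strip()
    if not name:
        continue
    if name.endswith("_dupe"):
        print(f"{name}\tdupe-suffix")
    elif SUSPECT.search(name):
        print(f"{name}\tbad-punctuation")
    elif "." in name:
        # a handful of very old groups legitimately contain periods,
        # so these just get flagged for a manual look
        print(f"{name}\tperiod-check-manually")
```

Something like `python flag_names.py < Aa.txt` (a hypothetical file/script name) would then dump the suspicious names for that pair.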
- 4.5c: There are a large number of 401/404 groups on the spreadsheet with no data about them, but many of them existed as recently as a year or two ago - and a small chunk can be found in the PurpleSym json. At this point I do two of the stage 5 tasks (removing the URL prefix and adding the group photo column), and since adding the group photo column requires doing that for each two-letter pair, I decided to add the other task I had recently thought of - adding a column with the date the group descriptions were grabbed. (This was a fairly easy command to create; I was able to modify the other commands I had been given to use head -c 10 on the archive.log file to pull that.) I have to do both columns for every two-letter pair on the spreadsheet first; then I can take the results from searching the json and work with them to create a spreadsheet of data I can import into the main spreadsheet. This is even more complicated than it sounds, since the descriptions (and sometimes the trailer column, another field Yahoo had) in the json frequently come with line breaks in them, which requires copying/pasting into a WYSIWYG HTML editor, flipping to the HTML to replace the br tags with \n to match Yahoo's use of newline formatting, and then overwriting the cell with the reformatted content. (There are generally 100-300 such groups in each reasonable-sized spreadsheet - ones like Q or Z will have far fewer - but it varies whether just the description has to be reformatted, just the trailer, both, or neither.) Also, as I create the two-letter pair lists to run the commands for the group photo and info date columns, I paste the spreadsheet lists into separate text files (and also generate a fresh set of lists for the description folders), just so I can outsource those for one last set of comparisons, to make sure I find any group folders accidentally overlooked or missed in stage 4 or 4.5a. And once again, the actual comparisons of the lists can be outsourced, but I have to deal with the results. Fortunately, since this is basically looking for any mistakes I made or things I forgot to do, there's not all that much to fix.
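(Two of those 4.5c fiddly bits, sketched in Python purely for illustration - the grab-date pull that mirrors head -c 10 on archive.log, and the br-to-\n cleanup that I currently do by hand in an HTML editor. The paths, example values, and helper names are made up:)

```python
# Illustration of two 4.5c helpers; not the actual commands/workflow.
import os
import re

def grab_date(group_dir: str) -> str:
    """First 10 bytes of archive.log - i.e. the date the descriptions were grabbed,
    assuming the log starts with a YYYY-MM-DD timestamp."""
    with open(os.path.join(group_dir, "archive.log"), encoding="utf-8") as f:
        return f.read(10)

def fix_breaks(text: str) -> str:
    """Turn <br>, <br/>, <br /> into the \\n newlines Yahoo itself used."""
    return re.sub(r"<br\s*/?>", "\n", text, flags=re.IGNORECASE)

# Examples with made-up values:
# grab_date("Aa/aardvark_fans")                      -> "2019-12-10"
# fix_breaks("Line one<br>Line two<br />Line three") -> "Line one\nLine two\nLine three"
```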
- 4.5d: For this stage I once again outsource a task: I send the spreadsheet to have a script run against it that compares the main name column with the one from the json, to identify groups that have only one name on the spreadsheet and only one description folder - but both with the wrong caps. Although the list can be fairly long (up to 100 names), this is at least an easy fix, as all I have to do is copy the correct name from the true-name column and paste it over the main name column and the folder name.
Either in 4.5c or (if I forgot then) in 4.5d I then fix the date for every group. This is pretty simple - I have the formula =CELL/86400+25569 to convert Unix time to normal dates, set up in a spreadsheet to run to over 12,000 rows, and I just paste in a block of rows into that spreadsheet, copy/paste back the result, and format the affected cells to show the date as I want. 3-5 sets of this and I'm done with that, so it goes quickly.
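(For anyone wondering where that formula comes from: 86400 is the number of seconds in a day, and 25569 is the spreadsheet serial number for 1970-01-01, i.e. the number of days from the spreadsheet epoch of 1899-12-30 to the Unix epoch. A quick Python check of the same arithmetic:)

```python
# Quick check of the =CELL/86400+25569 spreadsheet formula.
from datetime import datetime, timedelta, timezone

def unix_to_serial(ts: int) -> float:
    return ts / 86400 + 25569          # seconds -> days, then shift to the 1899-12-30 epoch

def serial_to_date(serial: float) -> datetime:
    return datetime(1899, 12, 30, tzinfo=timezone.utc) + timedelta(days=serial)

ts = 946684800                              # 2000-01-01 00:00:00 UTC
print(unix_to_serial(ts))                   # 36526.0
print(serial_to_date(unix_to_serial(ts)))   # 2000-01-01 00:00:00+00:00
```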
And at this point I'm essentially done - except the info pulled in 4.5c doesn't have group member numbers, so I actually dump the names and URLs of the affected groups into a Google spreadsheet where I have a volunteer checking the URLs in the Wayback Machine to see if they can find member numbers for any of them. So I'll have to merge that data back, but I'm not doing that until I'm finished with all of the spreadsheets for the above tasks. At that point I will finally be done processing the Yahoo Groups metadata, and then it can be uploaded so people can use the information. The metadata, at least, was publicly available for the most part, and it violates no privacy issues to share it, so it should definitely be publicly available for everyone to see and be able to use.
You see why I am so tired of working on these spreadsheets, lol. But I have to remind myself that I've got probably less than a month left at this point… So close!