AI Generated Hacker News Comments Project Breakdown
I've been scraping Hacker News comments for another project, and since I already had the data, I figured I would use it to ship a quick project.
The idea is simple: I pretty much only read the comments on HN, almost never the actual article (ain't nobody got time for that). By training an AI to generate comments, you can skip reading the article for any story you want, not just the ones that get a lot of comment activity.
Originally I envisioned having the model fetch the URL and read the actual article, or a summary of it, but in the interest of shipping as close to within a weekend as possible, the model is conditioned only on the title and URL. Previous work (Salesforce CTRL) has shown that you can generate reasonable news articles from the URL alone, so it's not far-fetched to expect reasonable comments from the title/URL alone.
Tech Stack
Django + DRF for backend
React for frontend (I previously used Django + DRF + React, with additional Daphne/Channels for websockets, for Turnbase)
gpt-2-simple (I have previously used this to generate, uh, interesting works of literature)
Pipeline
Most people are naturally attracted to the modeling aspects of machine learning, but to deploy a production ML system you need to think about/work on the whole pipeline. For this project, I probably spent less than 10% of the total work time finetuning the model. Almost all of the work is in data processing and building the UI.
Data acquisition
This is straightforward as HN has a pretty clean API, but it did take a few days to scrape (I didn't count this as part of the project time since nominally it was for another project).
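For reference, a minimal sketch of this kind of scraper, assuming the official HN Firebase API and the requests library; the item range and output path are placeholders:

```python
import json
import requests

API = "https://hacker-news.firebaseio.com/v0"

def fetch_item(item_id):
    """Fetch a single HN item (story, comment, job, ...) as a dict, or None if missing."""
    resp = requests.get(f"{API}/item/{item_id}.json", timeout=10)
    resp.raise_for_status()
    return resp.json()  # the API returns null for nonexistent ids

def scrape(start_id, end_id, out_path):
    """Walk item ids sequentially and dump them to a JSON-lines file."""
    with open(out_path, "w") as f:
        for item_id in range(start_id, end_id):
            item = fetch_item(item_id)
            if item is not None:
                f.write(json.dumps(item) + "\n")
```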
Data cleaning/processing
I originally kept each HN item as its own text file (because I didn't want to bother escaping newlines), but this turned out to make things too slow. As it turns out, it's not good to have a huge number (23 million) of very small files. So I paged them together, a few hundred thousand items at a time.
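A rough sketch of the paging step, assuming the items are JSON dicts; the page size and file naming are made up for illustration:

```python
import json

PAGE_SIZE = 200_000  # items per page file (illustrative)

def write_pages(items, out_dir):
    """Group an id-ordered iterable of HN item dicts into larger page files."""
    page, page_num = [], 0
    for item in items:
        page.append(item)
        if len(page) >= PAGE_SIZE:
            _flush(page, out_dir, page_num)
            page, page_num = [], page_num + 1
    if page:
        _flush(page, out_dir, page_num)

def _flush(page, out_dir, page_num):
    with open(f"{out_dir}/page_{page_num:05d}.jsonl", "w") as f:
        for item in page:
            f.write(json.dumps(item) + "\n")
```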
Then I wrote structures for tree traversal so that the entire comment chain could be displayed. A child of a particular comment might be stored on a different page depending on how much time elapsed between them, so to build out the tree for a particular root, you might need to seek ahead through multiple pages. You can't keep all of the data in memory (maybe you could; I couldn't because my machine wasn't beefy enough), only a few pages at a time, so I had to write a caching mechanism for this. It took longer than expected.
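A sketch of the page cache plus tree building, continuing the hypothetical page layout above; the id-to-page mapping and eviction policy here are simplifications rather than the actual implementation:

```python
import json
from collections import OrderedDict

PAGE_SIZE = 200_000
MAX_CACHED_PAGES = 4  # only a few pages fit in memory at once

class PageCache:
    """Loads page files on demand and evicts the least recently used one."""

    def __init__(self, page_dir):
        self.page_dir = page_dir
        self.pages = OrderedDict()  # page_num -> {item_id: item}

    def _load(self, page_num):
        with open(f"{self.page_dir}/page_{page_num:05d}.jsonl") as f:
            return {item["id"]: item for item in map(json.loads, f)}

    def get(self, item_id):
        page_num = item_id // PAGE_SIZE  # simplification: assumes densely packed ids
        if page_num not in self.pages:
            if len(self.pages) >= MAX_CACHED_PAGES:
                self.pages.popitem(last=False)  # evict the least recently used page
            self.pages[page_num] = self._load(page_num)
        self.pages.move_to_end(page_num)
        return self.pages[page_num].get(item_id)

def build_tree(cache, root_id):
    """Recursively assemble a comment tree, which may span several pages."""
    item = cache.get(root_id)
    if item is None or item.get("dead") or item.get("deleted"):
        return None
    children = [build_tree(cache, kid) for kid in item.get("kids", [])]
    return {"item": item, "children": [c for c in children if c is not None]}
```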
Cleaning was not too bad as the data is mostly clean. I mostly just filtered out dead comments.
It was not clear what the best way to format the final training text files was. How do you let the model know that a comment is a child of another? Do you write out all the children of a root item (a story) in a nested format (so that each story gets one sample in the training file), or do you write it out one reply chain at a time (so that each story generates multiple samples, which you then need to cap to make sure the most popular stories aren't overrepresented), etc.?
In the end I used <|c|> and <|ec|> tokens to start/end a children block, sorted the children (the HN API gives you the rank but not the score for children), and limited each item to 10 children (so in theory we only get the best comments) with no limit on depth (so comment chains can go as long as they want). In theory, the model should also learn the distribution of how many replies an item is likely to get this way (with the cap of 10 slightly modifying the distribution). The whole thing is dumped into a single file with <|endoftext|> delimiting stories.
Note: hilariously, I found a bug with how <|endoftext|> is used that may explain some previous weirdnesses I'd seen with gpt-2-simple.
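To make the format concrete, here is a sketch of the serialization; only the <|c|>/<|ec|> delimiters, the 10-child cap, and the <|endoftext|> separator come from the description above, while the exact layout of titles, URLs, and comment text inside each block is guesswork:

```python
MAX_CHILDREN = 10

def render(node):
    """node: {"item": {...}, "children": [...]} as produced by the tree builder."""
    item = node["item"]
    body = item.get("title") or item.get("text") or ""
    url = item.get("url", "")
    out = f"{body} {url}".strip()
    children = node["children"][:MAX_CHILDREN]  # keep only the top-ranked replies
    if children:
        out += "<|c|>" + "".join(render(c) for c in children) + "<|ec|>"
    return out

def write_corpus(trees, path):
    """Dump every story tree into one training file, stories delimited by <|endoftext|>."""
    with open(path, "w") as f:
        for tree in trees:
            f.write(render(tree) + "\n<|endoftext|>\n")
```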
Model training
I kept things basic and didn't spend too much time tuning parameters. I think you need MLflow or similar if you want to do any real tuning, otherwise you end up with a bunch of models with names like hn_model_lr_0001_final2_noclip, but I didn't attempt setting it up for this project. I used the 355M (medium) GPT-2 model, trained on a p3.2xlarge.
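The finetuning itself is only a few lines with gpt-2-simple; the step count, run name, and corpus filename below are illustrative rather than the exact values used:

```python
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="355M")  # the medium model

sess = gpt2.start_tf_sess()
gpt2.finetune(
    sess,
    dataset="hn_corpus.txt",  # the single training file described above
    model_name="355M",
    steps=10000,              # illustrative
    run_name="hn_comments",
    save_every=1000,
    sample_every=500,
)
```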
Model deployment + serving
The simplest architecture would be to keep a copy of the model on the same machine that serves the website. Whenever someone submits a new story that they want AI-generated comments for, the backend would invoke the model. You wouldn't want to invoke it right away, though; you need to put it in a queue and run it in a separate thread, or the website will freeze whenever the model is thinking.
I was a cheapskate and the machine that serves the website is a t2.micro, so inference wasn't even possible there as it OOMs.
I investigated whether you could get by with a t2.large. You can do inference (no OOM), but on CPU it's something like 40-60x slower than a p3.2xlarge (on which inference takes about 30s, or more like 50s if you include the overhead of loading the model). I felt that would make the user experience too poor.
So instead, there are two machines (website and inference). The website's DB acts as a container for the queue. There is an endpoint that broadcasts whether there are stories awaiting generation. A script continuously checks that endpoint; when there are stories, it SSHes into the second machine and kicks off a second script that lives on that machine to do inference (this way I didn't have to set up endpoints on the second machine, saving myself some work, although that SHOULD be fairly straightforward with an MLflow workflow). This also means that inference can be batched, which is a bit more efficient given the overhead of loading the model and setting things up.
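A rough sketch of that glue script; the endpoint URL, host, remote script path, and polling interval are placeholders since the post doesn't spell them out:

```python
import subprocess
import time

import requests

QUEUE_URL = "https://example.com/api/pending/"  # placeholder endpoint on the website
INFERENCE_HOST = "ubuntu@inference-box"         # placeholder inference machine
REMOTE_SCRIPT = "~/hn_infer/run_inference.sh"   # placeholder script on that machine

def pending_stories():
    resp = requests.get(QUEUE_URL, timeout=10)
    resp.raise_for_status()
    return resp.json().get("pending", 0)

while True:
    if pending_stories() > 0:
        # Kick off batched inference over SSH; the remote script reads the queue,
        # generates comments for every waiting story, and posts the results back.
        subprocess.run(["ssh", INFERENCE_HOST, REMOTE_SCRIPT], check=True)
    time.sleep(60)  # poll once a minute (illustrative)
```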
It's really janky, but it works, and it's cheap. You could set it up so that the second machine is turned off most of the time and only spun up for inference and then shut down (adding slightly to the overhead per inference). It works out pretty well since inference is on the order of a minute, which is the minimum time block AWS will bill you for. I just turn it on manually whenever there are things in the queue and I'm paying attention, which is almost never.
Tips/Summary
This post has gone on long enough, so I'll just quickly summarize with a few tips:
Think about your whole pipeline. You need tooling for the entire pipeline so that you can iterate quickly (your data format may change, your encoding may change).
For gpt2 specifically, encode your data before you finetune (see the sketch after this list).
Inference costs need to be thought about. I was kind of expecting that most of the cost would be in training the model, so it was an unwelcome surprise to discover that I needed a machine as powerful as the training machine to do inference. In fact, for typical production models, inference costs should far outweigh training costs (unless you're updating your model constantly), so it's much more important to make sure that inference is efficient and economical. My interest in the distill* family of algorithms has gone up since this project.
Funnily enough, gpt3 came out the day after I shipped this project. I don't want to think about deploying that in a production system.
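Regarding the encoding tip above: gpt-2-simple can pre-encode the training text into a compressed .npz once, so repeated finetuning runs skip the slow tokenization step. A minimal example (filenames are illustrative):

```python
import gpt_2_simple as gpt2

gpt2.encode_dataset(
    "hn_corpus.txt",
    out_path="hn_corpus_encoded.npz",
    model_name="355M",
)
# Pass the encoded .npz to finetune() as the dataset instead of the raw text file.
```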
Addendum (2020-07-11)
I tried deploying a GPT-2 model with MLflow. MLflow can build a Docker image for you with the model, artifacts, and any necessary libraries packaged inside. I've used it to deploy simpler models before and prefer this workflow because I don't want to mess with setting up conda, and conceptually it's clean. However, I ran into some kind of CUDA error. This seems to be an issue with TensorFlow + MLflow's build-docker specifically.