Welcome! The goal of this repo is to learn new things from HackerNews without having to read everything.
Observations
Smart quotes in hn_comment_nouns_common
$ grep '”' hn_comment_nouns_common
126179 ”
There are 126,000+ quotations on HackerNews using smart quotes. I find this deeply disturbing.
Intel vs AMD
First we look at HN stories (the posts that people comment on, not the comments themselves). Stories include titles and text (what Reddit calls selftext).
$ rg 'Intel|AMD' hn_story_nouns_common
3921 AMD
5477 Intel
ripgrep treats hn_comment_nouns_common as a binary file, so here I’m using grep instead (see the aside below):
$ grep --binary-files text -E 'Intel|AMD' hn_comment_nouns_common
49167 AMD
87458 Intel
And Intel is mentioned even more in the comments (probably because AMD makes programmers’ lives easier so there is no need to complain)!
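Aside: ripgrep’s -a/--text flag disables binary-file detection, so something like this should also work (untested here):
$ rg -a 'Intel|AMD' hn_comment_nouns_common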
AI vs Blockchain
$ grep --binary-files text -E 'AI|blockchain' hn_comment_nouns_common
68368 blockchain
160277 AI
I’d guess 99% of blockchain references are either people complaining about the inefficiencies of blockchain as a database or scammy spam.
Sun vs Moon
$ grep --binary-files text -E 'sun|moon' hn_comment_nouns_common
30905 moon
36100 Samsung
44937 sun
This is closer than I thought it would be!
$ grep --binary-files text -E 'Apollo$|Helios$|Ra$' hn_comment_nouns_common
38 Sun Ra
49 React+Apollo
230 The Apollo
278 Helios
762 LoRa
7299 Apollo
But I’m more surprised that there are only 0 or 1 mentions of the sun god Ra (unique values were filtered out of the _common files, so a single mention wouldn’t show up here).
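To double-check for a standalone Ra entry (the anchored pattern is mine; assuming the file’s "count value" line format, no output means Ra never cleared the threshold):
$ grep --binary-files text -E '^ *[0-9]+ Ra$' hn_comment_nouns_common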
Feel free to dig into the data and open a PR: edit this README to add more observations.
Watch HackerNews from your CLI
Explore 39 years of content without leaving your teletype machine!
pip install xklb
wget https://github.com/chapmanjacobd/hn_mining/raw/main/hackernews_only_direct.tw.db
library watch hackernews_only_direct.tw.db --random --ignore-errors
$ lb pl hackernews.tw.db -a
╒════════════════════════╤═════════════════╤═════════════════════════════════╤═══════════════════╤════════════════╕
│ path                   │ duration        │ avg_playlist_duration           │ playlists_count   │ videos_count   │
╞════════════════════════╪═════════════════╪═════════════════════════════════╪═══════════════════╪════════════════╡
│ Aggregate of playlists │ 39 years, 2     │ 4 days, 14 hours and 58 minutes │ 3098              │ 741500         │
│                        │ months, 27 days │                                 │                   │                │
│                        │ and 20 hours    │                                 │                   │                │
╘════════════════════════╧═════════════════╧═════════════════════════════════╧═══════════════════╧════════════════╛
This is what I mean by 39 years of content: 39 years of video running 24/7 (not including 62,876 videos [~8%] where duration is unknown).
$ lb wt hackernews_only_direct.tw.db -pa
╒═══════════╤═════════════════╤══════════════════════════╤════════╤═════════╕
│ path      │ duration        │ avg_duration             │ size   │ count   │
╞═══════════╪═════════════════╪══════════════════════════╪════════╪═════════╡
│ Aggregate │ 18 years, 3     │ 1 hour and 22.92 minutes │        │ 115987  │
│           │ months, 17 days │                          │        │         │
│           │ and 22 hours    │                          │        │         │
╘═══════════╧═════════════════╧══════════════════════════╧════════╧═════════╛
hackernews_only_direct.tw.db is a subset (including only direct URLs; excluding playlist URLs). It is a bit smaller but still indexes over 18 years of content. edit: only 5 years of that content has an HN score above 7.
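For example, to watch only that higher-scored subset (the score column is added by the recipe further down; the threshold is arbitrary):
library watch hackernews_only_direct.tw.db -w 'score > 7' --random --ignore-errors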
Zenodo vs GitHub TubeWatch database
I had to remove some records so that the file would fit under GitHub’s file size limit. See HN.tw on GitHub for technical details.
The original was uploaded to Zenodo.
NB: The Zenodo version does not contain all metadata (subtitles, etc.) either. For that, after downloading either the Zenodo or GitHub version, you would need to run:
library tubeupdate --extra hackernews.tw.db  # this will likely take several days
library optimize hackernews.tw.db  # optional: this will build an fts index
Recipes
Prep
pip install xklb aiohttp
library hnadd hn.db -v # takes a few days to download 34 million records but you can press ctrl+c when your patience runs out
For reference, it took 22 minutes to catch up to the latest 150,000 comments and stories since the last time I ran it. The file is 24GB now.
I was able to compress the whole database down to 11GB using zstd. You can download it here via Zenodo.
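After downloading, decompress it with zstd (the .zst filename is an assumption):
unzstd hn.db.zst  # or: zstd -d hn.db.zst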
hn_story_nouns / hn_comment_nouns
function domains
    awk -F / '{l=split($3,a,"."); print (a[l-1]=="com"?a[l-2] OFS:X) a[l-1] OFS a[l]}' OFS="." $argv
end
cat hn_story_urls | domains | asc > hn_story_common_domains
sed -i -e 1,(sed '/^ 2 /Q' hn_story_common_domains | count)d hn_story_common_domains # remove boring unique values
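That’s fish shell; asc and count look like local helpers. A rough bash equivalent, assuming asc means sort | uniq -c | sort -n and count means wc -l:
domains() { awk -F / '{l=split($3,a,"."); print (a[l-1]=="com"?a[l-2] OFS:X) a[l-1] OFS a[l]}' OFS="." "$@"; }
cat hn_story_urls | domains | sort | uniq -c | sort -n > hn_story_common_domains
sed -i "1,$(sed '/^ 2 /Q' hn_story_common_domains | wc -l)d" hn_story_common_domains # remove boring unique values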
HackerNews TubeWatch database
JWPlatform, Imgur, CBSNewsEmbed, CNN, Viidea, etc. did not seem to work after a few tests:
lb-dev watch ~/github/xk/hn_mining/hackernews.tw.db -w "ie_key = 'JWPlatform'"
https://www.businessinsider.com/australia-face-scan-porn-home-affairs-internet-2019-10
Player exited with code 2
- ie_key WSJ and FoxNews loaded nearly every time, but it was always a video that didn’t relate to the article lol… NBCNews also suffered from this sometimes
- ie_key Twitter would often hang or not have a video, but I did see this really cool thing! edit: link removed lol. The URL pointed to something else; same problem as above
- ie_key WashingtonPost and CBCPlayer do not save a valid URL id, so all of that data was impossible to pipe to mpv; ie_key iheartradio as well, to some extent
- ie_key SafariApi just seemed to point to O’Reilly books
- holy cow, Batman: CCC and InfoQ videos load fast. They must have a 16-core PC with a 10-bit pipe?
DELETE FROM media WHERE ie_key IN ('JWPlatform', 'Imgur', 'WSJ', 'Twitter', 'WashingtonPost', 'CBSNewsEmbed', 'CNN', 'SafariApi', 'Viidea', 'NBCNews', 'FoxNews', 'NineCNineMedia', 'Mixcloud', 'CBCPlayer', 'LinkedIn', 'AmazonStore', 'Spotify');
DELETE FROM media WHERE playlist_path LIKE '%bostonglobe.com%';
ALTER TABLE media DROP COLUMN tags;
sqlite-utils disable-fts hackernews.tw.db media
sqlite-utils vacuum hackernews.tw.db
zstd hackernews.tw.db
Recipe: hackernews_only_direct.tw.db
cp hackernews.tw.db hackernews_only_direct.tw.db
sqlite-utils hackernews_only_direct.tw.db 'delete from media where playlist_path in (select path from playlists)'
sqlite3
ATTACH 'hackernews_only_direct.tw.db' AS h;
ATTACH '/home/xk/lb/hn.db' AS hn;
create table hn_score as
  select path, max(score) as score
  from hn_story
  where path in (select path from h.media)
  group by 1;
alter table h.media add column score int;
UPDATE h.media SET score = (
  SELECT score FROM hn_score WHERE hn_score.path = h.media.path
) WHERE EXISTS (
  SELECT 1 FROM hn_score WHERE hn_score.path = h.media.path
); -- this takes about eight minutes
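A quick sanity check afterwards (the query is mine; the threshold matches the ">7 hn score" note above):
sqlite-utils hackernews_only_direct.tw.db 'select count(*) from media where score > 7'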
Mistakes
If you have a careful eye, you will notice that the results are not perfect. Here are some weird things I noticed which are probably my fault:
mistakes, hn_story_common_domains
- duplicates (or maybe this is a bug in uniq (GNU coreutils) 9.0?)
- co.uk and other two-letter second-level domains aren’t handled by my awk query (a possible fix is sketched below)
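A hedged sketch of a fix: also special-case "co" and similar labels (my list is incomplete) so that example.co.uk keeps its registrable label:
function domains
    awk -F / '{l=split($3,a,"."); sld=a[l-1]; print (sld=="com"||sld=="co"||sld=="org"||sld=="net"||sld=="ac"?a[l-2] OFS:X) sld OFS a[l]}' OFS="." $argv
end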
mistakes, nouns
- ” is included (fixed with v1.19.027) but I’m too lazy to update the data in this repo unless someone submits a PR here
mistakes, tubewatch
- In hindsight I probably should’ve filtered out some spam by using is_dead IS NULL or something (a sketch follows below)
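A hedged sketch of what that filter might look like, assuming is_dead is a column on hn_story in hn.db (reusing the ATTACH setup from the recipe above):
DELETE FROM h.media WHERE path IN (SELECT path FROM hn_story WHERE is_dead IS NOT NULL);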