Hey Twitter, what preprints are you reading?

2018-04-13, rev. 2018-04-24 data-vis twitter

Introduction

I find a lot of valuable pointers to interesting arXiv preprints via Twitter. Of course, these are limited to the people I follow. However, I have always wondered what the WHOLE of Twitter is reading. Luckily, the Twitter API allows for such an exploration…

The data

To analyse this in greater detail, I set up a little tool that fetches tweets mentioning arXiv live via the Twitter streaming API. This was easy thanks to the cool library Tweepy. I left it running between 2018.03.21 and 2018.04.12 (~22 days), gathering 68 262 tweets linking to arXiv preprints. Not bad.
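Counting tweets per preprint requires mapping every tweeted link to an arXiv identifier. The collection code is not shown in this post, so the following is only a minimal sketch of that step; the regular expression, the function name extract_arxiv_ids, and the sample URLs are illustrative assumptions, not the tool that was actually used.

```python
import re

# Assumed URL shapes: arxiv.org/abs/<id> and arxiv.org/pdf/<id>[vN][.pdf].
# New-style IDs look like 1801.00001; old-style IDs look like astro-ph/0601001.
ARXIV_LINK = re.compile(
    r"arxiv\.org/(?:abs|pdf)/"
    r"((?:\d{4}\.\d{4,5})|(?:[a-z\-]+(?:\.[A-Z]{2})?/\d{7}))"
    r"(?:v\d+)?",
    re.IGNORECASE,
)


def extract_arxiv_ids(urls):
    """Return the distinct arXiv IDs (version suffix dropped) found in the URLs."""
    found = []
    for url in urls:
        match = ARXIV_LINK.search(url)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found


# Made-up sample links, for illustration only.
sample_urls = [
    "https://arxiv.org/abs/1801.00001",
    "https://arxiv.org/pdf/1801.00001v2.pdf",
    "http://arxiv.org/abs/astro-ph/0601001",
    "https://example.org/not-arxiv",
]
sample_ids = extract_arxiv_ids(sample_urls)
```

Dropping the version suffix means that links to different versions of the same preprint are counted as one paper, which seems the natural choice for a popularity ranking.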

Of course, we also need the paper metadata from arXiv. Luckily, this is not overly complicated either. The data can be harvested via the OAI-PMH API provided by arXiv. The tweets mention 14 169 arXiv papers. Quite a few…
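For a rough idea of what harvesting via OAI-PMH looks like, here is a small sketch that only builds the request URL for a single record. The base URL is an assumption (the endpoint arXiv documented around the time of writing; check the current arXiv documentation before using it), and get_record_url is a hypothetical helper name. GetRecord, identifier, and metadataPrefix are standard OAI-PMH request parameters.

```python
from urllib.parse import urlencode

# Assumption: arXiv's OAI-PMH endpoint as of 2018; it may have moved since.
OAI_BASE_URL = "http://export.arxiv.org/oai2"


def get_record_url(arxiv_id, metadata_prefix="arXiv"):
    """Build an OAI-PMH GetRecord request URL for one arXiv paper."""
    params = {
        "verb": "GetRecord",
        "identifier": f"oai:arXiv.org:{arxiv_id}",
        "metadataPrefix": metadata_prefix,  # e.g. "arXiv" or "arXivRaw"
    }
    return f"{OAI_BASE_URL}?{urlencode(params)}"


# Made-up identifier, for illustration only.
example_url = get_record_url("1801.00001")
```

Fetching the URL returns an XML document with the metadata in the requested format. When harvesting thousands of records it is polite to throttle the requests.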

The tweet data used in this post are available here. Complementary JSON data for arXiv are here (a bit redundant, as they contain the arXiv metadata in all supported formats, i.e., arXiv, arXivRaw, and oai-dc).

So without further ado, what were the most tweeted papers during these 22 days?

Here comes the top ten!

Ranking Tweets Spec Title Pdf
1 794 cs/stat Learning Memory Access Patterns here
2 673 cs/stat World Models here
3 603 cs Image Generation from Scene Graphs here
4 570 cs DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills here
5 555 astro-ph The First Naked-Eye Superflare Detected from Proxima Centauri here
6 518 cs MemGEN: Memory is All You Need here
7 503 cs Group Normalization here
8 442 cs/stat On the importance of single directions for generalization here
9 408 cs Generative Adversarial Networks for Extreme Learned Image Compression here
10 345 cs On the Algebra in Boole’s Laws of Thought here

Is it a paper or is it a tweet?

The above top ten list is nice, but it would be great to see how each preprint got popular. Is it the “effect of the paper”, where the preprint gets mentioned in many different tweets? Or is it entirely the effect of a viral tweet (“the tweet effect”)? In the latter scenario, we find the preprint high on the list because it was mentioned in a single very popular message.

Let us have a look at two sample papers from our top ten list.

  • World Models was tweeted 672 times in our dataset. This includes the following contributions:

    • 527, Top tweet #1

      Our work on “World Models” is out! Can neural network agents dream, and learn inside of their own dream worlds? Read more to find out: Full Interactive Article: link PDF: link

    • 67, Top tweet #2

      Fantastic work from @hardmaru on agents learning inside of their own mental models of environments. It is a pleasure of a read. “Can neural network agents dream, and learn inside of their own dream worlds?” - link Do I love the ancient drawing, or what!

    • 37, Top tweet #3

      So…I just read the “World Models” paper link from Ha & Schmidhuber. This is a nicely written, well researched paper with some cool/fun results. It also has a solid related work section and does a decent job putting the work into context.

    • The rest of the tweets had counts lower than 20 (42 tweets)

Clearly, the main component was the tweet of the author, with some much less popular “wows” from others.

The case of the astrophysics paper is slightly more nuanced.

  • The First Naked-Eye Superflare Detected from Proxima Centauri was tweeted 555 times in our dataset. This includes the following contributions:

    • 322, Top tweet #1

      Excited to announce the largest flare ever seen from Proxima Centauri, during which Proxima became a naked-eye-bright star and released germicidal levels of UV at Proxima b’s surface (assuming the flare and planet were on same side of the star) link

    • 89, Top tweet #2

      Proxima Centauri b is the nearest planet to us that appears to be in the ‘habitable zone” of its star, which is exciting. Hopes for life there however have diminished, as the observations now show stellar flares that would likely sterilize the planet. link

    • 48, Top tweet #3

      A huge flare briefly increased Proxima Centauri’s brightness by a factor of 68. Could life exist on the Earth-size planet around such an unstable star? link

    • 47, Top tweet #4

      Proxima Centauri just released a flare so powerful it was visible to the unaided eye. Planets there would get scorched - link

    • The rest of the tweets had counts lower than 20 (49 tweets)

Again, there is a very strong contribution from tweet #1, which actually announces the paper. However, tweets #3 and #4 add significantly to the main popularity wave. Tweet #2 produces its own popularity spike two days after the main wave.

Generally, most of the popularity of the top preprints in the analysed data can be attributed to very few “killer” tweets.
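To put a number on the “tweet effect”, one can divide each top tweet's count by the paper's total. The snippet below is a small illustration using only the counts quoted in the two breakdowns above; contribution_shares is a hypothetical helper, not part of the original analysis code.

```python
def contribution_shares(total, top_counts):
    """Return each top tweet's fraction of a paper's total tweet count."""
    return [count / total for count in top_counts]


# Counts quoted in the breakdowns above.
world_models = contribution_shares(672, [527, 67, 37])
proxima = contribution_shares(555, [322, 89, 48, 47])

print(f"World Models, top tweet share: {world_models[0]:.0%}")
print(f"Proxima Centauri, top tweet share: {proxima[0]:.0%}")
```

With these numbers, the single most popular tweet accounts for roughly 78% of the World Models count and roughly 58% of the Proxima Centauri count.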

Which arXiv specs are popular on Twitter?

In the top ten list above, besides the title, I provided the arXiv spec, i.e., the arXiv category classification.

The list contains mostly cs (computer science), with a majority of deep learning papers. However, there is also one highly tweeted paper from astro-ph.

This raises a natural question: how popular on Twitter are preprints from disciplines (specs) other than cs?

Let us have a look at the number of tweets mentioning papers that belong to a particular spec. Note that a paper can be attributed to multiple disciplines (for example, the first two top papers above are both cs and stat). In such a case, I count the tweet towards each discipline (no fractional weights). The top ten scoring specs are here:

Ranking Tweets Spec
1 41063 cs
2 12949 stat
3 11802 math
4 6032 astro-ph
5 5833 physics
6 3984 cond-mat
7 3009 hep-th
8 2632 hep-ph
9 2499 quant-ph
10 1941 q-bio

OK, we have cs, stat, and math winning, but there are also tweets on various flavours of physics and on q-bio. So, the computer scientists seem to be the early adopters, but the technology wave is creeping into other disciplines as well…
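The counting rule used for the table (a tweet about a multi-spec paper counts fully towards every spec, with no fractional weights) can be sketched as follows. The function name and the toy data are assumptions made for illustration.

```python
from collections import Counter


def tweets_per_spec(tweeted_papers, paper_specs):
    """Count tweets per spec; a paper in several specs counts fully towards each."""
    counts = Counter()
    for paper_id in tweeted_papers:  # one entry per tweet
        for spec in paper_specs.get(paper_id, ()):
            counts[spec] += 1
    return counts


# Toy data: paper "A" is both cs and stat, paper "B" is astro-ph only.
toy_specs = {"A": ("cs", "stat"), "B": ("astro-ph",)}
toy_counts = tweets_per_spec(["A", "A", "B"], toy_specs)
```

One consequence of this rule is that the per-spec counts add up to more than the total number of tweets, so the rows of the table should not be read as shares of a whole.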

Comparison of timeline behaviour in different disciplines

Another interesting issue is the timeline activity of tweets related to a particular spec. Take, for instance, stat, and have a look at the histogram (bins are two hours wide).

We see slightly lower traffic during weekends (grey area), and also some spikes related to bots.
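A histogram of this kind can be built by rounding each tweet's timestamp down to the start of its two-hour bin and flagging the bins that fall on a Saturday or Sunday. This is a sketch with assumed function names; it also assumes that all timestamps are in the same time zone, which the post does not specify.

```python
from collections import Counter
from datetime import datetime


def two_hour_bin(timestamp):
    """Round a datetime down to the start of its two-hour bin."""
    return timestamp.replace(
        hour=timestamp.hour - timestamp.hour % 2,
        minute=0, second=0, microsecond=0,
    )


def histogram(timestamps):
    """Map each bin start to (tweet count, whether the bin is on a weekend)."""
    counts = Counter(two_hour_bin(t) for t in timestamps)
    # weekday() returns 5 for Saturday and 6 for Sunday.
    return {start: (n, start.weekday() >= 5) for start, n in sorted(counts.items())}


# Made-up timestamps inside the collection window.
toy_hist = histogram([
    datetime(2018, 3, 24, 13, 45),
    datetime(2018, 3, 24, 12, 10),
    datetime(2018, 3, 21, 9, 5),
])
```

The weekend flag is what a plotting routine would use to shade the grey areas.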

Comparing stat with math, which has similar popularity, can be insightful. Let us have a look:

Note that this time the spikes are much higher than in the previous case. On closer inspection, it turns out that, surprisingly, there are a lot of bots monitoring the arXiv mathematical domains. For example, there is a family of arXiv_math… bot accounts (e.g., arXiv_math_AP) and another family of math… accounts (e.g., mathAPb) monitoring the same preprint topics. Hence, the mathematical arXiv world on Twitter is heavily dehumanised (botified?)…

The next interesting question: what do people hashtag along with the preprints?

We have to make this discipline-specific, otherwise the results will be dominated by cs. The top five hashtags for a few disciplines are here:

  • cs — DeepLearning (1420), AI (1294), NeuralNetworks (958), ML (852), BigData (835)
  • math — math (62), harrypotter (58), science (47), MachineLearning (38), STEM (24)
  • astro-ph — TTauri (303), flareddisks (303), protoplanetaryDisks (303), rings (303), thread (135)
  • physics — arXiv (147), DeleteFacebook (136), APD (120), Consultant (120), Design (120)
  • hep-ph — DarkMatter (70), HepPh (65), arXiv (30), Neutrino (23), AstroPhCO (22)

For cs these are the usual hyped suspects. In other disciplines, people do not seem to add many hashtags to their tweets. However, the seeds of a domain-specific hashtag vocabulary are definitely there (#DarkMatter, #Neutrino, #protoplanetaryDisks, etc.).

Peculiarities: #DeleteFacebook in physics accompanied this paper, and #harrypotter comes from the bot HogwartsScience.
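The per-discipline hashtag ranking boils down to keeping one counter per spec. Below is a minimal sketch, assuming each tweet has already been reduced to a pair of (specs of the linked paper, hashtags in the tweet); the names and the toy data are illustrative.

```python
from collections import Counter, defaultdict


def top_hashtags(tweets, n=5):
    """tweets: iterable of (specs, hashtags) pairs; return the top-n hashtags per spec."""
    per_spec = defaultdict(Counter)
    for specs, hashtags in tweets:
        for spec in specs:
            per_spec[spec].update(hashtags)
    return {spec: counter.most_common(n) for spec, counter in per_spec.items()}


# Toy data, for illustration only.
toy_tweets = [
    (("cs", "stat"), ["DeepLearning", "AI"]),
    (("cs",), ["DeepLearning"]),
    (("hep-ph",), ["DarkMatter"]),
]
toy_top = top_hashtags(toy_tweets, n=1)
```

Hashtags are compared case-sensitively here; lower-casing them first would merge variants such as #AI and #ai, which may or may not be desirable.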

To ABS or to PDF?

Now, a final crazy question: do people link more often to the abstract landing page on arXiv or directly to the PDF? Views on what one should do vary, see this thread. What are the statistics?

  • 54749 links were pointing to abstracts
  • 14204 links were pointing directly to PDFs
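A simple way to tell the two kinds of links apart is to look at the first path segment of the URL. The sketch below assumes the arxiv.org/abs/… and arxiv.org/pdf/… URL layout; link_kind is a hypothetical helper. It also computes the share of abstract links from the two counts above.

```python
from urllib.parse import urlsplit


def link_kind(url):
    """Classify an arXiv link as 'abs', 'pdf', or 'other' by its first path segment."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host != "arxiv.org" and not host.endswith(".arxiv.org"):
        return "other"
    first_segment = parts.path.strip("/").split("/")[0]
    return first_segment if first_segment in ("abs", "pdf") else "other"


abstract_links, pdf_links = 54749, 14204  # counts reported above
abstract_share = abstract_links / (abstract_links + pdf_links)
print(f"Share of links pointing to abstracts: {abstract_share:.0%}")
```

So roughly four out of five links point to the abstract page rather than to the PDF.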

Summary

I believe the usage of Twitter in the scientific community will increase.

In computer science, especially deep/machine learning, and in statistics it is already there. Other disciplines have a fair chance of adopting it in the near future.

In this post, I looked at some interesting aspects of Twitter usage across disciplines. However, this was definitely only scratching the surface.

The good news is that the first search tools for papers/preprints that take Twitter data into account already exist. Have a look at Andrej Karpathy’s arxiv-sanity top hype or Semantic Scholar.

The data and code for reproducing the results in this post are here.