How to choose your field?
You should observe experienced scientists
How did you choose your scientific field?
Was it a careful process? Could you give quantitative reasons at the time? If you're like me, chances are it was a bit of a random walk, influenced by what courses you liked, a specific inspiring high school teacher, or encouragement from friends and family.
Is there a better way to find meaningful and important work?
In particular, what quantitative information does a young scientist need?
My hypothesis: Experienced scientists who switch fields are a strong signal for what is important science.
Here are the reasons:
Scientists who switch fields chose to give up an established track record and head start in their previous field, because they thought switching fields was that important.
They made the conscious decision to invest time first learning the technical details of a new field, then understanding its important problems, and only then doing research.
Scientists who have switched fields have a unique and valuable perspective compared to the average scientist.
They have deep knowledge of both fields, and are actually able to weigh the relative merits of the two fields, unlike the average scientist working only in one field.
So the question becomes: How can we measure scientists switching fields?
The dataset: author publishing trends on ArXiv
Here, I measure author trends over 2.5 million papers published in the last 3 decades on the preprint server ArXiv (widely used by math, physics, computer science, etc).
On ArXiv, each paper is published in a certain subject category (or a few categories). This lets us track which fields a given author publishes in each year.
You can find more details of the analysis at the end of this post, and in my GitHub project.
Update April 2023: See Our World in Data website for a lot of useful similar data on AI publications and statistics.
What was measured
Author number in each field in each year
Each author is someone who published at least one paper in some field in a given year, identified by their name as it appears on ArXiv (which is not a perfect identifier).
If an author publishes in multiple fields in a given year, they are counted as a fraction of an author in each field, so that their total contribution to the author count is 1.
I measure authorship based solely on the union of all fields that an author publishes in within a given year. It is not dependent on the relative number of publications in different fields for a given author, or on the number of co-authors on different papers. This helps control for differences in publishing practices in different fields.
The goal here is to extract a metric that reveals switching behavior. Authors who make sustained transitions between fields will show up as +1 author in the new field.
Net transition count per year of authors who switch from one field to another
The net transition count from field A to field B is how many authors who were in field A switch to a new field B in a specific year, minus the count of authors who switched in the reverse direction during that year.
I calculate this in a way that accounts for fractional changes in authorship (straddling multiple fields). See Appendix 1 for details.
If an author makes a sustained transition from field A to field B in year X, I measure a transition count of +1 from A to B in year X only.
However, if someone switches back and forth multiple times, I count multiple transitions.
If someone switches gradually over multiple years, they will contribute a fractional transition during each of those years.
I choose to report the net transition count (the count of transitions in the forward direction minus the count in the reverse direction), because it is the ultimate relevant quantity.
This is a better metric than the actual count in one direction because, if an author switched from A to B every year, they would (on average) contribute 0 to the net transition count (and this is the "correct" interpretation, as they are really just someone who straddles both fields, but never decides to change their average behavior). However, if one reported just the transition count in one direction, it would look deceptively large because of this author. Measuring net transitions seems more appropriate.
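To make the difference between gross and net counts concrete, here is a minimal sketch with a hypothetical author who flips between two fields every year. It assumes one field per year for simplicity (no fractional authorship), and the function name `yearly_transitions` is made up for this illustration; it is not taken from the GitHub project.

```python
from collections import defaultdict

def yearly_transitions(fields_by_year):
    """Return {(year, source, target): count} for one author.

    Simplifying assumption: the author has exactly one field per year.
    """
    counts = defaultdict(float)
    years = sorted(fields_by_year)
    for prev, curr in zip(years, years[1:]):
        src, dst = fields_by_year[prev], fields_by_year[curr]
        if src != dst:
            counts[(curr, src, dst)] += 1.0
    return counts

# Hypothetical author who straddles A and B, flipping every year.
flipper = {2010: "A", 2011: "B", 2012: "A", 2013: "B", 2014: "A"}
events = yearly_transitions(flipper)

# Gross counts in each direction, summed over all years.
gross_a_to_b = sum(v for (_, s, t), v in events.items() if (s, t) == ("A", "B"))
gross_b_to_a = sum(v for (_, s, t), v in events.items() if (s, t) == ("B", "A"))

# Net count: forward minus reverse.
net_a_to_b = gross_a_to_b - gross_b_to_a
print(gross_a_to_b, gross_b_to_a, net_a_to_b)
```

The gross A to B count here is 2, which looks like two people moving, while the net over the period is 0, matching the intuition that this author never really changed their average behavior.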
Net transition rate
This is just the net transition count between two fields in a given year, divided by the size of the source field in that year (the author count in the source field). It is the net flow out or in, per person in the field.
Intuitively this is meant to capture "probability I would switch if I were in this field", loosely speaking.
Cumulative net transition count
This is the sum of the historical net transition count per year, over all previous years.
It measures approximately how many total people have made a specific transition, up to that time. (Note, however, that it is built from the net count, so moves in the reverse direction are subtracted.)
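As a rough sketch of how the net transition rate and the cumulative net transition count follow from the yearly net counts, consider the toy series below. All numbers are invented for illustration and do not come from the ArXiv data.

```python
from itertools import accumulate

# Hypothetical toy numbers, NOT taken from the ArXiv dataset.
years = [2015, 2016, 2017, 2018]
net_count_a_to_b = [2.0, 5.0, -1.0, 4.0]          # net authors moving A -> B per year
source_author_count = [100.0, 100.0, 80.0, 80.0]  # author count in source field A per year

# Net transition rate: net count divided by the size of the source field that year.
net_rate = [n / size for n, size in zip(net_count_a_to_b, source_author_count)]

# Cumulative net transition count: running sum of the yearly net counts.
cumulative_net = list(accumulate(net_count_a_to_b))

for y, r, c in zip(years, net_rate, cumulative_net):
    print(y, r, c)
```

A negative yearly value (2017 here) simply means more authors moved from B back to A than from A to B in that year, so the cumulative curve dips.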
Results: Overall trends
First, here is a summary figure showing the major trends in the dataset.
Pay attention to the right panel, which shows the cumulative net transition count in and out of different fields (this ignores completely new authors, and only measures existing authors who switch).
Fig. 1. Authorship trends on ArXiv in each major field, and cumulative net transitions by existing authors between fields.
Comments on major trends
There is an explosion in the absolute number of authors in AI related fields around 2016.
You can identify losing fields (condensed matter physics), winning fields (AI, general physics), and neutral fields (astrophysics, with little net inflow or outflow of existing authors).
You can see relevant orders of magnitude:
The absolute and relative number of publishing authors in various fields (it's not exactly precise, see caveats below).
If a large fraction of authors in a given field end up leaving, that is something to be aware of.
If you are considering a career transition, you can use this data to gauge whether you will be a person with unique cross-disciplinary knowledge (after switching into a field).
Results: Details within fields
We can also measure how people move between specific fields, to begin to answer questions like:
"The field of AI grew significantly, but which fields did people come from, and when did they make their transition?"
Fig. 2, Fig. 3, and Fig. 4 show author switching trends for a few specific fields, including AI, math, and condensed matter physics (if you are interested in other fields, see Appendix 3).
AI related fields
Highlights:
There is a massive explosion in author counts and switching rates beginning in 2016-2017.
The peak in switching rates occurs around 2019 and is now coming down (perhaps surprisingly).
Around 20% of people in AI now switched from other fields (on net).
Of those people who switched, about 1/3 each came from physics, math, and computer science.
It's interesting to speculate where these trends go into the future.
Fig. 2. Authorship trends for AI related fields.
(a) Total author number (black), and cumulative author number who have switched into AI related fields (blue). Around 20% of authors are people who have switched into AI (while remaining within ArXiv), while most current authors are new authors.
(b) Cumulative number of people who have switched into AI related fields from various other fields. Around 6,000 each come from physics, math, and computer science (non AI).
(c) Author transition number per year into AI related fields from other fields. This is the total number of authors who previously published in another field, and completed a transition into AI in that year. Interestingly, it peaks around 2019.
(d) Transition rate into AI from other fields. This is normalized to be a "per author" metric. It is (c) divided by the total author number in AI in a given year, the black curve in (a). The fraction of people entering AI fields from other fields peaked around 1%-3% in 2019-2020.
In summary:
(a) shows overall behavior. (b) decomposes the blue curve from (a) into its different components. (c) is the derivative of (b), and (d) is (c) with a convenient normalization.
Math
Highlights:
Math had essentially no significant switching trends until the AI bubble, when a large fraction (around 10-15%) switched to AI fields.
Fig. 3. Authorship trends for math.
Condensed matter physics
Highlights:
Generally, people have been switching out of condensed matter physics for a decade, with up to 1/3 of people leaving.
People flow in from nuclear and high energy physics, and out to a variety of fields, but mostly to the designation "general physics".
Fig. 4. Authorship trends for condensed matter physics.
Summary and reflections
We have looked at trends in scientific authorship over the last few decades, focusing on how existing authors move between fields, as a signal for value and importance of different fields. This has revealed trends in overall preference, as well as fine-grained trends for switching between pairs of fields, year by year.
Reflections
First and foremost, this data is something any scientist should be aware of, and I was surprised that I couldn't find it anywhere.
A career choice is not made in a vacuum; it matters what "market pressures" exist on people doing science. Seeing historical data can help you form expectations about the future.
The data shown here are of course not sufficient to make a career choice, as it is a very personal decision with many factors.
You can use this data to keep an eye out for future trends.
For example, you probably could have seen the extreme behavior related to AI already in 2018, and reflected on what that might mean.
You can analyze trends within your specific field, like condensed matter physics.
ArXiv provides more detailed category designations. In my GitHub files, you can break down transitions between these categories, to see how people transition within your field. This could be useful for making fine-grained decisions on which fields are growing, and which are dying.
Appendix 1: Additional measurement details
Here I provide a longer summary of some of the analysis methods.
In this post, I measure author trends on the preprint server ArXiv (widely used by math, physics, computer science, etc). The data is from a Kaggle dataset of all 2.5 million published papers, including author names and the categories of their subject field. For all of the data analysis and plots, and to mess with the data yourself, see my GitHub.
On ArXiv, each paper is published in a certain subject category (or a few categories). This lets us track which fields a given author publishes in each year.
Metrics
Author number in each field
If a specific author publishes in "stat.ML", "astro-ph", and "cond-mat.quant-gas" in a given year, then I count them as 1/3 of an author in each field during that year.
This is not dependent on paper number. If they publish 5 papers in stat.ML, and one paper that is cross listed in astro-ph and cond-mat.quant-gas, then I still count them as 1/3 in each. I take the union of all published categories.
To get the full author count in a field during a year, I sum all of the contributions from individual authors.
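The counting rule above can be sketched as follows. This is a minimal illustration, not the code from the GitHub project: the function name `author_counts`, the `(author, year, categories)` tuple layout, and the author name are assumptions made for this example.

```python
from collections import defaultdict

def author_counts(papers):
    """Fractional author count per (year, field).

    papers: iterable of (author, year, [categories]) tuples (assumed layout).
    Each author contributes a total of 1 per year, split evenly over the
    union of categories they published in that year.
    """
    fields = defaultdict(set)
    for author, year, categories in papers:
        fields[(author, year)].update(categories)

    counts = defaultdict(float)
    for (_author, year), cats in fields.items():
        for cat in cats:
            counts[(year, cat)] += 1.0 / len(cats)
    return counts

# The example from the text: 5 papers in stat.ML, plus one paper
# cross-listed in astro-ph and cond-mat.quant-gas, all in the same year.
papers = [("A. Author", 2020, ["stat.ML"])] * 5
papers.append(("A. Author", 2020, ["astro-ph", "cond-mat.quant-gas"]))

counts = author_counts(papers)
for key in sorted(counts):
    print(key, round(counts[key], 3))
```

Because a set is used per author and year, the five stat.ML papers carry no more weight than the single cross-listed paper, so each of the three fields receives 1/3.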
Total transition count
Intuitive description: this is the number of authors in a given year who publish in a new category for the first time in that year.
I track each specific author over time, and where they publish. When they change their distribution, I count that as a transition for the year they changed. I essentially treat the transition as uniformly distributed over gains and losses. I'll describe what I mean by that now:
Example configuration:
Imagine that an author
Gains 0.1 in A, and 0.2 in B
Loses 0.05 in C, 0.05 in D, and 0.2 in E (note these sum to 0)
If I treat this transition as a random process, there is a 0.1 probability this person transitions into A, and if they do so, there is a probability of 1/6 that they came out of C, 1/6 they came out of D, and 2/3 that they came out of E.
Thus I count a transition event of
C to A of amplitude 0.1/6
D to A of amplitude 0.1/6
E to A of amplitude 0.2/3 (that is, 0.1 × 2/3)
C to B of amplitude 0.2/6
...
Hopefully this is clear, and intuitively captures what it means to make a fractional transition.
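For readers who prefer code, here is a small sketch of the "uniformly distributed over gains and losses" rule, applied to the example configuration. The function name `transition_events` and the before/after weights are assumptions, chosen only so that the changes match the example (+0.1 in A, +0.2 in B, -0.05 in C, -0.05 in D, -0.2 in E).

```python
def transition_events(before, after):
    """Split one author's change in field weights into (source, target) amplitudes.

    before/after: {field: fractional authorship}, each summing to 1.
    Each gain is distributed over the losing fields in proportion to their loss.
    """
    fields = set(before) | set(after)
    delta = {f: after.get(f, 0.0) - before.get(f, 0.0) for f in fields}
    gains = {f: d for f, d in delta.items() if d > 0}
    losses = {f: -d for f, d in delta.items() if d < 0}
    total_loss = sum(losses.values())

    events = {}
    for target, gain in gains.items():
        for source, loss in losses.items():
            events[(source, target)] = gain * loss / total_loss
    return events

# Hypothetical weights reproducing the example's gains and losses.
before = {"A": 0.1, "B": 0.1, "C": 0.2, "D": 0.2, "E": 0.4}
after = {"A": 0.2, "B": 0.3, "C": 0.15, "D": 0.15, "E": 0.2}

events = transition_events(before, after)
for pair in sorted(events):
    print(pair, round(events[pair], 4))
```

With the 1/6, 1/6, 2/3 split over C, D, and E, the amplitudes into A come out as 0.1/6, 0.1/6, and 0.1 × 2/3, and all six amplitudes sum to the total gain of 0.3.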
To get the total transition count, I just sum up the transition matrix between all sources and target fields for each author in a given year.
I actually report the net transition count in a given year (subtracting the transition count measured in the reverse direction), so this can be negative sometimes.
Transition rate per year
This is just the total transition count between two fields in a given year, divided by the size of the source field (author count in the source field).
Intuitively this is representing something like the "probability I would switch out (or in) if I were in this field".
Cumulative total transition count
This is just the sum of the historical transition events between fields in each year.
It should give a decent representation of the total number of people who have switched between fields, over time.
Note: take care interpreting this. If people switch between fields many times in a big cycle, the overall transition rates could look large, but not mean quite what you think.
How I combined published fields
Note: as part of the analysis, I collapse various ArXiv categories into one broader field. For example, 'astro-ph.CO' (Cosmology and Nongalactic Astrophysics) and 'astro-ph.EP' (Earth and Planetary Astrophysics) both become "astrophysics". This grouping is a choice I made.
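One simple way to do this kind of collapsing is a lookup on the archive prefix, with explicit overrides for categories that should be grouped differently. The mapping below is hypothetical and only illustrates the idea; the groupings actually used in the analysis (for instance, exactly which categories count as "AI") may differ.

```python
# Hypothetical groupings for illustration; the post's actual grouping may differ.
ARCHIVE_TO_FIELD = {
    "astro-ph": "astrophysics",
    "cond-mat": "condensed matter physics",
}
# Assumed overrides: individual categories pulled out of their archive.
CATEGORY_OVERRIDES = {
    "stat.ML": "AI",
    "cs.LG": "AI",
    "cs.AI": "AI",
}

def collapse_category(category):
    """Map a detailed ArXiv category such as 'astro-ph.CO' to a broad field."""
    if category in CATEGORY_OVERRIDES:
        return CATEGORY_OVERRIDES[category]
    archive = category.split(".", 1)[0]
    # Fall back to the archive name itself when no grouping is defined.
    return ARCHIVE_TO_FIELD.get(archive, archive)

def collapse_all(categories):
    """Collapse a list of categories to the set of broad fields they cover."""
    return {collapse_category(c) for c in categories}

print(collapse_all(["astro-ph.CO", "astro-ph.EP"]))
print(collapse_all(["stat.ML", "cond-mat.quant-gas"]))
```

Collapsing before counting matters: an author who publishes in both astro-ph.CO and astro-ph.EP then counts as one full author in astrophysics rather than half an author in each of two fields.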
Appendix 2: Caveats and pitfalls of this analysis
A few details to be aware of.
Fundamental assumption that switching = value
I am assuming that switching fields is a good indicator of value, but people may also switch for other reasons.
Research in a field may be harder or easier than expected.
People may switch just to "follow the crowd" rather than as a carefully thought out plan. People may feel a pressure to label their work as "the hot new field" even if it's the same type of work they have always been doing.
ArXiv isn't complete (not representative of all scientific fields)
This is only ArXiv publishing data. Ideally I would want to include more databases across a broader range of fields (e.g. bioRxiv), and characterize the paper categories in a more general way.
Different fields have different publishing behaviors
I have tried to make the analysis robust to the publishing behaviors of different fields, but I may have missed something. I have tried to make it so that the moment an author switches (really switches) is measured as a transition of 1 author during that year.
It should not matter whether authors publish in small groups, or large groups (many co-authors).
It should not matter whether authors publish once every 5 years or many times per year.
There is one problem, which is that I may be undercounting the author number in a field if everyone only publishes every 5 years (because I count only the authors publishing in a given year, so I would count roughly 1/5 of the true author number lurking in the background).
However, the relative transition counts (the transition rates) measured are normalized to the author number, and should be robust (to this choice of 1 year vs. 5 year binning). If you want a more realistic author number, you could change the binning to 5 year size.
Author names are not a good way to identify unique authors
I use each author name as a unique string to identify an author, but this isn't correct. Some people have the same name, and I will therefore count them as the same person. To avoid this to some degree, I don't count any authors with more than 100 publications. This should help, but it also distorts the total author number. In the end it's a tradeoff, and it would be better to use something like ORCID identifiers.
Using names, which people share, could give the illusion of people switching fields if a new person appears who is in a growing field, with the same name as an older person in another field. But the fact that I don't see transitions from astro to other places might mean it's fine.
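A minimal sketch of the publication-count cutoff described above: count papers per name string, then drop names above the threshold. The function name `drop_prolific_names`, the `(paper_id, names)` layout, and the example names are assumptions for illustration only.

```python
from collections import Counter

MAX_PUBLICATIONS = 100  # cutoff quoted in the text

def drop_prolific_names(papers, max_publications=MAX_PUBLICATIONS):
    """Remove author names that appear on more than max_publications papers.

    papers: list of (paper_id, [author names]) tuples (assumed layout).
    Names above the cutoff are likely several people sharing one name string.
    """
    paper_counts = Counter(name for _, names in papers for name in set(names))
    keep = {name for name, n in paper_counts.items() if n <= max_publications}
    return [(pid, [n for n in names if n in keep]) for pid, names in papers]

# Hypothetical data: a very common name on 101 papers, a rare name on one.
papers = [(f"p{i}", ["J. Smith"]) for i in range(101)]
papers[0] = ("p0", ["J. Smith", "R. Uncommon"])

filtered = drop_prolific_names(papers)
print(filtered[0])
```

This removes the most obvious name collisions, at the cost of also removing genuinely prolific individuals, which is the tradeoff mentioned above.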
I made specific choices on how to group scientific field designations
I made specific choices on which archive subfields to consider "AI fields" and which to classify together as quantum physics, for example. Changing this will change conclusions slightly. This is slightly more complicated than it first seems: for example, astrophysics had a change of its ArXiv organization at one point, so there's massive flux between fields within ArXiv as people adjusted.
I measure net author switching rates
One might argue that the absolute switching rates are a more relevant metric. For example, if the field of AI is growing at rate X, but what's actually happening is people are transitioning into AI with rate 3X and out with rate 2X, that's probably worth knowing. However, it's not obvious how to measure this well (since it depends on the timing bin size you choose to measure transitions on). The net rate is nicely independent of the timing bin size.
Appendix 3: Data by field
See also my GitHub project, to easily generate and play with these graphs, or alter the analysis in new ways.