Learning the Teochew (Chaozhou) Dialect

Lately I’ve been learning my girlfriend’s dialect of Chinese, called the Teochew dialect.  Teochew is spoken in the eastern part of the Guangdong province by about 15 million people, including the cities of Chaozhou, Shantou, and Jieyang. It is part of the Min Nan (闽南) branch of Chinese languages.

teochew-map

Above: Map of major dialect groups of Chinese, with Teochew circled. Teochew is part of the Min branch of Chinese. Source: Wikipedia.

Although the different varieties of Chinese are usually referred to as “dialects”, linguists consider them separate languages, as they are not mutually intelligible. Teochew is intelligible to neither Mandarin nor Cantonese speakers. Teochew and Mandarin diverged about 2000 years ago, so today they are about as similar as French is to Portuguese. Interestingly, linguists claim that Teochew is one of the most conservative Chinese dialects, preserving many archaic words and features from Old Chinese.

Above: Sample of Teochew speech from entrepreneur Li Ka-shing.

Since I like learning languages, naturally I started learning my girlfriend’s native tongue soon after we started dating. It helped that I spoke Mandarin, but Teochew is not close enough to simply pick up by osmosis; it still requires deliberate study. Compared to other languages I’ve learned, Teochew is challenging because very few people study it as a foreign language, so there are few language-learning resources for it.

Writing System

The first hurdle is that Teochew is primarily spoken, not written, and does not have a standard writing system. This is the case with most Chinese dialects. Almost all Teochews are bilingual in Standard Chinese, which they are taught in school to read and write.

Sometimes people try to write Teochew using Chinese characters by finding the equivalent Standard Chinese cognates, but many dialectal words don’t have any Mandarin equivalent. In these cases, you can invent new characters or substitute similar-sounding characters, but there’s no standard way of doing this.

Still, I needed a way to write Teochew, to take notes on new vocabulary and grammar. At first, I used IPA, but as I became more familiar with the language, I devised my own romanization system that captured the sound differences.

Cognates with Mandarin

Knowing Mandarin was very helpful for learning Teochew, since there are lots of cognates. Some cognates are obviously recognizable:

  • Teochew: kai shim, happy. Cognate to Mandarin: kai xin, 开心.
  • Teochew: ing ui, because. Cognate to Mandarin: ying wei, 因为

Some words have cognates in Mandarin, but mean something slightly different, or aren’t commonly used:

  • Teochew: ou, black. Cognate to Mandarin: wu, 乌 (dark). The usual Mandarin word is hei, 黑 (black).
  • Teochew: dze, book. Cognate to Mandarin: ce, 册 (booklet). The usual Mandarin word is shu, 书 (book).

Sometimes, a word has a cognate in Mandarin but sounds quite different, due to centuries of sound change:

  • Teochew: hak hau, school. Cognate to Mandarin: xue xiao, 学校.
  • Teochew: de, pig. Cognate to Mandarin: zhu, 猪.
  • Teochew: dung, center. Cognate to Mandarin: zhong, 中.

In the last two examples, we see a fairly common sound change, where a dental stop initial (d- or t-) in Teochew corresponds to an affricate (zh- or ch-) in Mandarin. This correspondence usually isn’t enough to guess a word outright, but it serves as a useful memory aid.

Finally, a lot of dialectal Teochew words (I’d estimate about 30%) don’t have any recognizable cognate in Mandarin. Examples:

  • da bo: man
  • no gya: child
  • ge lai: home

Grammatical Differences

Generally, I found Teochew grammar to be fairly similar to Mandarin, with only minor differences. Most grammatical constructions can transfer cognate by cognate and still make sense in the other language.

One significant difference in Teochew is its many fused negation markers: an initial b- or m- joins with a final to form a single negative syllable. Some examples:

  • bo: not have
  • boi: will not
  • bue: not yet
  • mm: not
  • mai: not want
  • ming: not have to

Phonology and Tone Sandhi

The sound structure of Teochew is not too different from Mandarin, and I didn’t find it difficult to pronounce. The biggest difference is that Teochew syllables may end with a stop (-p, -t, -k) or the nasal -m, whereas Mandarin syllables can only end with a vowel or a nasal (-n, -ng). The characteristic of a Teochew accent in Mandarin is replacing /f/ with /h/, and indeed there is no /f/ sound in Teochew.

The hardest part of learning Teochew for me was the tones. Teochew has either six or eight tones depending on how you count them, and these aren’t difficult to produce in isolation. However, Teochew has a complex system of tone sandhi rules, where the tone of each syllable changes depending on the tone of the following syllable. Mandarin has tone sandhi to some extent (for example, the third-tone sandhi rule where nǐ + hǎo is pronounced níhǎo rather than nǐhǎo), but Teochew takes this to a whole new level, where nearly every syllable undergoes contextual tone change.

Some examples (the numbers are Chao tone numerals, with 1 meaning lowest and 5 meaning highest tone):

  • gu5: cow
  • gu1 nek5: beef

Another example, where a falling tone changes to a rising tone:

  • seng52: to play
  • seng35 iu3 hi1: to play a game

There are tables of tone sandhi rules describing in detail which tone each tone converts into, but the process is not entirely regular and there are exceptions. As a result, I frequently get the tones wrong.

Resources for Learning Teochew

Teochew is seldom studied as a foreign language, so there aren’t many language learning resources for it. Even dictionaries are hard to find. One helpful dictionary is Wiktionary, which has the Teochew pronunciation for most Chinese characters.

Also helpful were formal linguistic grammars:

  1. Xu, Huiling. “Aspects of Chaoshan grammar: A synchronic description of the Jieyang dialect.” Monograph Series Journal of Chinese Linguistics 22 (2007).
  2. Yeo, Pamela Yu Hui. “A sketch grammar of Singapore Teochew.” (2011).

The first is a massively detailed, 300-page description of Teochew grammar, while the second is a shorter grammar sketch on a similar variety spoken in Singapore. They require some linguistics background to read. Of course, the best resource is my girlfriend, a native speaker of Teochew.

Visiting the Chaoshan Region

After practicing my Teochew for a few months with my girlfriend, we paid a visit to her hometown and relatives in the Chaoshan region. More specifically, Raoping County located on the border between Guangdong and Fujian provinces.

Left: Chaoshan railway station, China. Right: Me learning the Gongfu tea ceremony, an essential aspect of Teochew culture.

Teochew people are traditional and family-oriented, very much unlike the individualistic Western culture that I’m used to. In Raoping and Guangzhou, we attended large family gatherings in the afternoon, chatting and gossiping while drinking tea. Although they are Han Chinese, the Teochew consider themselves a distinct subgroup, with their own unique culture and language. The Teochew are especially proud of their language, which they consider to be extremely hard for outsiders to learn. Essentially, speaking Teochew is what separates “ga gi nang” (roughly translated as “our people”) from the countless other Chinese.

My Teochew is not great. Sometimes I struggle to get the tones right and to make myself understood. But at a large family gathering, a relative asked me why I was learning Teochew, and I was able to reply, albeit with a Mandarin accent: “I want to learn Teochew so that I can be part of your family”.

raoping-sea

Above: Me, Elaine, and her grandfather, on a quiet early morning excursion to visit the sea. Raoping County, Guangdong Province, China.

Thanks to my girlfriend Elaine Ye for helping me write this post. Elaine is fluent in Teochew, Mandarin, Cantonese, and English.

Clustering Autoencoders: Comparing DEC and DCN

Deep autoencoders are a good way to learn representations and structure from unlabelled data. There are many variations, but the main idea is simple: the network consists of an encoder, which converts the input into a low-dimensional latent vector, and a decoder, which reconstructs the original input. Then, the latent vector captures the most essential information in the input.

autoencoder

Above: Diagram of a simple autoencoder (Source)

One of the uses of autoencoders is to discover clusters of similar instances in an unlabelled dataset. In this post, we examine some ways of clustering with autoencoders. That is, we are given a dataset and K, the number of clusters, and need to find a low-dimensional representation that contains K clusters.

Problem with Naive Method

A naive and obvious solution is to take the autoencoder and run K-means on the latent points generated by the encoder. The problem is that the autoencoder is trained only to reconstruct the input, with no constraints on the latent representation, and this may not produce a representation suitable for K-means clustering.
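To make this naive pipeline concrete, here’s a minimal numpy sketch (the toy latent points z stand in for the encoder’s output; in practice you’d compute z = encoder(x) from a trained autoencoder):

```python
import numpy as np

def kmeans(z, k, iters=50):
    """Plain K-means on a matrix of latent points z (n x d)."""
    # Simple deterministic init: spread initial centroids across the data.
    centroids = z[np.linspace(0, len(z) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # Assign each point to its nearest centroid...
        d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # ...then move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = z[labels == j].mean(axis=0)
    return labels, centroids

# Toy latent codes: four tight blobs, standing in for encoder outputs.
rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(loc=c, scale=0.1, size=(50, 2))
                    for c in [(0, 0), (3, 0), (0, 3), (3, 3)]])
labels, centroids = kmeans(z, k=4)
```

On well-separated latents like these, K-means recovers the four blobs; trouble arises when the learned latent dimensions end up correlated or unevenly scaled, which nothing in the reconstruction loss prevents.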

naive-failure

Above: Failure example with naive autoencoder clustering — K-means fails to find the appropriate clusters

Above is an example from one of my projects. The left diagram shows the hidden representation, and the four classes are generally well-separated. This representation is reasonable and the reconstruction error is low. However, when we run K-means (right), it fails spectacularly because the two latent dimensions are highly correlated.

Thus, our autoencoder can’t trivially be used for clustering. Fortunately, there’s been some research in clustering autoencoders; in this post, we study two main approaches: Deep Embedded Clustering (DEC), and Deep Clustering Network (DCN).

DEC: Deep Embedded Clustering

DEC was proposed by Xie et al. (2016), perhaps the first model to use deep autoencoders for clustering. The training consists of two stages. In the first stage, we initialize the autoencoder by training it the usual way, without clustering. In the second stage, we throw away the decoder, and refine the encoder to produce better clusters with a “cluster hardening” procedure.

dec-model

Above: Diagram of DEC model (Xie et al., 2016)

Let’s examine the second stage in more detail. After training the autoencoder, we run K-means on the hidden layer to get the initial centroids \{\mu_j\}_{j=1}^K. The assumption is that the initial cluster assignments are mostly correct, but we can still refine them to be more distinct and separated.

First, we soft-assign each latent point z_i to the cluster centroids \{\mu_j\}_{j=1}^K using the Student’s t-distribution as a kernel:

q_{ij} = \frac{(1 + ||z_i - \mu_j||^2 / \alpha)^{-\frac{\alpha+1}{2}}}{\sum_{j'} (1 + ||z_i - \mu_{j'}||^2 / \alpha)^{-\frac{\alpha+1}{2}}}

In the paper, they fix \alpha=1 (the degrees of freedom), so the above can be simplified to:

q_{ij} = \frac{(1 + ||z_i - \mu_j||^2)^{-1}}{\sum_{j'} (1 + ||z_i - \mu_{j'}||^2)^{-1}}

Next, we define an auxiliary distribution P by:

p_{ij} = \frac{q_{ij}^2/f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}

where f_j = \sum_i q_{ij} is the soft cluster frequency of cluster j. Intuitively, squaring q_{ij} sharpens the distribution, drawing points closer to their centroids, while dividing by f_j prevents large clusters from dominating.

p-and-q-distributions

Above: The auxiliary distribution P is derived from Q, but more concentrated around the centroids

Finally, we define the objective to minimize as the KL divergence between the soft assignment distribution Q and the auxiliary distribution P:

L = KL(P||Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}

Using standard backpropagation and stochastic gradient descent, we can train the encoder to produce latent points z_i to minimize the KL divergence L. We repeat this until the cluster assignments are stable.
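The cluster-hardening computations above can be sketched in a few lines of numpy (toy z and mu stand in for the encoder output and the K-means centroids; a real implementation would backpropagate this loss through the encoder):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(100, 2))    # latent points z_i (toy stand-in)
mu = rng.normal(size=(4, 2))     # centroids mu_j from K-means, K = 4

# Soft assignment Q with a Student's t kernel (alpha = 1, as in the paper).
d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # ||z_i - mu_j||^2
q = 1.0 / (1.0 + d2)
q /= q.sum(axis=1, keepdims=True)

# Auxiliary distribution P: square Q, then normalize by cluster frequency f_j.
f = q.sum(axis=0)                # soft cluster frequencies
p = q ** 2 / f
p /= p.sum(axis=1, keepdims=True)

# KL(P || Q), the loss minimized in the cluster-hardening stage.
loss = (p * np.log(p / q)).sum()
```

Since P is treated as a fixed target, minimizing KL(P||Q) pulls Q toward the sharper P, which is what hardens the clusters.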

DCN: Deep Clustering Network

DCN was proposed by Yang et al. (2017) at around the same time as DEC. Similar to DEC, it initializes the network by training the autoencoder to only reconstruct the input, and initializes K-means on the hidden representations. But unlike DEC, it then alternates between training the network and improving the clusters, using a joint loss function.

dcn-model

Above: Diagram of DCN model (Yang et al., 2017)

We define the optimization objective as a combination of reconstruction error (first term below) and clustering error (second term below). There’s a hyperparameter \lambda to balance the two terms:

dcn-loss

This function is complicated and difficult to optimize directly. Instead, we alternate between fixing the clusters while updating the network parameters, and fixing the network while updating the clusters. When we fix the clusters (centroid locations and point assignments), then the gradient of L with respect to the network parameters can be computed with backpropagation.

Next, when we fix the network parameters, we can update the cluster assignments and centroid locations. The paper uses a rolling average trick to update the centroids in an online manner, but I won’t go into the details here. The algorithm as presented in the paper looks like this:

dcn-algorithm
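As a rough numpy sketch of just the cluster-update step (network fixed), using the rolling-average centroid update from the paper, with toy z standing in for the encoder output:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(200, 2))   # latent points from the (fixed) encoder
mu = rng.normal(size=(3, 2))    # current centroids, K = 3
count = np.ones(3)              # points absorbed by each centroid so far

# Stream over the batch: hard-assign each point to its nearest centroid,
# then nudge that centroid toward the point with step size 1/count,
# i.e. an online rolling average (Yang et al., 2017).
assignments = np.empty(len(z), dtype=int)
for i, z_i in enumerate(z):
    j = int(((mu - z_i) ** 2).sum(axis=1).argmin())
    assignments[i] = j
    count[j] += 1
    mu[j] += (z_i - mu[j]) / count[j]
```

In the full algorithm, this step alternates with SGD steps on the network parameters under the joint reconstruction-plus-clustering loss.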

Comparisons and Further Reading

To recap, DEC and DCN are both models to perform unsupervised clustering using deep autoencoders. When evaluated on MNIST clustering, their accuracy scores are comparable. For both models, the scores depend a lot on initialization and hyperparameters, so it’s hard to say which is better.

One theoretical disadvantage of DEC is that in the cluster-refinement phase, there is no longer any reconstruction loss forcing the representation to remain reasonable. The theoretical global optimum can thus be achieved trivially by mapping every input to the zero vector, although this does not happen in practice when using SGD for optimization.

Recently, there have been lots of innovations in deep learning for clustering, which I won’t be covering in this post; the review papers by Min et al. (2018) and Aljalbout et al. (2018) provide a good overview of the topic. Still, DEC and DCN are strong baselines for the clustering task, which newer models are compared against.

References

  1. Xie, Junyuan, Ross Girshick, and Ali Farhadi. “Unsupervised deep embedding for clustering analysis.” International conference on machine learning. 2016.
  2. Yang, Bo, et al. “Towards k-means-friendly spaces: Simultaneous deep learning and clustering.” Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017.
  3. Min, Erxue, et al. “A survey of clustering with deep learning: From the perspective of network architecture.” IEEE Access 6 (2018): 39501-39514.
  4. Aljalbout, Elie, et al. “Clustering with deep learning: Taxonomy and new methods.” arXiv preprint arXiv:1801.07648 (2018).

NAACL 2019, my first conference talk, and general impressions

Last week, I attended my first NLP conference, NAACL, which was held in Minneapolis. My paper was selected for a short talk of 12 minutes in length, plus 3 minutes for questions. I presented my research on dementia detection in Mandarin Chinese, which I did during my master’s.

My talk was recorded, but the organizers have not uploaded the videos yet, so I can’t share it right now. I’ll update this page with a link when the video is ready.

Visiting Minneapolis

Going to conferences is a good way for a grad student to travel for free. Some of my friends balked at the idea of going to Minneapolis rather than somewhere more “interesting”. However, I had never been there before, and in the summer, Minneapolis was quite nice.

Minneapolis is very flat and good for biking — you can rent a bike for $2 per 30 minutes. I took the light rail to Minnehaha falls (above) and biked along the Mississippi river to the city center. The downside is that compared to Toronto, the food choices are quite limited. The majority of restaurants serve American food (burgers, sandwiches, pasta, etc).

Meeting people

It’s often said that most of the value of a conference happens in the hallways, not in the scheduled talks (which you can often find on YouTube for free). For me, this was a good opportunity to finally meet some of my previous collaborators in person. Previously, we had only communicated via Skype and email. I also ran into people whose names I recognize from reading their papers, but had never seen in person.

Despite all the advances in video conferencing technology, nothing beats face-to-face interaction over lunch. There’s a reason why businesses spend so much money to send employees abroad to conduct their meetings.

Talks and posters

The accepted papers were split roughly 50-50 into talks and poster presentations. I preferred the poster format, because you get to have a 1-on-1 discussion with the author about their work, and ask clarifying questions.

Talks were a mixed bag — some were great, but for many it was difficult to make sense of anything. The most common problem was that speakers tended to dive into complex technical details, and lost sense of the “big picture”. The better talks spent a good chunk of time covering the background and motivation, with lots of examples, before describing their own contribution.

It’s difficult to make a coherent talk in only 12 minutes. A research paper is inherently a narrow and focused contribution, while the audience comes from all areas of NLP and has probably never seen your problem before. The organizers tried to group talks into related topics like “Speech” or “Multilingual NLP”, but even then, the subfields of NLP are so diverse that two random papers often had very little in common.

Research trends in NLP

Academia has a notorious reputation for inventing impractically complex models to squeeze out a 0.2% improvement on a benchmark. This may be true in some areas of ML, but it certainly wasn’t the case here. There was a lot of variety in the problems people were solving. Many papers worked with new datasets, and even those using existing datasets often proposed new tasks that weren’t considered before.

A lot of papers used similar model architectures, like some sort of Bi-LSTM with attention, perhaps with a CRF on top. None of the results are directly comparable, because everybody is solving a different problem. I guess it shows the flexibility of Bi-LSTMs to be so widely applicable. For me, the papers that did something different (like applying quantum physics to NLP) really stood out.

Interestingly, many papers did experiments with BERT, which was presented at this very conference! Last October, the BERT paper bypassed the usual conventions and announced its results without peer review, so the NLP community had known about it for a long time, but it was only now being officially presented at a conference.

Hypothesis testing for difference in Pearson / Spearman correlations

The Pearson and Spearman correlation coefficients measure how closely two variables are correlated. They’re useful as an evaluation metric in certain machine learning tasks, when you want the model to predict some kind of score, but the actual value of the score is arbitrary, and you only care that the model puts high-scoring items above low-scoring items.

An example of this is the STS-B task in the GLUE benchmark: the task is to rate pairs of sentences on how similar they are. The task is evaluated using Pearson and Spearman correlations against the human ground-truth. Now, if model A has Spearman correlation of 0.55 and model B has 0.51, how confident are you that model A is actually better?

Recently, the NLP research community has advocated for more significance testing (Dror et al., 2018): report a p-value when comparing two models, to distinguish true improvements from fluctuations due to random chance. However, hypothesis testing is rarely done for the Pearson and Spearman metrics: it’s not mentioned in the hitchhiker’s guide cited above, and it’s not supported by the standard ML libraries in Python and R. In this post, I describe how to do significance testing for a difference in Pearson / Spearman correlations, and give some references to the statistics literature.

Definitions and properties

The Pearson correlation coefficient is defined by:

r_{xy} = \frac{\sum_i^n (x_i - \bar{x}) (y_i - \bar{y})}{\sqrt{\sum_i^n (x_i - \bar{x})^2} \sqrt{\sum_i^n (y_i - \bar{y})^2}}

The Spearman rank-order correlation is defined as the Pearson correlation between the ranks of the two variables, and measures the relative order between them. Both correlation coefficients range between -1 and 1.
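These definitions are straightforward to compute directly; here’s a minimal numpy sketch (ignoring tie correction for Spearman):

```python
import numpy as np

def pearson(x, y):
    # Pearson: covariance normalized by the two standard deviations.
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

def spearman(x, y):
    # Spearman: Pearson correlation between the ranks (no tie handling).
    rank = lambda v: v.argsort().argsort().astype(float)
    return pearson(rank(x), rank(y))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3                 # monotone but nonlinear relationship
r_p = pearson(x, y)        # below 1, since the relationship isn't linear
r_s = spearman(x, y)       # exactly 1, since the ranks agree perfectly
```

The example shows the difference in what they measure: Spearman only cares that higher x means higher y, while Pearson penalizes any departure from a straight line.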

Pearson’s correlation is simpler, has nicer statistical properties, and is the default option in most software packages. However, de Winter et al. (2016) argue that Spearman’s correlation works better with non-normal data and is more robust to outliers, so it is generally preferable over Pearson’s correlation.

Significance testing

Suppose we have the predictions of model A and model B, and we wish to compute a p-value for whether their Pearson / Spearman correlation coefficients are different. Start by computing the correlation coefficients for both models against the ground truth.

Then, apply the Fisher transformation to each correlation coefficient:

z = \frac{1}{2} \log(\frac{1+r}{1-r})

This transforms r, which lies between -1 and 1, into z, which ranges over the whole real line. It turns out that z is approximately normal, with nearly constant variance that depends only on N (the number of data points) and not on r.

For Pearson correlation, the standard deviation of the estimator \hat r_p is given by:

\mathrm{SD}(\hat r_p) = \sqrt{\frac{1}{N-3}}

For Spearman rank-order correlation, the standard deviation of the estimator \hat r_s is given by:

\mathrm{SD}(\hat r_s) = \sqrt{\frac{1.060}{N-3}}

Now, we can compute the p-value because the difference of the two z values follows a normal distribution with known variance.

R implementation

The following R function computes a p-value for the two-tailed hypothesis test, given a ground truth vector and two model output vectors:

cor_significance_test <- function(truth, x1, x2, method="pearson") {
  n <- length(truth)
  # Correlate each model's predictions with the ground truth
  cor1 <- cor(truth, x1, method=method)
  cor2 <- cor(truth, x2, method=method)
  # Fisher-transform both correlation coefficients
  fisher1 <- 0.5*log((1+cor1)/(1-cor1))
  fisher2 <- 0.5*log((1+cor2)/(1-cor2))
  # Standard deviation of the transformed estimator
  if(method == "pearson") {
    expected_sd <- sqrt(1/(n-3))
  } else if(method == "spearman") {
    expected_sd <- sqrt(1.060/(n-3))
  } else {
    stop("method must be 'pearson' or 'spearman'")
  }
  # Two-tailed p-value for the difference
  2*(1-pnorm(abs(fisher1-fisher2), sd=expected_sd))
}

Naturally, the one-tailed p-value is half of the two-sided one.

For details of other similar computations involving Pearson and Spearman correlations (eg: confidence intervals, unpaired hypothesis tests), I recommend the Handbook of Parametric and Nonparametric Statistical Procedures (Sheskin, 2000).

Caveats and limitations

The variance formula for the Pearson correlation is theoretically solid and very accurate. For Spearman, the constant 1.060 has no theoretical backing; it was derived experimentally by Fieller et al. (1957) by running simulations with variables drawn from a bivariate normal distribution. Fieller claimed that the approximation is accurate for correlations between -0.8 and 0.8. Borkowf (2002) warns that this approximation may be off if the distribution is far from a bivariate normal.

The procedure here for Spearman correlation may not be appropriate if the correlation coefficient is very high (above 0.8) or if the data is not approximately normal. In that case, you might want to try permutation tests or bootstrapping methods — refer to Bishara and Hittner (2012) for a detailed discussion.
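As a sketch of the bootstrap alternative, here’s a paired bootstrap over examples in numpy (the spearman helper ignores ties, and the function and variable names are mine, not from any library):

```python
import numpy as np

def spearman(x, y):
    # Spearman correlation via Pearson on ranks (no tie correction).
    rank = lambda v: v.argsort().argsort().astype(float)
    rx, ry = rank(x), rank(y)
    rxc, ryc = rx - rx.mean(), ry - ry.mean()
    return (rxc * ryc).sum() / np.sqrt((rxc ** 2).sum() * (ryc ** 2).sum())

def bootstrap_pvalue(truth, x1, x2, n_boot=2000, seed=0):
    """Two-sided paired bootstrap test for a difference in Spearman
    correlations of two models against the same ground truth."""
    rng = np.random.default_rng(seed)
    n = len(truth)
    observed = spearman(truth, x1) - spearman(truth, x2)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample examples with replacement
        diffs[b] = spearman(truth[idx], x1[idx]) - spearman(truth[idx], x2[idx])
    # Proportion of resampled differences on the "wrong" side of zero.
    p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return observed, min(p, 1.0)

rng = np.random.default_rng(1)
truth = np.arange(30.0)
x1 = truth + rng.normal(size=30)   # model A: tracks the truth closely
x2 = rng.normal(size=30)           # model B: pure noise
obs, p = bootstrap_pvalue(truth, x1, x2)
```

Because both models are resampled over the same indices, the paired bootstrap accounts for the dependence between the two models’ predictions, which the Fisher-transform method above ignores.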

References

  1. Dror, Rotem, et al. “The hitchhiker’s guide to testing statistical significance in natural language processing.” Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2018.
  2. de Winter, Joost CF, Samuel D. Gosling, and Jeff Potter. “Comparing the Pearson and Spearman correlation coefficients across distributions and sample sizes: A tutorial using simulations and empirical data.” Psychological methods 21.3 (2016): 273.
  3. Sheskin, David J. “Parametric and nonparametric statistical procedures.” Chapman & Hall/CRC: Boca Raton, FL (2000).
  4. Fieller, Edgar C., Herman O. Hartley, and Egon S. Pearson. “Tests for rank correlation coefficients. I.” Biometrika 44.3/4 (1957): 470-481.
  5. Borkowf, Craig B. “Computing the nonnull asymptotic variance and the asymptotic relative efficiency of Spearman’s rank correlation.” Computational statistics & data analysis 39.3 (2002): 271-286.
  6. Bishara, Anthony J., and James B. Hittner. “Testing the significance of a correlation with nonnormal data: comparison of Pearson, Spearman, transformation, and resampling approaches.” Psychological methods 17.3 (2012): 399.

Why Time Management in Grad School is Difficult

Graduate students are often stressed and overworked; a recent Nature report states that grad students are six times more likely to suffer from depression than the general population. Although there are many factors contributing to this, I suspect that a lot of it has to do with poor time management.

In this post, I will describe why time management in grad school is particularly difficult, and some strategies that I’ve found helpful as a grad student.


As a grad student, I’ve found time management to be far more difficult than during either my undergraduate years or my time working in industry. Here are a few reasons why:

  1. Loose supervision: as a grad student, you have a lot of freedom over how you spend your time. There are no set hours, and you can go a week or more without talking to your adviser. This can be both a blessing and a curse: some find the freedom liberating, while others struggle to be productive. In contrast, in an industry job, you’re expected to report to daily standup and you get assigned tickets each sprint, so others essentially manage your time for you.
  2. Few deadlines: grad school is different from undergrad in that you have a handful of “big” deadlines a year (eg: conference submission dates, major project due dates), whereas in undergrad, the deadlines (eg: assignments, midterms) are smaller and more frequent.
  3. Sparse rewards: most of your experiments will fail. That’s the nature of research — if you know it’s going to work, then it’s no longer research. It’s hard to not get discouraged when you struggle for weeks without getting a positive result, and start procrastinating on a multitude of distractions.

Basically, poor time management leads to procrastination, stress, burnout, and generally having a bad time in grad school 😦


Some time management strategies that I’ve found to be useful:

  1. Track your time. When I first started doing this, I was surprised at how much time I spent doing random, half-productive stuff not really related to my goals. It’s up to you how to do this — I keep a bunch of Excel spreadsheets, but some people use software like Asana.
  2. Know your plan. My adviser suggested a hierarchical format with a long-term research agenda, medium-term goals (eg: submit a paper to ICML), and short-term tasks (eg: run X baseline on dataset Y). Then you know if you’re progressing towards your goals or merely doing stuff tangential to it.
  3. Focus on the process, not the reward. It’s tempting to celebrate when your paper gets accepted, but the flip side is that you’re going to be depressed if it gets rejected. Your research will have many failures: paper rejections and experiments that somehow don’t work. Instead, celebrate when you finish the first draft of your paper; reward yourself when you finish implementing an algorithm, even if it fails to beat the baseline.

Here, I plotted my productive time allocation in the last 6 months:

time_allocation.png

Most interestingly, only a quarter of my time is spent coding or running experiments, which seems to be much less than for most grad students. I read a lot of papers to avoid reinventing things that others have already done.

On average, I spend about 6 hours a day doing productive work (including weekends) — a quite reasonable workload of about 40-45 hours a week. Contrary to some perceptions, grad students don’t have to be stressed and overworked to be successful; allowing time for leisure and social activities is crucial in the long run.

MSc Thesis: Automatic Detection of Dementia in Mandarin Chinese

My master’s thesis is done! Read it here:

MSc Thesis (PDF)

Video

Slides

Talk Slides (PDF)

Part of this thesis is replicated in my paper “Detecting dementia in Mandarin Chinese using transfer learning from a parallel corpus” which I will be presenting at NAACL 2019. However, the thesis contains more details and background information that were omitted in the paper.

Onwards to PhD!

Books I’ve read in 2018

I read 28 books in 2018 (about one every 2 weeks). Recently, I’ve been getting into the habit of taking notes in the margins and writing down a summary of what I learned after finishing them.

This blog post is a more-or-less unedited dump of some of my notes on some of the books I read last year. They were originally notes for myself and weren’t meant to be published, so a lot of ideas aren’t very well fleshed out. Without further ado, let’s begin.


Understanding Thermodynamics by H. C. Van Ness

Understanding Thermodynamics (Dover Books on Physics)

Pretty short, 100-page book that gives an intuitive introduction to various topics in thermodynamics and statistical mechanics. It’s meant to be a supplementary text, not a main text, so some really important things were omitted, which was confusing to me since I’d never studied this topic before. Some ideas I learned:

  • Energy can’t really be defined since it’s not a physical property. Can only write it as a sum of a bunch of things, and note that within a closed system, it always stays the same (first law of thermodynamics).
  • A process is reversible if you can do it in reverse to get back the initial state. No physical process is perfectly reversible, but the closer it is to reversible, the more efficient it is.
  • Heat engines convert a heat differential into work. Two types are the Otto cycle (used in cars) and the Carnot cycle. Surprisingly, heat engines cannot be perfectly efficient, even under ideal conditions; the Carnot limit puts an upper bound. A heat engine that perfectly converts heat into work violates the second law of thermodynamics.
  • Second law of thermodynamics says that entropy always increases; moreover, it increases for irreversible processes and remains the same for reversible processes. This is useful for determining when a “box of tricks” (taking in compressed air, outputting cold air at one end and hot air at the other end) is possible. The book doesn’t give much intuition about why the definition of entropy makes sense though, it literally tries random combinations of variables until one “works” (gives a constant value experimentally).
  • Second law of thermodynamics is merely an empirical observation, and can’t be proved. In fact, it can be challenged at the molecular level (eg: Maxwell’s demon) which isn’t easily refutable.
  • Statistical mechanics gives an alternate definition of entropy in terms of molecular states, and from it, you can derive various macroscopic properties like temperature and pressure. However, it only works well for ideal gases, and doesn’t quite explain or replace thermodynamics.
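For reference, the Carnot limit mentioned above has a simple closed form (standard thermodynamics, not the book’s notation): a heat engine running between a hot reservoir at absolute temperature T_H and a cold one at T_C can do no better than

```latex
\eta \le \eta_{\mathrm{Carnot}} = 1 - \frac{T_C}{T_H}
```

so perfect efficiency would require a cold reservoir at absolute zero.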

Indian Horse by Richard Wagamese

Indian Horse: A Novel

This book is about the life of an Ojibway Indian, living in northern Ontario and growing up in the 60s. When he was young, they sent him to a residential school where he was badly treated and not allowed to speak his own language. He found hockey and got really good at it, but faced problems with racism so he couldn’t really make it in the big leagues with white people. Later, he faced more racism in his job as a logger. Eventually, he developed an alcohol addiction due to this disillusionment and finally comes to terms with his life.

Very interesting perspective on the indigenous people of Canada, a group that most of us don’t think about often. Despite numerous government subsidies, they’re still some of the poorest people in the country, with low education levels. Some people think it’s laziness, but they’ve had a history of mistreatment in residential schools and were subjected to racism until very recently, so it’s difficult for them to integrate into society. Their reserves are often a long distance from major population centers, which means very few opportunities. Furthermore, their culture doesn’t really value education. Overall, great read about a group currently marginalized in Canadian society.

The Power of Habit by Charles Duhigg

The Power of Habit: Why We Do What We Do in Life and Business

Book that discusses various aspects of how habits work. On a high level, habits have three components: cue, routine, and reward. The cue is a set of conditions, such that you automatically perform a routine in order to get a reward. After a while, you will crave the reward when given the cue, and perform the routine automatically (even if the reward is intermittent).

To change a habit, you can’t just force yourself not to do it, because you will constantly crave the reward. Instead, replace the routine with something else that gives a similar reward but is less harmful. Forcing yourself to act against a habit depletes your willpower, so it’s much better to change the habit itself, so that you do it automatically and retain your willpower.

Large changes are often precipitated by a small “keystone” habit change that catalyzes a series of systemic changes. For example, Alcoa, an aluminum company, improved its overall efficiency when it decided to focus on safety. Sometimes a disaster is needed to bring about a systemic change in an organization, like the fire in King’s Cross station or a hospital operating on the wrong side of a patient. Peer pressure is also important: for example, it’s a key component of Alcoholics Anonymous, and it helped drive the black civil rights movement forward.

Overall, pretty interesting read, although I think there’s too much dramatic storytelling and anecdotes; I would’ve preferred more scientific discussion and a bit less storytelling.

Why We Sleep by Matthew Walker

Why We Sleep: Unlocking the Power of Sleep and Dreams

This book gives a comprehensive scientific overview of sleep. Although there are still many unanswered questions, there’s been a lot of research lately and this book sums it up.

Sleep is a very necessary function of life. Every living organism requires it, although in different amounts, and total lack of sleep very quickly leads to death. However it’s still unclear exactly why sleep is so important.

There are two types of sleep: REM (rapid eye movement) and NREM sleep. REM sleep is a much lighter form of sleep, closer to the awake state, and is also when you dream; NREM is a much deeper sleep. The two types can easily be distinguished by measuring brain waves.

Sleep deprivation is really bad. It doesn’t even need to be total deprivation: a few nights of six hours of sleep is as bad as pulling an all-nighter. When you’re sleep deprived, you’re a lot worse at learning things and controlling your emotions, and you’re also more likely to get sick and more susceptible to cancer.

Dreams aren’t that well understood, but they seem to consolidate memories, including moving them from short term to long term storage. REM sleep especially lets your brain find connections between different ideas, and you’re better at problem solving immediately after.

Insomnia is a really common problem in our society, partly because society is structured to encourage sleeping less. Sleeping pills are ineffective at best (prescription ones like Ambien and benzodiazepines are actually quite harmful); the recommended treatment is behavioral: keep a regular sleep schedule, avoid caffeine, nicotine, and alcohol, don’t take naps, and keep the bedroom dark.

My parents always told me it’s bad to stay up so late, but science doesn’t really support this. Different people have different chronotypes, which are determined by genetics (and change somewhat with age). It’s okay to sleep really late, as long as you maintain a consistent sleep schedule.

Overall I learned a lot from this book but it’s a fairly dense read, with lots of information about different topics, and it took me over a month to finish it.

Notes from the Underground by Fyodor Dostoyevsky

Notes From The Underground

I read this Dostoyevsky book because it had an interesting plot about a man who tries to rescue a prostitute. It turns out that the prostitute-rescuing part is not really the central event of the book, but I found it quite interesting nevertheless. The novella is short (90 pages), unlike Dostoyevsky’s other books, which are super long. It explores a lot of philosophical and psychological ideas in an interesting setting.

The unnamed narrator is a man from the “underground”: a middle-aged civil servant of some kind with health problems. He rejects the idea that man must always do the rational thing, for then he would be like a machine. He rejoices in doing stupid things from time to time, just because he feels like it, so that he can retain some of his humanity. In the second part of the book, the narrator feels that his peers don’t see him as an equal, and goes to extreme lengths to remedy this. He forcefully invites himself to a dinner party with old friends, and is dismayed that his social status is so low that he’s simply ignored. He would much rather have a fight than be ignored, and clumsily tries to provoke one. Later he meets a prostitute, Liza, whom he offers to save. However, when she actually shows up at his place, he is stuck in his own world and lectures her about the virtues of morality without actually helping her.

The narrator feels surreal, valuing social acceptance to an extreme degree. After all, he is materially well-off; he is at least rich enough to hire a servant. However, as long as he feels inferior to his peers, he is frustrated. Moreover, the harder he tries to gain their respect, the more his efforts backfire and lower his position in their eyes. Social recognition isn’t something you can pursue directly.

Factfulness by Hans Rosling

Factfulness: Ten Reasons We're Wrong About the World--and Why Things Are Better Than You Think

This book was written by Hans Rosling (the same guy that made The Joy of Stats documentary) just before he died in 2017. It uses stats to show that despite what the media portrays, and despite popular conception, the world is not such a bad place. Extreme poverty is on the decline, children are being vaccinated, women are going to school.

At the beginning of the book, he gives a quiz of 13 questions. Most people score terribly, worse than random chance, consistently guessing that the world is worse than it actually is. Without looking at stats, it’s easy to be systematically misled and fall into fallacies like ignoring the magnitude of effects, generalizing your own experience to others, or acting based on fear. Maybe because of my stats background, a lot of what he says was quite obvious to me; I scored 9 on the quiz, higher than pretty much everyone. The book confirmed some things I already knew, but it still had good insights on poverty and developing nations.

A big takeaway for me is to be thankful for what we have, after seeing how different life is at levels 1-3. Canada is a level 4 country (where people spend more than $32 a day), yet people make fun of me for making 20k/year “poverty” grad school wages. Grad students in Canada should be thankful that we have electricity and running water and can eat out at restaurants, not sad that we can’t afford luxury cars and condos.

Sky Burial by Xinran Xue

Sky Burial by Xinran (2005) Paperback

In this novel, a Chinese woman, Shu Wen from Suzhou, travels to Tibet to search for her missing husband. This takes place in 1958, shortly after the Chinese Communist Party took control of Tibet. On the way there, she picks up a Tibetan woman, Zhuoma. They get into trouble in the mountains and meet a Tibetan family, and gradually Wen integrates into Tibetan culture and learns the language and customs. Time passes quickly, and before you realize it, 30 years have passed while they have practically no information from the outside world. In the end, Wen does find out what happened to her husband through his diaries, but it’s a bittersweet ending: her world has changed unrecognizably and her husband is dead.

The author leaves it ambiguous whether this is a work of fiction or a true story; all the facts seem believable, other than somehow not finding out about the Great Famine and the Cultural Revolution for decades. A lot of interesting Tibetan customs are explained: the nomadic lifestyle, polyamorous family structures, Buddhist religious beliefs, and the practice of sky burial, in which vultures eat the dead. The relationship between the Chinese and the Tibetans has always been a contentious one, and in this book the characters form a connection of understanding between the two ethnic groups.

Tibet seems like a really interesting place that I should visit someday. However, it’s unclear how much of their traditional culture is still accessible, due to the recent Han Chinese migrations. Also, it’s currently impossible to travel freely in Tibet without a tour group if you’re not a Chinese citizen.

Getting to YES by Fisher, Ury, and Patton

Getting to Yes: Negotiating Agreement Without Giving In

This book tells you how to negotiate more effectively. A common mistake is positional negotiation, where each side picks an arbitrary position (eg: buy the car for $5000) and goes back and forth until you’re tired and agree, or you both walk out. Positional negotiation is highly arbitrary and often leads to no agreement, which is bad for both parties.

Some ways to negotiate in a more principled way:

  • Empathize with the other party: get to know them and their values, and treat the negotiation as both parties against a common problem rather than you trying to “win”.
  • Focus on interests, rather than positions. During the negotiation, figure out what each party really wants; sometimes, it’s possible to give them something that’s valuable for them but you don’t really care about. Negotiation is a nonzero sum game, so try to find creative solutions that fulfill everybody’s interests, rather than fight over a one-dimensional figure.
  • When creative solutions are not possible (both sides just want money), defer to objective measures like industry standards. This gives you both an anchor to use, rather than negotiating in a vacuum.
  • Be aware of your and the other party’s BATNA: best alternative to negotiated agreement. This determines who holds more power in a negotiation, and improving it is a good way to get more leverage.

Trump: A Graphic Biography by Ted Rall

Trump: A Graphic Biography

A biography of Trump in graphic novel format. This book was written after Trump won the Republican primaries (May 2016) but before he won the presidency (Nov 2016).

First, the book describes the political and economic circumstances that led to Trump coming into power. After the 2008 financial crisis, many low-skilled Americans felt like there was little economic opportunity for them. Many politicians had come and gone, promising change, but nothing happened. For them, Trump represented a change from the political establishment. They didn’t necessarily agree with all of his policies, they just wanted something radical.

Trump was born after WW2 to a wealthy family in New York City. He studied economics and managed a real estate empire for a few decades, which made him a billionaire. Through his real estate deals, he proved himself a cunning and ruthless negotiator, willing to behave unethically and use deception to get what he wants.

This was a good read because most of my friend group just thinks Trump is “stupid”, along with everyone who voted for him, and I never really understood why he was so popular with the other demographic. As a biography, the graphic novel format works well because it’s much shorter; most biographies go into far more detail about a single person’s life than I care to know.

12 Rules for Life by Jordan Peterson

Jordan Peterson’s new book quickly hit #1 on the bestseller lists after being released this year. He’s famous around UofT for speaking out against social justice warriors, but I later found out that he has a lot of YouTube videos on the philosophy of how to live your life. This book distills many of these ideas into 12 “rules” to live by in order to live a good and meaningful life.

These ideas are the most interesting and novel to me:

  • Dominance hierarchy: humans (especially men) instinctively place each other on a hierarchy, where the person at the top has all the power and status, and gets all the resources. Women want to date guys near the top of the hierarchy, and men near the top get many women easily while men at the bottom can’t even find one. Therefore, it’s essential to rise to the top of the dominance hierarchy.
  • Order and chaos: order is the part of the world that we understand, that behaves according to rules; chaos is the unknown, risk, failure. To live a meaningful life is to straddle the boundary between order and chaos, and have a little bit of both.
  • When raising children, it’s the parents’ responsibility to teach them how to behave and follow social norms, because otherwise society will treat them harshly, and this will snowball into social isolation later in life. Children should also be encouraged to do risky things (within reason) to explore and develop their masculinity.

Some of the other rules are more obvious. Examples include: be truthful to yourself, choose your friends wisely, improve yourself incrementally rather than comparing yourself to others, confront issues quickly as they arise. I guess depending on your personality and prior experience, you might find a different subset of these rules to be obvious.

Initially, I found JP obnoxious because of the lack of scientific rigour in his arguments; he just seems convincing because he’s well-spoken. The book does a slightly better job than the videos of substantiating the arguments and citing psychology research papers. JP also has a tendency to cite literature, and when he goes into things like biblical archetypes of Christ, or Cain and Abel, I have no idea what he’s talking about anymore. The book felt a bit long, but overall it was still a good read; I learned a lot from it and from diving deeper into the psychology papers he cited.

Analects by Confucius

The Analects of Confucius: A Philosophical Translation (Classics of Ancient China)

The Analects (论语) is a book of philosophy by Confucius and lays down the groundwork for much of Chinese thinking for the next 2500 years. It’s the second book I’ve read in ancient Chinese literature after the Art of War. It’s written in a somewhat different style — it has 20 chapters of varying lengths, but the chapters aren’t really organized by topic and the writing jumps around a lot.

Confucius tells you how to live your life not by appealing to religion, but by describing characteristics that he considers “good”, with examples of what is and isn’t good conduct. A few recurring ideas:

  • junzi 君子 – exemplary person. The ideal, wise person that we should strive to be. A junzi strives to be excellent (德) and honorable (信), and is not arrogant, greedy, or materialistic. He seeks knowledge, respects elders, is not afraid to speak up, and conducts himself authoritatively.

  • li 礼 – ritual propriety. The idea that there are certain “rituals” that society observes, and that if a leader respects them, things will go smoothly. Kind of like the “meta” in games: modern examples would be the employer/employee relationship, or in what situations you shake hands with someone.

  • xiao 孝 – filial responsibility. A son must respect his parents and take care of them in old age, and mourn them for three years after their death (since for the first three years after birth, a child is helpless without his parents).

  • haoxue 好学 – love of learning for the sake of learning

  • ren 仁 – authoritative conduct / benevolence / humanity. Basically, a leader should conduct himself in a responsible manner, fair yet firm.

  • dao 道 – the way. One should forge one’s path through life.

An obvious question is why we should listen to Confucius when there’s no appeal to a higher power (as in the Bible) and no attempt to axiomatize everything. I don’t really know, but many Chinese have studied this book and lived their lives according to its principles, so by studying it, we can better understand how Chinese people think.

I feel like the Analects tells us how an ideal Chinese person is “supposed” to think, but modern Chinese people are very much the opposite: generally materialistic, competitive, and prone to comparing themselves to the people around them. A friend said much of what is written here is “obvious” to any Chinese person — but then why don’t they actually follow it? I guess modern Chinese society is very unequal, and one must be competitive to rise to the top and prosper. So the cynical answer is that recent economic forces override the thousand-year-old philosophy: it remains the ideal, but falls apart when push comes to shove.

The Analects is a very thought-provoking book. It’s surprising how much of what Confucius said 2500 years ago is still true today. I probably missed a lot in my first pass through it, but this is a good starting point for further reading on Chinese philosophy and literature.

Pachinko by Min Jin Lee

Pachinko (National Book Award Finalist)

Pachinko is the name of a Japanese pinball game, in which you watch metal balls tumble through a machine. It’s also the name of this novel, which traces a Korean family in Japan through four generations (Yangjin/Hoonie/Hansu -> Sunja/Isak -> Noa/Mozasu -> Solomon/Phoebe). Sunja is the first generation to immigrate to Japan, during the 1930s, after being tricked by a rich man who got her pregnant. Afterwards, the family makes its livelihood in Japan, but they are always considered outsiders, despite being in the country for generations.

It’s surprising to see so much racism towards Koreans in Japan, since Canada is so multicultural and accepting of people from other places. Japan is very different: even after four generations in Japan, a Korean boy is still considered a guest and must register with the government every few years or risk deportation. The Koreans in Japan can’t work the same jobs as the Japanese, can’t legally rent property, and get bullied at school, so they end up working in pachinko parlors, which the Japanese consider “dirty”. All the Korean men (Mozasu, Noa, and Solomon) end up working in pachinko, hence the name of the book.

One thing that struck me was how so many of the characters valued idealism more than rationality. Yoseb doesn’t want his wife to go out to work because he considers it improper. Sunja and Noa don’t want to accept Hansu’s help because of shame, even though they could have benefitted a lot, materially. All the Christians have this sort of idealist irrationality, which I guess is part of being religious — only Hansu behaves in a way that makes sense to me. This book gets a bit slow in the end as there are too many minor characters, but is overall a thought provoking read about racism in Japanese society.

Visual Intelligence by Amy Herman

Visual Intelligence: Sharpen Your Perception, Change Your Life

This book uses art to teach you to notice your surroundings more, which is very interesting. The basic premise is there’s a lot of things that we miss, but can be quite important. The two biggest ideas in this book for me:

  1. Train yourself to be more visually perceptive by looking at art and trying to notice every detail. This sounds trivial, but we often miss things. Then do the same thing in the real world, and you’ll see your surroundings in a new way.

  2. Our experiences shape how we perceive things, so it’s important to describe things objectively rather than subjectively. Don’t make assumptions; describe only the facts of what you see. From a picture you can’t infer that a person is “homeless”, only that he’s “lying on a street next to a shopping cart”.

Memoirs of a Geisha by Arthur Golden

Memoirs of a Geisha (Vintage Contemporaries)

This novel tells the story of the geisha Sayuri, from her childhood until her death. It pretends to be a real memoir, but it was written by an American man. The facts are thoroughly researched, so we get a feel for what Kyoto was like before the war.

Essentially, society in Japan was very unequal: the women had to go through elaborate rituals and endure a lot of suffering to please the men, who simply had a lot of money. However, even without formal power, geishas like Mameha and Hatsumomo construct elaborate schemes of deceit and trickery.

The plot was exciting to read, but certain characters felt flat. Sayuri’s decades-long infatuation with the chairman doesn’t seem believable: maybe I would’ve had a crush like that as a teenager, but a woman in her late 20s should certainly know better. Hatsumomo’s degree of evilness didn’t seem convincing either.

Lastly, having read some novels by actual Japanese authors, this book feels nothing like them. Japanese literature is a lot more mellow, and the characters more reserved: certainly nobody would act in such an obviously evil manner. Japanese novels also typically have themes of loneliness and isolation and end with people committing suicide, which doesn’t happen in this novel either.