Computer Science Papers Every Developer…

Dr Milan Milanović

Jan 16

The foundations of modern software engineering were built on some high-impact research papers.

9 Comments

Alejandro Avella

Feb 7 (edited)

Hi Milan, I think you should add Google’s PageRank paper by Larry Page and Sergey Brin, written while they were PhD students at Stanford. The paper is available online; just Google it.

Thank you, Alejandro, yes, I will check it.

I think this should be your source of truth: https://research.google/pubs/the-anatomy-of-a-large-scale-hypertextual-web-search-engine/

If you search "page rank paper eric" you will get some slides by Eric Roberts for his CN 54 course at Stanford University. I recommend reading these slides before reading the paper from Larry and Sergey.

The Wikipedia page on PageRank has great content: https://en.wikipedia.org/wiki/PageRank

The Academic Paper That Started Google

https://www.sciencedirect.com/science/article/pii/S016975529800110X

In 1995, Larry Page met Sergey Brin. At the time, both were Ph.D. students at Stanford University. The two began collaborating on a research project nicknamed “BackRub”, with the goal of converting a web page’s backlink data into a measure of its importance. Without knowing it at the time, Page and Brin had developed the PageRank algorithm, which became the original Google Search algorithm. By early 1998, Google had indexed around 24 million web pages. While the home page still marked the project as “BETA”, an article in Salon.com argued that Google’s results were far better than those of the leading search engines of the time, such as Hotbot, Excite.com, and Yahoo!. In April of 1998, Page and Brin published a research paper on the topic titled The Anatomy of a Large-Scale Hypertextual Web Search Engine.

In the introduction, Page and Brin note the problems with the popular search engines of the time. For example, while Yahoo! had built a powerful index, search engines that rely on keyword matching return “too many low-quality matches”. Advertisers could also exploit keyword matching to gain people’s attention, for instance by populating web pages with “invisible” keywords. According to Page and Brin, Google is different. Its main goal is to “improve the quality of web search engines” by making use of both link structure and anchor text. However, Page and Brin recognize that building a search engine that scales to the web of 1998 is difficult, because fast web-crawling technology is needed to collect the pages and keep them up to date. In addition, storage must be used efficiently to hold and index the pages, and “queries must be handled quickly, at a rate of hundreds to thousands per second”.

Page and Brin coin “PageRank” as an objective measure of a page’s citation importance that corresponds well with people’s subjective idea of importance. It uses the link structure of the Web to calculate a quality ranking for each web page. The founders define it as follows: given that page A has pages T1…Tn that point to it (its citations), that C(A) is the number of links going out of page A, and that d is a damping factor between 0 and 1 (usually set to 0.85), the PageRank of page A is:

PR(A) = (1 – d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
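Because each page’s PageRank depends on the PageRank of the pages linking to it, the formula is usually evaluated by repeated substitution until the values settle. The following is a minimal sketch of that idea in Python, not the paper’s implementation. The function name `pagerank`, the `links` dictionary format, the starting value of 1.0, and the fixed iteration count are all assumptions made for illustration. Pages with no outgoing links simply contribute nothing here, which a real system would need to handle more carefully.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over all pages T linking to A.

    `links` maps each page to the list of pages it links to (hypothetical format).
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # C(T): number of links going out of each page.
    out_count = {page: len(links.get(page, [])) for page in pages}

    # For each page A, the pages T1..Tn that point to it.
    inbound = {page: [] for page in pages}
    for source, targets in links.items():
        for target in targets:
            inbound[target].append(source)

    # Start every page at 1.0 (an arbitrary starting guess) and iterate.
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        pr = {
            page: (1 - d) + d * sum(pr[t] / out_count[t] for t in inbound[page])
            for page in pages
        }
    return pr


if __name__ == "__main__":
    # Tiny hypothetical web: A links to B and C, B links to C, C links back to A.
    example = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, score in sorted(pagerank(example).items()):
        print(page, round(score, 3))
```

In this toy graph, C ends up ranked above B because it receives B’s whole vote on top of the same half-share of A’s vote that B gets.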

At the time, the algorithm had been applied to Google’s database of around 24 million web pages, available at http://google.stanford.edu. The full raw HTML of all the pages was also available in its repository.

The rest of the paper provides a low-level view of Google’s system architecture and its performance, including its code stack, storage process, data structures, and search algorithm. The full paper can be viewed at http://infolab.stanford.edu/~backrub/google.html. It is around a 15-minute read, and I highly recommend it to anyone with an interest in databases who wants to understand the original algorithm that led to Google becoming one of the biggest tech companies in the world.

The concept of PageRank taught in INFO 2040 represents the calculation that Page and Brin developed at a very high level and in a visually intuitive way. It allows us to understand how components of the web interact with each other, and how PageRank resembles the way people judge importance by deferring to authorities. Google succeeded because, in 1998, Page and Brin understood networks long before other tech companies did. The academic paper that they published reflects it.

Reference:

https://blogs.cornell.edu/info2040/2019/10/28/the-academic-paper-that-started-google/

Clayton Northrup

Jan 24

Wow! This is great. I’ve been recently adding papers to my list of reading materials but obviously the reservoir is massive. Thanks for compiling such a diverse and important list 🙌🏻

Sridaran Thoniyil

Jan 16

Thanks for the list Milan! I was pleasantly surprised to see how old some of these papers were; it’s really interesting to see how far back fields like functional programming and distributed systems go, and how prevalent this research remains today!

Exactly that!

Tech World With Milan Newsletter
