Understanding Distributed Computing

A gentle introduction

Renu Gehring
Published in Towards AI · 5 min read · Mar 14, 2024

Photo by charlesdeluvio on Unsplash

I am having coffee and a slice of pound cake one afternoon with my imaginary friend, Mr. Pound, in his charming and entirely fictitious bakery, “Pound Cakes and More”. Business has been great, my friend tells me, and he is thinking of expanding.

A note about my friend, Mr. Pound. In addition to being an astute businessman and an excellent confectioner, he is really into data science and technology. In his spare time, he analyzes data that he collects meticulously. I really like talking to him because, hey, who can turn down an offer of free pound cake with a side of cool techie talk?

Back to expanding “Pound Cakes and More”. Mr. Pound has one small oven and a mixer, and both are optimized to make four cakes at the same time. He believes that he can expand in two ways. Option (1), Go Big, is to purchase an XXL oven and mixer that will make 12 cakes concurrently. Option (2), Distributed Baking, is to purchase 3 additional pairs of small ovens and mixers, each pair making four cakes at a time. With either option, he will be able to make 12 additional cakes at the same time, but each option has its advantages and disadvantages.

With Go Big, Mr. Pound is worried about cake quality since he is not sure about the evenness of the temperature in the XXL oven. “Maybe the cakes in the corners will burn and the ones in the center will be under-done”, he frets. With Distributed Baking, he will need to carefully set baking times with four different ovens. He might, he muses, be able to hire his nephew to help.

Suddenly, Mr. Pound’s face lights up. “I think that I have figured out what distributed computing is”, he declares.

Photo by CDC on Unsplash

Mr. Pound continues excitedly. “It is like me trying to clean the two seating areas in my bakery. Instead of just me, I would hire two people and each one would clean one room. Then I would supervise their results. The supervision would be extra effort, but the cleaning would get done in about half the time”.

As I help myself to another melt-in-your-mouth slice of pound cake, I interject, “Mr. Pound, have you considered the task of sorting numbers? How might you sort three random numbers? And then one hundred numbers? And finally, a thousand numbers?”

Well used to my leading questions, Mr. Pound responds affably. “With three numbers, it is easy. I would simply do them in my head. All in-memory processing and no distribution required”, he says with a smile.

“With one hundred numbers, I would grab a pencil and a piece of paper. I would peruse my numbers to find the lowest, write it down, go through the remaining 99 numbers to find the next higher number, repeating this process until I had a new perfectly ordered list of one hundred numbers. Isn’t this similar to building a bigger computer and utilizing in-memory processing as well as temporary writing to disk?”, Mr. Pound says with a chuckle.
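For readers who like to see ideas in code, Mr. Pound's pencil-and-paper method is essentially what programmers call a selection sort. Here is a minimal Python sketch of it. The function name `selection_sort` and the sample numbers are my own illustration, not anything Mr. Pound wrote down.

```python
def selection_sort(numbers):
    """Return a new sorted list by repeatedly pulling out the smallest remaining number."""
    remaining = list(numbers)  # work on a copy so the original list is untouched
    ordered = []               # the fresh sheet of paper
    while remaining:
        lowest = min(remaining)    # peruse the numbers to find the lowest
        remaining.remove(lowest)   # cross it off the old list
        ordered.append(lowest)     # write it down on the new list
    return ordered


# Hypothetical sample data, chosen only for illustration.
sample = [42, 7, 19, 7, 3]
result = selection_sort(sample)
print(result)
```

Every pass rereads all the numbers that are left, which is why this gets tedious for one person as the list grows.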

“With a thousand numbers…I don’t think I want to do this by myself”, Mr. Pound says. He pauses, thinks for a bit, and resumes, “I would hire ten helpers. And give them one hundred numbers each. They order their own hundred numbers and hand in their work. So now I have ten sheets with one hundred sorted numbers.” Mr. Pound pauses again and looks thoughtful. He continues, “I look at the top number on each of the ten sheets and write down the smallest. I cross it off its sheet, look at the ten top numbers again, and write down the next smallest, repeating this until every sheet is used up. Tedious, but doable”, he declares.
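Mr. Pound's plan of ten helpers followed by a careful merge can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the names `split_into_chunks` and `distributed_sort` are hypothetical, the thousand numbers are randomly generated, and the "helpers" take turns on one computer instead of working at the same time on separate machines.

```python
import heapq
import random


def split_into_chunks(numbers, n_helpers):
    """Deal the numbers out into at most n_helpers sheets of roughly equal size."""
    if not numbers:
        return []
    size = -(-len(numbers) // n_helpers)  # ceiling division
    return [numbers[i:i + size] for i in range(0, len(numbers), size)]


def distributed_sort(numbers, n_helpers=10):
    """Sort by letting each helper sort one sheet, then merging the sorted sheets."""
    sheets = split_into_chunks(numbers, n_helpers)
    # Each helper sorts only its own sheet. Here the helpers run one after
    # another; in a real cluster they would work simultaneously.
    sorted_sheets = [sorted(sheet) for sheet in sheets]
    # The supervisor repeatedly takes the smallest of the top numbers
    # across all sheets, which is what heapq.merge does.
    return list(heapq.merge(*sorted_sheets))


random.seed(0)  # fixed seed so the made-up data is repeatable
numbers = [random.randint(0, 9999) for _ in range(1000)]
merged = distributed_sort(numbers, n_helpers=10)
print(merged[:5])
```

The merge step is the supervisor's extra effort: it never re-sorts anything, it only compares the current top number of each sheet.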

Having finished my second scrumptious slice, I chime in, “Mr. Pound, this is why distributed computing began. As data grew, so did computers. But data grew faster. So, Doug Cutting and his brilliant friends created a system of connecting computers to each other and dividing work across these connected computers. Now work was distributed among these computers (executors) but there was also a ‘master’ computer that collected intermediate results, did additional work, and then presented the complete task to the end user. Doug Cutting christened this new system Hadoop after his young son’s toy elephant,” I say smiling.

“This system proved itself quickly. In 2008, a Hadoop cluster of about 900 machines sorted one TB of data in roughly three and a half minutes, a job that would have kept any single computer of the day busy for hours on end!”, I say excitedly.

Because Mr. Pound looks a bit confused, I add, “One TB of data is close to 1,000,000,000,000 bytes. That is a one followed by twelve zeroes.”

“But the problem was that this early version of Hadoop was difficult to use for those not trained in computer science. Many new tools and languages popped up, many of them named for animals. There was Hive, Pig, Impala, even ZooKeeper”, I say, laughing.

“So did the old code work on the new distributed system?” Mr. Pound asks.

“No”, I reply. “Code had to be adapted. Many programming paradigms went out the window.”

Mr. Pound jumps in, “So if I wrote a loop, that would not distribute well because it works on one row at a time, right? And table indexing would not translate well because it assumes that the entire table is in one place.”
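Mr. Pound's point about loops can be sketched in plain Python. The example below is an illustration under my own assumptions: the sales rows are made up, the function names are hypothetical, and threads on one computer stand in for the separate machines of a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sales rows (cake flavor, cakes sold), invented for illustration.
rows = [("lemon", 4), ("vanilla", 7), ("lemon", 2), ("marble", 5),
        ("vanilla", 1), ("marble", 3), ("lemon", 6), ("vanilla", 2)]


def total_with_loop(rows):
    """Single-machine style: visit one row at a time and update a running total."""
    total = 0
    for _, sold in rows:
        total += sold
    return total


def partial_total(partition):
    """Work one executor can do on its own slice without seeing the other slices."""
    return sum(sold for _, sold in partition)


def total_distributed(rows, n_partitions=4):
    """Split the rows, total each partition independently, then combine the results."""
    # Deal the rows into partitions round-robin.
    partitions = [rows[i::n_partitions] for i in range(n_partitions)]
    # Threads stand in for executors here; a real cluster would use other machines.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(partial_total, partitions))
    # The master combines the intermediate results.
    return sum(partials)


loop_total = total_with_loop(rows)
dist_total = total_distributed(rows)
print(loop_total, dist_total)
```

The loop depends on a single running total that is updated row by row, so it cannot be handed out as it stands. The second version works because each partition's subtotal can be computed without knowing anything about the others. A lookup such as `rows[3]` has the same weakness as the loop: it assumes all the rows sit together in one ordered place.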

“Yes, that is right”, I say, adding, “But Hadoop evolved to become easier to use. And a new system called Spark was created. Spark integrates well with Hadoop, plays nicely with different types of hardware, and is generally easy to use. It has become dominant today. And now there are tools like PySpark that work much better with distributed data than, say, Pandas.”

Mr. Pound interjects, “So I cannot use the Pandas iloc construct?” Knowing that Mr. Pound loves Pandas for analyzing his data, I assure him, “Not in quite the same way, but PySpark is easy to pick up. Hey, you can even use Generative AI to help convert your Pandas code to PySpark.”

I continue, “Distributed computing made advances in Generative AI possible. Well, that and the Transformer models, which process all the words in a passage in parallel instead of one at a time, so their training can be spread across many processors at once.”

Mr. Pound smiles and gets up. “I think I have made up my mind. I am going with Distributed Baking”, he says. I get up, too, knowing that my friend wants to go home, cook dinner, and await his wife, who is expecting their first child. We say our goodbyes and as I leave, Mr. Pound calls out laughing, “Hey, what do you think the chances are of us figuring out how to grow a baby in one month with 9 people?”

Some things, I think, are best left undistributed.

Data science consultant, instructor, and author. Bringing passion, expertise, and experience to data and AI-driven solutions.