
The Internship That Started It All
This pipeline was pretty unstable at first. But it’s still running.
My internship started in 2017 and ended in 2019. It was two years full of learning — and a time when I really began to understand what working with technology means.
But in those first days, I did the kind of work every intern does: lots of manual tasks.
To give you some context, the company was a startup focused on a cashback system. Users could buy items in registered stores and receive a percentage back in their wallet app.
I joined the Data Science team, whose first big project was building a recommendation system. The goal was to suggest stores to users based on buying patterns similar to other users.
So what did I do at first? I manually categorized stores. I had to look up the store’s address, website, phone number, and a bunch of other information on the internet.
In the end, I categorized more than 2,000 stores by hand — which was absolutely crazy… and honestly, really boring.
The exciting part of the job — writing code, making queries, setting up data ingestion, and training machine learning models — was all being handled by the Data Engineer and Data Scientist.
In the beginning, our team was small: just the manager, one DE, one DS, and me. Plenty of challenges for such a tiny group.
After I finished the manual categorization work, my manager encouraged me to dive deeper into Data Science. The first book he suggested was An Introduction to Statistical Learning. So I started reading it — but at that time, I was completely confused by the advanced math, statistics, and how to actually apply those concepts with Python or R.
I tried to understand it for weeks, but eventually I went back to my manager and admitted that Data Science just wasn’t for me, at least not then. So he decided to have me work more closely with the Data Engineer and learn what he was doing.
That’s when I really started getting my hands dirty: learning how to query real data, how to build ETLs, the tools we needed to use, and what a Data Engineer actually does day to day.
The Data Engineer on our team turned out to be one of the biggest mentors in my life — at least, that’s how I see it. He taught me the fundamentals of data engineering, was incredibly patient, and would explain things over and over until I truly understood. Today, he’s also one of my best friends, and we still keep in touch.
Now, I want to share some of my biggest challenges in those early data engineering days.
The first project was building the frontend to interact with the recommendation system, which we called Maya. The only requirement was to use React. It was my first time coding for a real product — and my first time with frontend work at all.
Honestly, it was pretty funny looking back. I didn’t even know how to get started with React. At one point, I opened the browser developer tools, copied HTML from other websites, and tried to tweak it into what I needed for Maya. Eventually, the Data Engineer stepped in and showed me how to actually start the project and what the first steps should be. In the end, I managed to develop all the requirements and delivered my first project.
The second challenge was building the frontend for another system called Atlas, where we had to plot all the stores and users on a map, clustering them by buying patterns. This also used React, but the real challenge was figuring out how to render such a large volume of data on a map without tanking performance.
The third — and probably the biggest — challenge was migrating a huge ETL job that consumed data from our sales table. This query was so heavy it almost brought down the database — and yes, I was the one who originally wrote it.
To fix it, I decided to use AWS EMR since we were already on AWS for other services. Learning Spark was tough: understanding how it worked internally, how distributed computing communicates, and how to actually structure the process for our needs was a big leap.
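To give you an idea of what that process looked like, here’s a minimal PySpark sketch of that kind of aggregation job. The table layout, column names, and S3 paths are hypothetical, since the original query isn’t part of this post:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-etl").getOrCreate()

# Read raw sales data exported to S3, instead of hammering the
# production database with one giant query.
sales = spark.read.parquet("s3://my-bucket/raw/sales/")  # hypothetical path

# The kind of heavy GROUP BY that overwhelmed the database,
# now running distributed across the EMR cluster.
summary = (
    sales
    .groupBy("user_id", "store_id")
    .agg(
        F.count("*").alias("purchases"),
        F.sum("amount").alias("total_spent"),
    )
)

summary.write.mode("overwrite").parquet("s3://my-bucket/curated/sales_summary/")
spark.stop()
```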
After about a month full of errors and small victories, I finally figured out how to submit jobs properly to EMR. I also learned how to integrate it with Apache Airflow so we could schedule and trigger the ETL in production.
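In case you’re curious, the wiring looks roughly like this with a recent version of Airflow and its Amazon provider package. The cluster ID, script path, and schedule below are placeholders, not our original code (which ran on a much older Airflow):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# An EMR "step" that runs spark-submit against a script stored on S3.
SPARK_STEPS = [
    {
        "Name": "sales-etl",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://my-bucket/jobs/sales_etl.py",  # hypothetical script
            ],
        },
    }
]

with DAG(
    dag_id="sales_etl",
    start_date=datetime(2019, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Submit the Spark step to an already-running EMR cluster.
    submit = EmrAddStepsOperator(
        task_id="submit_spark_job",
        job_flow_id="j-XXXXXXXXXXXX",  # hypothetical cluster ID
        steps=SPARK_STEPS,
    )

    # Block until EMR reports the step has finished.
    wait = EmrStepSensor(
        task_id="wait_for_spark_job",
        job_flow_id="j-XXXXXXXXXXXX",
        step_id="{{ task_instance.xcom_pull(task_ids='submit_spark_job')[0] }}",
    )

    submit >> wait
```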
Honestly, I probably learned more from chasing down Spark errors than from half the theory I ever studied.
That project was by far the biggest milestone in my early career. It cut a job that used to run for 8 hours down to just 20 minutes, which made a huge difference for my team and the company.
That project changed everything. It opened doors to new opportunities, pushed me deeper into the data world, and left me wondering what else I could build next.
But that’s a story for another post.