Data Science Projects – Data Analysis

Project: Waze User Churn Analysis

This project explores user behavior and churn patterns for the Waze navigation app. The goal was to identify factors associated with users leaving the platform, helping inform retention strategies. Using a real dataset of 14,999 users, I conducted an exploratory data analysis (EDA) to uncover insights about engagement, device usage, and driving patterns.

Key Steps – Early Phases

  • Data Cleaning & Missing Values: The churn label was missing for 700 of the 14,999 users. No other column had significant missing values, and the distribution of the missing labels was examined to check for potential bias.
  • Descriptive Analysis: I summarized usage metrics, including sessions, drives, total kilometers driven, and duration per drive. Comparisons between retained and churned users highlighted key differences.
  • Device Analysis: Users were split between Android and iPhone devices. Interestingly, device type did not significantly influence churn rates.
  • Behavioral Metrics: Median kilometers per drive and per driving day were calculated, revealing that retained users drove slightly more per drive and per driving day than churned users. The median number of drives per driving day was compared between the two groups as a further measure of engagement.
  • Visualization & Summary: The results were presented through clear tables and visualizations to illustrate trends, supporting actionable insights for product decisions.
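The checks described in the steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (label, device, drives, driven_km_drives, driving_days) are assumptions, and the rows are toy values made up for the example.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT the project's data.
# All column names here are assumptions about the dataset's schema.
df = pd.DataFrame({
    "label": ["retained", "churned", None, "retained", "churned", None],
    "device": ["Android", "iPhone", "iPhone", "iPhone", "Android", "Android"],
    "drives": [10, 4, 8, 20, 5, 12],
    "driven_km_drives": [500.0, 160.0, 320.0, 1200.0, 150.0, 600.0],
    "driving_days": [5, 2, 4, 10, 1, 6],
})

# Device share among rows with and without a churn label.
# Similar shares in both rows would suggest labels are not missing by device.
label_status = df["label"].isna().map({True: "missing", False: "labeled"})
missing_by_device = (
    df.assign(label_status=label_status)
      .groupby("label_status")["device"]
      .value_counts(normalize=True)
      .unstack(fill_value=0.0)
)

# Per-user behavioral metrics, computed on labeled rows only.
# The toy rows have no zero drives or driving days, so plain division is safe here.
labeled = df.dropna(subset=["label"]).copy()
labeled["km_per_drive"] = labeled["driven_km_drives"] / labeled["drives"]
labeled["drives_per_driving_day"] = labeled["drives"] / labeled["driving_days"]

# Medians per group (retained vs. churned)
medians = labeled.groupby("label")[["km_per_drive", "drives_per_driving_day"]].median()

# Device share within each label group
device_by_label = pd.crosstab(labeled["label"], labeled["device"], normalize="index")

print(missing_by_device)
print(medians)
print(device_by_label)
```

Medians are used rather than means because usage metrics of this kind tend to be right-skewed, so a few very heavy users would otherwise dominate the comparison.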

Key Insights

  • Retained users exhibited higher engagement, slightly more kilometers per drive, and a higher median number of drives per driving day.
  • Device-type proportions were similar among churned and retained users, suggesting that Android vs. iPhone did not drive differences in retention.
  • Missing labels appeared randomly across devices, with no strong pattern to suggest a data bias.
  • The analysis highlighted the “super-drivers” segment, prompting further investigation into their unique needs.

Outcome & Next Steps

This analysis provides a foundation for developing predictive models to identify at-risk users. Next steps include deeper feature engineering, additional data collection on high-frequency users, and implementing machine learning models to forecast churn probabilities.

Tools Used: Python (pandas, NumPy), Jupyter Notebook, HTML export for reporting.

Kaggle Link

GitHub Link

Waze EDA and Data Viz Phase

The Waze User Churn Analysis project focuses on understanding and predicting user churn for the Waze navigation app. Churned users are those who stop using or uninstall the app within a given period. The goal of this project is to identify patterns and factors that contribute to churn, providing actionable insights to improve user retention and guide product development decisions.

The dataset contains user-level metrics such as the number of sessions in the last month, total sessions since onboarding, number of drives, distance driven, duration of drives, activity days, driving days, device type, and a churn label indicating whether a user was retained or churned. The data spans a wide range of user tenures, from new users to long-term users onboarded several years ago.

The project began with data cleaning and preparation. Missing values were addressed, particularly in the churn label, where missing entries were filled with ‘unknown’ so that no rows were dropped. Variables were checked for correct data types, and care was taken to avoid division-by-zero errors when calculating derived metrics such as kilometers per driving day. The numeric variables had right-skewed distributions with outliers, which were handled by capping values at the 95th percentile, preserving the majority of the distribution while reducing the influence of extreme values.
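The preparation steps above might look roughly like this in pandas. It is a sketch under stated assumptions: the column names and the helper cap_at_percentile are hypothetical, the rows are toy values, and setting the ratio to 0 for users with zero driving days is one possible choice rather than the notebook's confirmed behavior.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; column names are assumptions.
df = pd.DataFrame({
    "label": ["retained", None, "churned", "retained"],
    "driven_km_drives": [100.0, 250.0, 90.0, 4000.0],
    "driving_days": [4, 0, 3, 10],
})

# Fill missing churn labels with 'unknown' instead of dropping the rows
df["label"] = df["label"].fillna("unknown")

# Guard against division by zero: zero driving days become NaN before dividing,
# so the result is NaN rather than inf. Here NaN is then set to 0 (an assumption).
df["km_per_driving_day"] = (
    df["driven_km_drives"] / df["driving_days"].replace(0, np.nan)
).fillna(0.0)


def cap_at_percentile(series: pd.Series, q: float = 0.95) -> pd.Series:
    """Hypothetical helper: cap values above the q-th quantile at that quantile."""
    threshold = series.quantile(q)
    return series.clip(upper=threshold)


df["driven_km_drives_capped"] = cap_at_percentile(df["driven_km_drives"])
print(df)
```

Capping keeps every row in the dataset, unlike dropping outliers, which matters when the extreme users are themselves a segment of interest.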

Next, exploratory data analysis (EDA) was performed to better understand the dataset and the relationships between variables. Key visualizations included:

  • Box plots and histograms for numeric variables such as sessions, drives, total sessions, kilometers driven, duration of drives, activity days, and driving days, highlighting skewness and outliers.
  • Pie charts for categorical variables like device type and churn label to show overall distributions.
  • Stacked histograms showing the proportion of churned versus retained users across variables such as driving days, kilometers per driving day, and percent of sessions in the last month.
  • Scatter plots to examine relationships between driving days and activity days.
  • Correlation heatmaps to identify numeric features most associated with churn.
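The numbers behind two of these charts can be computed before any plotting. The sketch below, using toy rows and assumed column names, derives the churned/retained share at each driving_days value (what a proportion-stacked histogram displays) and a correlation matrix that includes a 0/1 churn indicator (the input to a heatmap).

```python
import pandas as pd

# Toy rows for illustration only; column names are assumptions.
df = pd.DataFrame({
    "label": ["churned", "churned", "retained", "retained", "retained", "churned"],
    "driving_days": [0, 0, 10, 20, 20, 10],
    "activity_days": [1, 0, 12, 22, 25, 11],
})

# Row-normalized counts: share of churned vs. retained users per driving_days value
churn_share = pd.crosstab(df["driving_days"], df["label"], normalize="index")

# Encode the label as 0/1 so it can take part in a Pearson correlation matrix
numeric = df.assign(churned=(df["label"] == "churned").astype(int)).drop(columns="label")
corr = numeric.corr(method="pearson")

print(churn_share)
print(corr["churned"].sort_values())
```

A plotting library can then render churn_share as stacked bars and corr as a heatmap. Pearson correlation only captures linear association, so a weak coefficient does not rule out a relationship with churn.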

The outcome of this project is a Python notebook containing clean, well-documented code for data processing, EDA, and visualizations. Derived columns like km_per_driving_day and percent_sessions_in_last_month provide further insights into user behavior. All visualizations are labeled, color-coded, and easy to interpret, making them suitable for presenting findings to stakeholders.

Overall, this project provides a comprehensive understanding of Waze user behavior, identifies factors contributing to churn, and generates actionable visual insights that can guide targeted retention strategies.

GitHub Link


Project 2: Unicorn Companies Analysis

In this project, I explored a dataset of unicorn companies to analyze patterns in industry growth, time to achieve unicorn status, and valuation trends. The goal was to uncover insights into how different industries foster high-growth startups and to identify companies that achieved exceptional milestones.

Using Python and pandas, I first sampled 50 unicorn companies from the dataset to create a manageable subset for visualization and analysis. I then converted key columns into formats suitable for calculation. The year was taken from “Date Joined” so that the time each company took to reach unicorn status could be calculated by subtracting the year founded from the year joined. Valuation strings like “$1.2B” or “$800M” were converted into numeric values to compare maximum valuations across industries.
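A minimal sketch of these conversions is shown below. The helper parse_valuation, the column names “Year Founded”, “Years To Unicorn” and “Valuation USD”, the ISO date format, and the company rows are all assumptions made for illustration, not taken from the project's code or data.

```python
import pandas as pd


def parse_valuation(text: str) -> float:
    """Hypothetical helper: convert strings such as "$1.2B" or "$800M" to dollars."""
    multipliers = {"B": 1e9, "M": 1e6}
    cleaned = text.strip().lstrip("$")
    suffix = cleaned[-1].upper()
    if suffix not in multipliers:
        raise ValueError(f"Unrecognized valuation format: {text!r}")
    return float(cleaned[:-1]) * multipliers[suffix]


# Toy rows for illustration only; company names and values are made up.
companies = pd.DataFrame({
    "Company": ["Alpha", "Beta", "Gamma"],
    "Industry": ["Fintech", "Health", "Fintech"],
    "Valuation": ["$1.2B", "$800M", "$3B"],
    "Date Joined": ["2021-03-15", "2019-11-02", "2020-07-30"],
    "Year Founded": [2015, 2012, 2018],
})

# Reproducible random subset; capped at the table size so the toy data still works
sample = companies.sample(n=min(50, len(companies)), random_state=42).copy()

# Parse the date, then subtract the founding year from the year joined
sample["Date Joined"] = pd.to_datetime(sample["Date Joined"])
sample["Years To Unicorn"] = sample["Date Joined"].dt.year - sample["Year Founded"]

# Numeric valuation in dollars
sample["Valuation USD"] = sample["Valuation"].apply(parse_valuation)

print(sample.sort_values("Company"))
```

Fixing random_state makes the sample repeatable, so charts built from it do not change between runs.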

Data visualization played a crucial role in this analysis. I used matplotlib.pyplot to generate bar charts showing both the longest time to unicorn status per industry and the maximum company valuations by industry. This approach helped highlight patterns, such as industries where startups tend to scale faster or achieve higher valuations. Grouping data by industry provided a clear overview while preserving individual company insights for deeper analysis.
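The per-industry aggregation and bar charts could be sketched as follows. The column names, the function plot_by_industry, the output file name, and the rows are hypothetical; the plotting function is defined but not called here, so the aggregation can be run on its own.

```python
import pandas as pd

# Toy rows for illustration only; column names and values are assumptions.
companies = pd.DataFrame({
    "Industry": ["Fintech", "Fintech", "Health", "Health", "AI"],
    "Years To Unicorn": [6, 9, 4, 3, 2],
    "Valuation USD": [1.2e9, 4.0e9, 2.0e9, 1.0e9, 3.0e9],
})

# One row per industry: longest time to unicorn status and largest valuation
by_industry = (
    companies.groupby("Industry")
    .agg(
        longest_years=("Years To Unicorn", "max"),
        max_valuation=("Valuation USD", "max"),
    )
    .sort_values("longest_years", ascending=False)
)


def plot_by_industry(summary: pd.DataFrame, path: str = "industry_summary.png") -> None:
    """Hypothetical helper: save two side-by-side bar charts of the summary table."""
    # Imported lazily so the aggregation above does not require matplotlib
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend; writes to file only
    import matplotlib.pyplot as plt

    fig, (ax_years, ax_val) = plt.subplots(1, 2, figsize=(10, 4))
    ax_years.bar(summary.index, summary["longest_years"])
    ax_years.set_title("Longest time to unicorn status")
    ax_years.set_ylabel("Years")
    ax_val.bar(summary.index, summary["max_valuation"] / 1e9)
    ax_val.set_title("Maximum valuation")
    ax_val.set_ylabel("USD (billions)")
    for ax in (ax_years, ax_val):
        ax.tick_params(axis="x", rotation=45)
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)


print(by_industry)
```

Calling plot_by_industry(by_industry) would write the two charts to a PNG file. Sorting before plotting keeps the bars in a meaningful order rather than alphabetical.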

The project demonstrates skills in data cleaning, transformation, exploratory data analysis, and visualization, as well as the ability to communicate findings clearly through visual storytelling. It is particularly relevant for stakeholders interested in startup trends, venture capital, and growth strategy.

The full code, dataset, and visualizations for this project are available on my GitHub and Kaggle repositories for reference and replication.

This analysis highlights my ability to turn raw data into actionable insights, making it a strong example of my data analytics and visualization capabilities for portfolio presentation.

NOAA Lightning Data Analysis

This project explores NOAA lightning strike data using Python to perform exploratory data analysis (EDA) and time-based visualization. The primary objective is to examine temporal patterns in lightning activity and demonstrate practical data preprocessing techniques used in real-world analytical workflows.

A key component of the analysis involves converting the dataset’s date column into a proper datetime format using pandas. This transformation enables accurate time-series plotting, filtering, and aggregation. Once converted, the data can be grouped by day, month, quarter, or year to uncover trends and seasonal patterns in lightning occurrences.
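As a rough illustration of this conversion and grouping, the sketch below uses a handful of made-up rows. The column names date and number_of_strikes are assumptions about the dataset's schema.

```python
import pandas as pd

# Toy rows for illustration only; column names are assumptions.
strikes = pd.DataFrame({
    "date": ["2018-01-03", "2018-01-20", "2018-04-11", "2018-08-05", "2019-08-09"],
    "number_of_strikes": [10, 5, 20, 40, 30],
})

# Convert the string column to datetime64 so the .dt accessor becomes available
strikes["date"] = pd.to_datetime(strikes["date"])

# Break the timestamp into analyzable components
strikes["year"] = strikes["date"].dt.year
strikes["month"] = strikes["date"].dt.month
strikes["quarter"] = strikes["date"].dt.to_period("Q").astype(str)  # e.g. "2018Q1"

# Aggregate strike counts at different time granularities
monthly = strikes.groupby(["year", "month"])["number_of_strikes"].sum()
quarterly = strikes.groupby("quarter")["number_of_strikes"].sum()
yearly = strikes.groupby("year")["number_of_strikes"].sum()

print(monthly)
print(quarterly)
print(yearly)
```

Without the conversion, the dates would sort and group as plain strings, and components such as month or quarter could not be extracted directly.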

The project walks through the full analytical pipeline: loading and inspecting the dataset, cleaning and preparing variables, transforming date fields, aggregating strike counts, and generating visualizations to support interpretation. By restructuring the time variable into analyzable components, the analysis highlights how datetime conversion is essential for meaningful temporal insights.

Using pandas for data manipulation and matplotlib for visualization, the project demonstrates how structured aggregation and plotting can reveal fluctuations in lightning frequency across months and years. The results help identify recurring seasonal peaks and variations in activity over time.
