{"id":406,"date":"2024-06-03T18:45:30","date_gmt":"2024-06-03T18:45:30","guid":{"rendered":"https:\/\/datascienceproject.eu\/blog\/?p=406"},"modified":"2024-06-10T06:47:06","modified_gmt":"2024-06-10T06:47:06","slug":"titanic","status":"publish","type":"post","link":"https:\/\/datascienceproject.eu\/blog\/titanic\/","title":{"rendered":"How to do the Titanic Kaggle competition"},"content":{"rendered":"\n<p>Imagine it&#8217;s a cold April night in 1912, and the RMS Titanic, a marvel of human engineering, is embarking on its maiden voyage from Southampton to New York City. Aboard are over 2,200 passengers and crew members, representing a cross-section of early 20th-century society. From the wealthiest industrial magnates to humble emigrants dreaming of a new life in America, the Titanic is a floating microcosm of the world.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"600\" src=\"https:\/\/datascienceproject.eu\/blog\/wp-content\/uploads\/2024\/06\/BlogDSP_Titanic_-800_600.jpg\" alt=\"\" class=\"wp-image-402\" srcset=\"https:\/\/datascienceproject.eu\/blog\/wp-content\/uploads\/2024\/06\/BlogDSP_Titanic_-800_600.jpg 800w, https:\/\/datascienceproject.eu\/blog\/wp-content\/uploads\/2024\/06\/BlogDSP_Titanic_-800_600-300x225.jpg 300w, https:\/\/datascienceproject.eu\/blog\/wp-content\/uploads\/2024\/06\/BlogDSP_Titanic_-800_600-768x576.jpg 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/figure>\n\n\n\n<p>But as the ship sails through the icy waters of the North Atlantic, it strikes an iceberg, leading to one of the most infamous maritime disasters in history. The tragedy claimed the lives of more than 1,500 people, and in its wake, it left a poignant question: Who had the best chance of survival?<\/p>\n\n\n\n<p>Fast forward to the present day. You are a data scientist, part historian and part detective, tasked with solving this century-old mystery. 
Your tools are not ropes and pulleys but algorithms and data. You have been given access to a treasure trove of information about the Titanic&#8217;s passengers: their ages, genders, ticket classes, and more. <\/p>\n\n\n\n<div style=\"background-color: #f9f9f9; border-left: 5px solid #A4B2D9; padding: 10px 20px; margin: 20px 0; font-size: 1.2em; font-weight: bold; color: #333; box-shadow: 0 0 10px rgba(0, 0, 0, 0.1); border-radius: 5px;\">\n    Your mission is to build a predictive model that can answer a vital question: Given the characteristics of a passenger, can we predict whether they <span style=\"color: #EFAB7C;\">survived<\/span> or <span style=\"color: #EFAB7C;\">perished<\/span> in the disaster?\n<\/div>\n\n\n\n<p>This is not just a data science problem; it&#8217;s a journey through time. It&#8217;s about understanding the human stories behind the numbers, recognizing patterns of survival, and uncovering insights that were lost to the icy depths. As you delve into this project, you&#8217;ll be stepping into the shoes of a historical detective, using the power of modern technology to shed new light on an old tragedy.<\/p>\n\n\n\n<p>Are you ready to embark on this journey? Let&#8217;s set sail into the world of data, where each row tells a story, and each prediction brings us closer to understanding the human drama of the Titanic. Together, we&#8217;ll uncover the secrets of survival, one data point at a time.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Collecting Data as a Data Scientist<\/h2>\n\n\n\n<p>As a data scientist, your journey begins with gathering the essential pieces of your puzzle: the data. In the realm of data science, the quality and relevance of your data are paramount. Here\u2019s an overview of the methods data scientists commonly use to collect data:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1. 
<strong>Open Data Repositories<\/strong><\/h3>\n\n\n\n<p>One of the richest sources of data is <strong><span style=\"color: #EFAB7C;\">open data repositories<\/span><\/strong>. These are databases where organizations, governments, and research institutions publish datasets for public use. \nWebsites like <a href=\"https:\/\/www.kaggle.com\/\">Kaggle<\/a>, <a href=\"http:\/\/archive.ics.uci.edu\/ml\/index.php\">UCI Machine Learning Repository<\/a>, and <a href=\"https:\/\/www.data.gov\/\">data.gov<\/a> are treasure troves for data scientists.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Example<\/strong>: <a href=\"https:\/\/www.kaggle.com\/competitions\/titanic\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">The Titanic Kaggle competition<\/a> provides a curated dataset, eliminating the need for initial data collection and allowing you to focus on analysis and modeling.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2. <strong>APIs (Application Programming Interfaces)<\/strong><\/h3>\n\n\n\n<p>APIs allow data scientists to fetch data programmatically from various services. Many companies and organizations offer APIs to access their data. For instance, social media platforms like Twitter and Facebook provide APIs for fetching data on user posts, likes, and shares.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Example<\/strong>: If you were to enhance your Titanic analysis with weather data on the night of the sinking, you might use a historical weather API to gather this information.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3. <strong>Web Scraping<\/strong><\/h3>\n\n\n\n<p>When data is not readily available through an API, web scraping can be an effective technique. This involves writing scripts to extract data from websites. 
Tools like BeautifulSoup and Scrapy in Python make web scraping manageable.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Example<\/strong>: If additional Titanic passenger lists or articles were only available on certain websites, you could scrape this information to enrich your dataset.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4. <strong>Databases<\/strong><\/h3>\n\n\n\n<p>Data scientists often access data stored in databases. This involves querying databases using SQL (Structured Query Language) to retrieve the necessary data. Companies usually store their operational data in databases, making this method essential for many business applications.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Example<\/strong>: If you were working within a company that had historical maritime records, you could query their database to get data on passenger manifests and ship logs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5. <strong>Surveys and Sensors<\/strong><\/h3>\n\n\n\n<p>For primary data collection, surveys and sensors are invaluable. Surveys allow you to collect data directly from individuals, while sensors can collect environmental data. This method is particularly useful when existing data is not available or needs to be specific to a particular research question.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Example<\/strong>: If you were conducting a study on passenger experiences, you might create a survey to gather firsthand accounts from Titanic historians or descendants of passengers.<\/li>\n<\/ul>\n\n\n\n<div style=\"background-color: #f9f9f9; border-left: 5px solid #A4B2D9; padding: 10px 20px; margin: 20px 0; font-size: 1.2em; font-weight: bold; color: #333; box-shadow: 0 0 10px rgba(0, 0, 0, 0.1); border-radius: 5px;\">\n    Sensors are devices that detect and respond to physical input, converting it into electrical signals. 
They come in various types, including environmental, biometric, motion, proximity, and acceleration sensors. Applications of sensors range from environmental monitoring and healthcare to smart cities and manufacturing.\n<\/div>\n\n\n\n<h3 class=\"wp-block-heading\">6. <strong>Collaboration and Partnerships<\/strong><\/h3>\n\n\n\n<p>Collaborating with other organizations or institutions can provide access to data that is not publicly available. Partnerships can be particularly useful in academic and industrial research.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Example<\/strong>: Partnering with a maritime museum could grant you access to rare artifacts and documents related to the Titanic, providing unique insights into the passengers&#8217; stories.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">The Titanic Dataset: Ready for Exploration<\/h2>\n\n\n\n<p>In the case of the Titanic Kaggle competition, the data is conveniently curated and ready for us to explore. This saves us significant time and effort.<\/p>\n\n\n\n<p>To dive into the Titanic dataset and begin your data science journey, follow these steps using the Kaggle Notebook environment:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Sign Up on Kaggle<\/strong>: If you haven\u2019t already, create an account on <a href=\"https:\/\/www.kaggle.com\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Kaggle<\/a>.<\/li>\n\n\n\n<li><strong>Find the Titanic Competition<\/strong>: Navigate to the <a href=\"https:\/\/www.kaggle.com\/competitions\/titanic\" data-type=\"link\" data-id=\"https:\/\/www.kaggle.com\/competitions\/titanic\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Titanic: Machine Learning from Disaster<\/a> competition page.<\/li>\n\n\n\n<li><strong>Fork the Kaggle Notebook<\/strong>:\n<ul class=\"wp-block-list\">\n<li>On the competition page, go to the \u201cNotebooks\u201d tab.<\/li>\n\n\n\n<li>Find a relevant notebook or start a new one by clicking on &#8220;New 
Notebook.&#8221; This will open the Kaggle Notebook environment where you can run your code directly without needing to download anything to your local computer.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Load the Dataset in Kaggle Notebook<\/strong>. \n<ul class=\"wp-block-list\">\n<li>The Titanic datasets (<code>train.csv<\/code> and <code>test.csv<\/code>) are readily accessible within the Kaggle environment. <\/li>\n\n\n\n<li>Use Python and Pandas to load and explore the data directly in the Kaggle Notebook.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n\n\n\n<p>By performing these steps, you have successfully collected your data, transforming raw historical records into a structured dataset ready for analysis.<\/p>\n\n\n\n<p>To begin our analysis, open your Kaggle Notebook and run the initial code to load the dataset. We create two data frames, <code>train_df<\/code> and <code>test_df<\/code>, using the following code:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\n# Load the competition files from the Kaggle input directory\ntrain_df = pd.read_csv('\/kaggle\/input\/titanic\/train.csv')\ntest_df = pd.read_csv('\/kaggle\/input\/titanic\/test.csv')<\/code><\/pre>\n\n\n\n<p>We can take a look at these data frames by running:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># display() renders both previews; a bare expression is shown only when it is the last line of a cell\ndisplay(train_df.head())\ndisplay(test_df.head())<\/code><\/pre>\n\n\n\n<p>As observed, the <code>train_df<\/code> contains the &#8216;Survived&#8217; column, whereas the <code>test_df<\/code> does not.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"600\" src=\"https:\/\/datascienceproject.eu\/blog\/wp-content\/uploads\/2024\/06\/Kaggle_DataFrame.png\" alt=\"\" class=\"wp-image-514\" srcset=\"https:\/\/datascienceproject.eu\/blog\/wp-content\/uploads\/2024\/06\/Kaggle_DataFrame.png 800w, https:\/\/datascienceproject.eu\/blog\/wp-content\/uploads\/2024\/06\/Kaggle_DataFrame-300x225.png 300w, https:\/\/datascienceproject.eu\/blog\/wp-content\/uploads\/2024\/06\/Kaggle_DataFrame-768x576.png 768w\" sizes=\"auto, (max-width: 
800px) 100vw, 800px\" \/><\/figure>\n\n\n\n<p>This absence is because the <code>test_df<\/code> is used for making predictions and does not include the target variable.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The Purpose of Splitting Data into Training and Testing Sets<\/h3>\n\n\n\n<p>In a data science project, the existence of two distinct datasets, one with the target variable and one without it, is a common and purposeful setup. This practice ensures that our models can generalize well to new, unseen data and are not just memorizing the training examples.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Why Split the Data?<\/h4>\n\n\n\n<p>After collecting data, it is crucial to split it into at least two sets: a training set and a testing set. This split is fundamental for the following reasons:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Model Training<\/strong>:\n<ul class=\"wp-block-list\">\n<li>The training dataset (<code>train_df<\/code>) includes the target variable\u2014&#8217;Survived&#8217; in this case\u2014which is the outcome we aim to predict. This dataset is used to teach our machine learning model by allowing it to learn from historical data. The model examines the relationships between the features (like age, fare, and class) and the target variable to understand what factors contribute to survival on the Titanic.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Model Evaluation<\/strong>:\n<ul class=\"wp-block-list\">\n<li>The testing dataset (<code>test_df<\/code>) does not include the target variable. This is intentional and serves a crucial purpose in the model evaluation process. Once our model is trained, we use the testing dataset to make predictions. Since the true outcomes (whether the passengers survived or not) are not provided in this dataset, it simulates a real-world scenario where we need to predict outcomes for unseen data. 
After we submit these predictions, Kaggle compares them against the actual outcomes (which Kaggle knows but does not include in the data we work with) to score the model&#8217;s performance.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Preventing Overfitting<\/strong>:\n<ul class=\"wp-block-list\">\n<li>By splitting the data, we can check that our model is not just memorizing the training data but is genuinely capable of generalizing its learning to new, unseen data. Overfitting occurs when a model learns the training data too well, including its noise and outliers, which negatively impacts its performance on new data. A separate testing set helps us detect overfitting so that we can correct it.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Real-World Simulation<\/strong>:\n<ul class=\"wp-block-list\">\n<li>Using a testing set simulates a real-world scenario where the model encounters new data it has never seen before. This helps us evaluate how well our model will perform in practical applications, providing a robust measure of its predictive power.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Model Improvement<\/strong>:\n<ul class=\"wp-block-list\">\n<li>By evaluating the model on the testing set, we can identify its weaknesses and improve it iteratively. This feedback loop is essential for refining the model and ensuring its accuracy and reliability.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n\n\n\n<p>In a captivating YouTube video, Cassie Kozyrkov sheds light on the dual nature of data points in our data science endeavors. 
<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"The most powerful idea in data science\" width=\"640\" height=\"360\" src=\"https:\/\/www.youtube.com\/embed\/e9KJ3kd80fQ?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<p>By splitting our data, we can dedicate one portion to training our models, enabling them to learn patterns and relationships from historical data. Meanwhile, the other portion serves as a testing ground, providing an opportunity to evaluate the model&#8217;s performance on new, unseen data. <\/p>\n\n\n\n<blockquote class=\"instagram-media\" data-instgrm-captioned data-instgrm-permalink=\"https:\/\/www.instagram.com\/p\/CkvpMMSjt6W\/?utm_source=ig_embed&amp;utm_campaign=loading\" data-instgrm-version=\"14\" style=\" background:#FFF; border:0; border-radius:3px; box-shadow:0 0 1px 0 rgba(0,0,0,0.5),0 1px 10px 0 rgba(0,0,0,0.15); margin: 1px; max-width:540px; min-width:326px; padding:0; width:99.375%; width:-webkit-calc(100% - 2px); width:calc(100% - 2px);\"><div style=\"padding:16px;\"> <a href=\"https:\/\/www.instagram.com\/p\/CkvpMMSjt6W\/?utm_source=ig_embed&amp;utm_campaign=loading\" style=\" background:#FFFFFF; line-height:0; padding:0 0; text-align:center; text-decoration:none; width:100%;\" target=\"_blank\" rel=\"noopener\"> <div style=\" display: flex; flex-direction: row; align-items: center;\"> <div style=\"background-color: #F4F4F4; border-radius: 50%; flex-grow: 0; height: 40px; margin-right: 14px; width: 40px;\"><\/div> <div style=\"display: flex; flex-direction: column; flex-grow: 1; justify-content: center;\"> <div style=\" background-color: #F4F4F4; border-radius: 4px; 
flex-grow: 0; height: 14px; margin-bottom: 6px; width: 100px;\"><\/div> <div style=\" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; width: 60px;\"><\/div><\/div><\/div><div style=\"padding: 19% 0;\"><\/div> <div style=\"display:block; height:50px; margin:0 auto 12px; width:50px;\"><svg width=\"50px\" height=\"50px\" viewBox=\"0 0 60 60\" version=\"1.1\" xmlns=\"https:\/\/www.w3.org\/2000\/svg\" xmlns:xlink=\"https:\/\/www.w3.org\/1999\/xlink\"><g stroke=\"none\" stroke-width=\"1\" fill=\"none\" fill-rule=\"evenodd\"><g transform=\"translate(-511.000000, -20.000000)\" fill=\"#000000\"><g><path d=\"M556.869,30.41 C554.814,30.41 553.148,32.076 553.148,34.131 C553.148,36.186 554.814,37.852 556.869,37.852 C558.924,37.852 560.59,36.186 560.59,34.131 C560.59,32.076 558.924,30.41 556.869,30.41 M541,60.657 C535.114,60.657 530.342,55.887 530.342,50 C530.342,44.114 535.114,39.342 541,39.342 C546.887,39.342 551.658,44.114 551.658,50 C551.658,55.887 546.887,60.657 541,60.657 M541,33.886 C532.1,33.886 524.886,41.1 524.886,50 C524.886,58.899 532.1,66.113 541,66.113 C549.9,66.113 557.115,58.899 557.115,50 C557.115,41.1 549.9,33.886 541,33.886 M565.378,62.101 C565.244,65.022 564.756,66.606 564.346,67.663 C563.803,69.06 563.154,70.057 562.106,71.106 C561.058,72.155 560.06,72.803 558.662,73.347 C557.607,73.757 556.021,74.244 553.102,74.378 C549.944,74.521 548.997,74.552 541,74.552 C533.003,74.552 532.056,74.521 528.898,74.378 C525.979,74.244 524.393,73.757 523.338,73.347 C521.94,72.803 520.942,72.155 519.894,71.106 C518.846,70.057 518.197,69.06 517.654,67.663 C517.244,66.606 516.755,65.022 516.623,62.101 C516.479,58.943 516.448,57.996 516.448,50 C516.448,42.003 516.479,41.056 516.623,37.899 C516.755,34.978 517.244,33.391 517.654,32.338 C518.197,30.938 518.846,29.942 519.894,28.894 C520.942,27.846 521.94,27.196 523.338,26.654 C524.393,26.244 525.979,25.756 528.898,25.623 C532.057,25.479 533.004,25.448 541,25.448 C548.997,25.448 549.943,25.479 
553.102,25.623 C556.021,25.756 557.607,26.244 558.662,26.654 C560.06,27.196 561.058,27.846 562.106,28.894 C563.154,29.942 563.803,30.938 564.346,32.338 C564.756,33.391 565.244,34.978 565.378,37.899 C565.522,41.056 565.552,42.003 565.552,50 C565.552,57.996 565.522,58.943 565.378,62.101 M570.82,37.631 C570.674,34.438 570.167,32.258 569.425,30.349 C568.659,28.377 567.633,26.702 565.965,25.035 C564.297,23.368 562.623,22.342 560.652,21.575 C558.743,20.834 556.562,20.326 553.369,20.18 C550.169,20.033 549.148,20 541,20 C532.853,20 531.831,20.033 528.631,20.18 C525.438,20.326 523.257,20.834 521.349,21.575 C519.376,22.342 517.703,23.368 516.035,25.035 C514.368,26.702 513.342,28.377 512.574,30.349 C511.834,32.258 511.326,34.438 511.181,37.631 C511.035,40.831 511,41.851 511,50 C511,58.147 511.035,59.17 511.181,62.369 C511.326,65.562 511.834,67.743 512.574,69.651 C513.342,71.625 514.368,73.296 516.035,74.965 C517.703,76.634 519.376,77.658 521.349,78.425 C523.257,79.167 525.438,79.673 528.631,79.82 C531.831,79.965 532.853,80.001 541,80.001 C549.148,80.001 550.169,79.965 553.369,79.82 C556.562,79.673 558.743,79.167 560.652,78.425 C562.623,77.658 564.297,76.634 565.965,74.965 C567.633,73.296 568.659,71.625 569.425,69.651 C570.167,67.743 570.674,65.562 570.82,62.369 C570.966,59.17 571,58.147 571,50 C571,41.851 570.966,40.831 570.82,37.631\"><\/path><\/g><\/g><\/g><\/svg><\/div><div style=\"padding-top: 8px;\"> <div style=\" color:#3897f0; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:550; line-height:18px;\">View this post on Instagram<\/div><\/div><div style=\"padding: 12.5% 0;\"><\/div> <div style=\"display: flex; flex-direction: row; margin-bottom: 14px; align-items: center;\"><div> <div style=\"background-color: #F4F4F4; border-radius: 50%; height: 12.5px; width: 12.5px; transform: translateX(0px) translateY(7px);\"><\/div> <div style=\"background-color: #F4F4F4; height: 12.5px; transform: rotate(-45deg) translateX(3px) translateY(1px); width: 
12.5px; flex-grow: 0; margin-right: 14px; margin-left: 2px;\"><\/div> <div style=\"background-color: #F4F4F4; border-radius: 50%; height: 12.5px; width: 12.5px; transform: translateX(9px) translateY(-18px);\"><\/div><\/div><div style=\"margin-left: 8px;\"> <div style=\" background-color: #F4F4F4; border-radius: 50%; flex-grow: 0; height: 20px; width: 20px;\"><\/div> <div style=\" width: 0; height: 0; border-top: 2px solid transparent; border-left: 6px solid #f4f4f4; border-bottom: 2px solid transparent; transform: translateX(16px) translateY(-4px) rotate(30deg)\"><\/div><\/div><div style=\"margin-left: auto;\"> <div style=\" width: 0px; border-top: 8px solid #F4F4F4; border-right: 8px solid transparent; transform: translateY(16px);\"><\/div> <div style=\" background-color: #F4F4F4; flex-grow: 0; height: 12px; width: 16px; transform: translateY(-4px);\"><\/div> <div style=\" width: 0; height: 0; border-top: 8px solid #F4F4F4; border-left: 8px solid transparent; transform: translateY(-4px) translateX(8px);\"><\/div><\/div><\/div> <div style=\"display: flex; flex-direction: column; flex-grow: 1; justify-content: center; margin-bottom: 24px;\"> <div style=\" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; margin-bottom: 6px; width: 224px;\"><\/div> <div style=\" background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; width: 144px;\"><\/div><\/div><\/a><p style=\" color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; line-height:17px; margin-bottom:0; margin-top:8px; overflow:hidden; padding:8px 0 7px; text-align:center; text-overflow:ellipsis; white-space:nowrap;\"><a href=\"https:\/\/www.instagram.com\/p\/CkvpMMSjt6W\/?utm_source=ig_embed&amp;utm_campaign=loading\" style=\" color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:normal; line-height:17px; text-decoration:none;\" target=\"_blank\" rel=\"noopener\">A post shared by Anna 
(@datascienceproject.eu)<\/a><\/p><\/div><\/blockquote> <script async src=\"\/\/www.instagram.com\/embed.js\"><\/script>\n\n\n\n<p>This approach ensures that our models can generalize well and make accurate predictions in real-world scenarios. With Kozyrkov&#8217;s insights in mind, we&#8217;re reminded of the critical role that data splitting plays in building robust and reliable predictive models in data science.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Maximizing Insights: The Power of Combining Testing and Training Sets<\/h3>\n\n\n\n<p>In any Titanic data science project, one of the critical steps to maximize insights and predictive accuracy is the strategic combination of the testing and training sets. This process offers a plethora of benefits, from enhancing data consistency to streamlining model validation. Let&#8217;s delve into why this integration is pivotal and how it can significantly elevate your analysis.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Holistic Data Analysis<\/strong>: Merging the testing and training sets creates a unified dataset that encompasses all available information about the Titanic passengers. This holistic approach enables a comprehensive analysis, unveiling deeper insights and uncovering hidden patterns that might remain undiscovered when the datasets are analyzed separately.<\/li>\n\n\n\n<li><strong>Consistency in Preprocessing<\/strong>: Data preprocessing, including handling missing values and encoding categorical variables, is fundamental for building robust machine learning models. Combining the datasets ensures that preprocessing steps are applied consistently across both the training and testing data. 
This consistency minimizes discrepancies and ensures that the model is trained on properly processed data.<\/li>\n\n\n\n<li><strong>Feature Engineering and Selection<\/strong>: A combined dataset provides a broader pool of data for feature engineering, allowing for the creation of new features or transformation of existing ones to improve model performance. It also supports informed feature selection: each feature&#8217;s distribution can be compared across both datasets, while its predictive power is evaluated on the labeled training rows.<\/li>\n\n\n\n<li><strong>Model Validation and Evaluation<\/strong>: Accurate validation and evaluation of machine learning models are essential for building reliable predictive models. Because the test rows carry no labels, techniques like k-fold cross-validation still run on the labeled training rows only; preparing both sets together helps the resulting performance metrics reflect how the model will behave on the test rows.<\/li>\n\n\n\n<li><strong>Streamlined Workflow<\/strong>: Working with a single combined dataset simplifies the analytical workflow, reducing complexity and streamlining tasks such as data preprocessing, model training, and evaluation. This streamlined approach enhances efficiency and makes it easier to manage and analyze the data.<\/li>\n<\/ol>\n\n\n\n<p>With Python and Pandas, combining the testing and training sets takes a single call, for example <code>pd.concat([train_df, test_df], ignore_index=True)<\/code>; the rows that come from the test set simply have no value in the &#8216;Survived&#8217; column.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Exploring Titanic Dataset Features<\/h2>\n\n\n\n<p>The Titanic dataset encapsulates a wealth of information about the passengers aboard the ill-fated ship. Each feature provides a unique glimpse into the demographics and circumstances surrounding the passengers. To illustrate this, let&#8217;s take a closer look at one of the passengers, Lily May Peel, also known as Mrs. Jacques Heath Futrelle.<\/p>\n\n\n\n<p>Lily May Futrelle&#8217;s journey on the Titanic is a poignant chapter in the ship&#8217;s history. 
She was a first-class passenger, reflecting her high socio-economic status. The dataset records her as female, aged 35 at the time of the voyage. The &#8220;SibSp&#8221; feature, with a value of 1, shows she was traveling with one sibling or spouse (in her case, her husband), and the &#8220;Parch&#8221; feature, with a value of 0, indicates she had no parents or children with her. <\/p>\n\n\n\n<p>She and her husband, Jacques Futrelle, boarded the Titanic in Southampton. Their ticket number, 113803, and the fare they paid, \u00a353.1, are indicative of the luxurious accommodations they enjoyed. They stayed in cabin C123, located in a prime area of the ship, which likely offered better access to lifeboats. <\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"600\" src=\"https:\/\/datascienceproject.eu\/blog\/wp-content\/uploads\/2024\/06\/Lily-May-Futrelle.png\" alt=\"\" class=\"wp-image-459\" srcset=\"https:\/\/datascienceproject.eu\/blog\/wp-content\/uploads\/2024\/06\/Lily-May-Futrelle.png 800w, https:\/\/datascienceproject.eu\/blog\/wp-content\/uploads\/2024\/06\/Lily-May-Futrelle-300x225.png 300w, https:\/\/datascienceproject.eu\/blog\/wp-content\/uploads\/2024\/06\/Lily-May-Futrelle-768x576.png 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/figure>\n\n\n\n<p>Lily&#8217;s story, extracted from these data points, provides a richer understanding of the demographics and circumstances that influenced survival on the Titanic. Her high socio-economic status, reflected in her ticket class and fare, along with her gender, played crucial roles in her survival. The combination of these factors offers a compelling example of how individual passenger data can illuminate broader trends and patterns in the Titanic tragedy.<\/p>\n\n\n\n<p>Let&#8217;s analyse each feature separately.<\/p>\n\n\n\n<p>1. <strong>PassengerId<\/strong> &#8211; a unique identifier assigned to each passenger. 
It lets us tell individual passengers apart. While not inherently predictive, it enables tracking and analysis of individual passengers&#8217; data.<\/p>\n\n\n\n<div style=\"display: flex; align-items: center;\">\n    <div style=\"background-color: #BFD1D4; width: 30px; height: auto; display: flex; justify-content: center; align-items: center; color: #7B7B7B;\">\n        <span style=\"font-size: 1.5em;\">?<\/span>\n    <\/div>\n    <div style=\"background-color: #BFD1D4; padding: 10px 20px; margin: 20px 0; font-size: 1em; color: #7B7B7B; box-shadow: 0 0 10px rgba(0, 0, 0, 0.1); border-radius: 5px; display: flex; align-items: center; width: calc(100% - 30px);\">\n        PassengerId itself may not directly impact survival, but could the order of boarding reflect proximity to lifeboats or access to resources, and so influence survival probabilities?\n    <\/div>\n<\/div>\n\n\n\n<p><strong>2. Survived<\/strong> &#8211; an indicator of survival, where 0 represents &#8220;did not survive&#8221; and 1 represents &#8220;survived&#8221;. This feature is present only in the training dataset, serving as the target variable for predictive modeling. 
Exploring survival rates across different demographic groups provides crucial insights into factors influencing survival.<\/p>\n\n\n\n<div style=\"display: flex; align-items: center;\">\n    <div style=\"background-color: #BFD1D4; width: 30px; height: auto; display: flex; justify-content: center; align-items: center; color: #7B7B7B;\">\n        <span style=\"font-size: 1.5em;\">?<\/span>\n    <\/div>\n    <div style=\"background-color: #BFD1D4; padding: 10px 20px; margin: 20px 0; font-size: 1em; color: #7B7B7B; box-shadow: 0 0 10px rgba(0, 0, 0, 0.1); border-radius: 5px; display: flex; align-items: center; width: calc(100% - 30px);\">\n        It&#8217;s hypothesized that factors such as ticket class, age, gender, and family size may correlate with survival probabilities. For instance, women and children may have had higher survival rates due to evacuation priority, while first-class passengers might have had easier access to lifeboats.\n    <\/div>\n<\/div>\n\n\n\n<p><strong>3. Pclass<\/strong> &#8211; ticket class indicating the socio-economic status of the passenger (Values: 1 = First class, 2 = Second class, 3 = Third class). Ticket class reflects passengers&#8217; socio-economic status, with 1st class indicating higher status and 3rd class representing lower status. 
Analyzing survival rates by ticket class reveals disparities in survival probabilities across socio-economic strata.<\/p>\n\n\n\n<div style=\"display: flex; align-items: center;\">\n    <div style=\"background-color: #BFD1D4; width: 30px; height: auto; display: flex; justify-content: center; align-items: center; color: #7B7B7B;\">\n        <span style=\"font-size: 1.5em;\">?<\/span>\n    <\/div>\n    <div style=\"background-color: #BFD1D4; padding: 10px 20px; margin: 20px 0; font-size: 1em; color: #7B7B7B; box-shadow: 0 0 10px rgba(0, 0, 0, 0.1); border-radius: 5px; display: flex; align-items: center; width: calc(100% - 30px);\">\n        It&#8217;s assumed that passengers in higher classes (1st class) had better access to resources, lifeboats, and possibly higher priority during evacuation, resulting in higher survival rates compared to lower classes (3rd class).\n    <\/div>\n<\/div>\n\n\n\n<p><strong>4. Name <\/strong>&#8211; the name of the passenger, including titles such as Mr., Mrs., Master, etc. While primarily a categorical feature, names can offer insights when parsed for titles (e.g., Mr., Mrs., Miss). 
These titles may correlate with socio-economic status or gender, providing additional context for analysis.<\/p>\n\n\n\n<div style=\"display: flex; align-items: center;\">\n    <div style=\"background-color: #BFD1D4; width: 30px; height: auto; display: flex; justify-content: center; align-items: center; color: #7B7B7B;\">\n        <span style=\"font-size: 1.5em;\">?<\/span>\n    <\/div>\n    <div style=\"background-color: #BFD1D4; padding: 10px 20px; margin: 20px 0; font-size: 1em; color: #7B7B7B; box-shadow: 0 0 10px rgba(0, 0, 0, 0.1); border-radius: 5px; display: flex; align-items: center; width: calc(100% - 30px);\">\n        While the name itself might not directly impact survival, parsing titles (e.g., Mr., Mrs., Miss) could provide insights into socio-economic status or gender, which might influence survival probabilities.\n    <\/div>\n<\/div>\n\n\n\n<p><strong>5. Sex<\/strong> &#8211; gender of the passenger, categorized as male or female. Gender plays a significant role in survival dynamics, with societal norms influencing evacuation priorities. Analyzing survival rates by gender sheds light on potential gender-based disparities in survival.<\/p>\n\n\n\n<div style=\"display: flex; align-items: center;\">\n    <div style=\"background-color: #BFD1D4; width: 30px; height: auto; display: flex; justify-content: center; align-items: center; color: #7B7B7B;\">\n        <span style=\"font-size: 1.5em;\">?<\/span>\n    <\/div>\n    <div style=\"background-color: #BFD1D4; padding: 10px 20px; margin: 20px 0; font-size: 1em; color: #7B7B7B; box-shadow: 0 0 10px rgba(0, 0, 0, 0.1); border-radius: 5px; display: flex; align-items: center; width: calc(100% - 30px);\">\n         Gender-based evacuation policies, where women and children were given priority, might lead to higher survival rates among females compared to males.\n    <\/div>\n<\/div>\n\n\n\n<p><strong>6. Age<\/strong> &#8211; age of the passenger in years. This feature may contain missing values. 
Age distribution among passengers provides insights into demographics and may correlate with survival probabilities. Exploring age groups and survival rates uncovers age-related trends in survival dynamics.<\/p>\n\n\n\n<div style=\"display: flex; align-items: center;\">\n    <div style=\"background-color: #BFD1D4; width: 30px; height: auto; display: flex; justify-content: center; align-items: center; color: #7B7B7B;\">\n        <span style=\"font-size: 1.5em;\">?<\/span>\n    <\/div>\n    <div style=\"background-color: #BFD1D4; padding: 10px 20px; margin: 20px 0; font-size: 1em; color: #7B7B7B; box-shadow: 0 0 10px rgba(0, 0, 0, 0.1); border-radius: 5px; display: flex; align-items: center; width: calc(100% - 30px);\">\n         It&#8217;s expected that age could impact survival probabilities, with children and elderly passengers potentially having higher survival rates due to evacuation priority, while young adults might have been more physically capable of survival actions.\n    <\/div>\n<\/div>\n\n\n\n<p><strong>7. SibSp <\/strong>&#8211; number of siblings or spouses aboard the Titanic. The number of siblings\/spouses aboard can indicate family size and support networks. 
Analyzing survival rates by SibSp reveals how familial relationships influenced survival probabilities.<\/p>\n\n\n\n<div style=\"display: flex; align-items: center;\">\n    <div style=\"background-color: #BFD1D4; width: 30px; height: auto; display: flex; justify-content: center; align-items: center; color: #7B7B7B;\">\n        <span style=\"font-size: 1.5em;\">?<\/span>\n    <\/div>\n    <div style=\"background-color: #BFD1D4; padding: 10px 20px; margin: 20px 0; font-size: 1em; color: #7B7B7B; box-shadow: 0 0 10px rgba(0, 0, 0, 0.1); border-radius: 5px; display: flex; align-items: center; width: calc(100% - 30px);\">\n       Passengers traveling with siblings or spouses might have benefited from mutual support during evacuation, potentially leading to higher survival rates compared to solo travelers.\n    <\/div>\n<\/div>\n\n\n\n<p><strong>8. Parch <\/strong>&#8211; number of parents or children aboard the Titanic. Similarly, the number of parents\/children aboard reflects family dynamics. Analyzing survival rates by Parch uncovers how parental responsibilities affected survival probabilities.<\/p>\n\n\n\n<div style=\"display: flex; align-items: center;\">\n    <div style=\"background-color: #BFD1D4; width: 30px; height: auto; display: flex; justify-content: center; align-items: center; color: #7B7B7B;\">\n        <span style=\"font-size: 1.5em;\">?<\/span>\n    <\/div>\n    <div style=\"background-color: #BFD1D4; padding: 10px 20px; margin: 20px 0; font-size: 1em; color: #7B7B7B; box-shadow: 0 0 10px rgba(0, 0, 0, 0.1); border-radius: 5px; display: flex; align-items: center; width: calc(100% - 30px);\">\n       Passengers accompanied by parents or children might have prioritized family members during evacuation, potentially increasing their chances of survival.\n    <\/div>\n<\/div>\n\n\n\n<p><strong>9. Ticket<\/strong> &#8211; ticket number assigned to the passenger. 
Ticket numbers may contain alphanumeric codes or patterns indicative of ticket types or purchase locations. Analyzing ticket features can provide insights into ticketing systems and passenger origins.<\/p>\n\n\n\n<div style=\"display: flex; align-items: center;\">\n    <div style=\"background-color: #BFD1D4; width: 30px; height: auto; display: flex; justify-content: center; align-items: center; color: #7B7B7B;\">\n        <span style=\"font-size: 1.5em;\">?<\/span>\n    <\/div>\n    <div style=\"background-color: #BFD1D4; padding: 10px 20px; margin: 20px 0; font-size: 1em; color: #7B7B7B; box-shadow: 0 0 10px rgba(0, 0, 0, 0.1); border-radius: 5px; display: flex; align-items: center; width: calc(100% - 30px);\">\n      Ticket types or purchase locations might correlate with socio-economic status or passenger origins, which could impact access to resources and survival probabilities.\n    <\/div>\n<\/div>\n\n\n\n<p><strong>10. Fare <\/strong>&#8211; the fare paid by the passenger for the ticket. Fare paid for tickets reflects passengers&#8217; purchasing power and ticket class. Exploring fare distributions and survival rates by fare brackets illuminates socio-economic disparities in survival.<\/p>\n\n\n\n<div style=\"display: flex; align-items: center;\">\n    <div style=\"background-color: #BFD1D4; width: 30px; height: auto; display: flex; justify-content: center; align-items: center; color: #7B7B7B;\">\n        <span style=\"font-size: 1.5em;\">?<\/span>\n    <\/div>\n    <div style=\"background-color: #BFD1D4; padding: 10px 20px; margin: 20px 0; font-size: 1em; color: #7B7B7B; box-shadow: 0 0 10px rgba(0, 0, 0, 0.1); border-radius: 5px; display: flex; align-items: center; width: calc(100% - 30px);\">\n   Higher fares might indicate higher ticket classes and socio-economic status, which could correlate with better access to resources and higher survival probabilities.\n    <\/div>\n<\/div>\n\n\n\n<p><strong>11. 
Cabin<\/strong> &#8211; cabin number where the passenger stayed. This feature may contain missing values. Cabin numbers may correlate with cabin locations on the ship, influencing proximity to lifeboats and evacuation routes. Analyzing survival rates by cabin location provides insights into spatial dynamics of survival.<\/p>\n\n\n\n<div style=\"display: flex; align-items: center;\">\n    <div style=\"background-color: #BFD1D4; width: 30px; height: auto; display: flex; justify-content: center; align-items: center; color: #7B7B7B;\">\n        <span style=\"font-size: 1.5em;\">?<\/span>\n    <\/div>\n    <div style=\"background-color: #BFD1D4; padding: 10px 20px; margin: 20px 0; font-size: 1em; color: #7B7B7B; box-shadow: 0 0 10px rgba(0, 0, 0, 0.1); border-radius: 5px; display: flex; align-items: center; width: calc(100% - 30px);\">\n  Cabin locations might influence proximity to lifeboats and evacuation routes, potentially impacting survival probabilities.\n    <\/div>\n<\/div>\n\n\n\n<p><strong>12. Embarked <\/strong>&#8211; port of embarkation for the passenger (Values: C = Cherbourg, Q = Queenstown, S = Southampton). Port of embarkation reflects passengers&#8217; embarkation locations and potentially their socio-economic backgrounds. 
Analyzing survival rates by embarkation port reveals regional disparities in survival probabilities.<\/p>\n\n\n\n<div style=\"display: flex; align-items: center;\">\n    <div style=\"background-color: #BFD1D4; width: 30px; height: auto; display: flex; justify-content: center; align-items: center; color: #7B7B7B;\">\n        <span style=\"font-size: 1.5em;\">?<\/span>\n    <\/div>\n    <div style=\"background-color: #BFD1D4; padding: 10px 20px; margin: 20px 0; font-size: 1em; color: #7B7B7B; box-shadow: 0 0 10px rgba(0, 0, 0, 0.1); border-radius: 5px; display: flex; align-items: center; width: calc(100% - 30px);\">\n  Passengers embarking from different ports might have varied socio-economic backgrounds or demographics, which could impact survival probabilities.\n    <\/div>\n<\/div>\n\n\n\n<p>Exploring these features offers valuable insights into the demographics, socio-economic status, and familial relationships among Titanic passengers. Understanding the nuances of each feature is crucial for data preprocessing, visualization, and model building in the quest to uncover patterns of survival.<\/p>\n\n\n\n<p>After a thorough analysis of each feature, we have gained valuable insights into the Titanic dataset. Now, armed with this understanding, it&#8217;s time to dive deeper into the actual data using Python within a Jupyter notebook environment.<\/p>\n\n\n\n<p>Python, with its powerful libraries such as Pandas, NumPy, and Matplotlib, provides us with the tools needed to explore, visualize, and analyze the dataset comprehensively. 
By leveraging these tools, we can uncover hidden patterns, correlations, and trends within the data.<\/p>\n\n\n\n<p>In the upcoming sections, we will walk through loading the Titanic dataset into a Jupyter notebook, performing data preprocessing tasks, visualizing key features, and building predictive models to tackle the challenge posed by the Titanic Kaggle competition.<\/p>\n\n\n\n<p>Let&#8217;s embark on this journey of data exploration and analysis, as we uncover the stories hidden within the Titanic dataset and strive to build predictive models that shed light on the fate of its passengers.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Unlock the secrets of the Titanic&#8217;s tragic voyage with this data science challenge! Step into the shoes of a historical detective and use modern algorithms to predict who survived that fateful night. Join me on this intriguing journey through time and data.<\/p>\n","protected":false},"author":3,"featured_media":402,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11,35],"tags":[36],"class_list":["post-406","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-project","category-python","tag-data-science-project"],"_links":{"self":[{"href":"https:\/\/datascienceproject.eu\/blog\/wp-json\/wp\/v2\/posts\/406","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/datascienceproject.eu\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/datascienceproject.eu\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/datascienceproject.eu\/blog\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/datascienceproject.eu\/blog\/wp-json\/wp\/v2\/comments?post=406"}],"version-history":[{"count":87,"href":"https:\/\/datascienceproject.eu\/blog\/wp-json\/wp\/v2\/posts\/406\/revisions"}],"predecessor-version":[{"id":535,"href":"https:\/
\/datascienceproject.eu\/blog\/wp-json\/wp\/v2\/posts\/406\/revisions\/535"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/datascienceproject.eu\/blog\/wp-json\/wp\/v2\/media\/402"}],"wp:attachment":[{"href":"https:\/\/datascienceproject.eu\/blog\/wp-json\/wp\/v2\/media?parent=406"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/datascienceproject.eu\/blog\/wp-json\/wp\/v2\/categories?post=406"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/datascienceproject.eu\/blog\/wp-json\/wp\/v2\/tags?post=406"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}