Data Science Weekly - Issue 613Curated news, articles and jobs related to Data Science, AI, & Machine LearningIssue #613 |
|
.
Last Week’s Poll:
.
Data Science Articles & Videos
7 Drop-In Replacements to Instantly Speed Up Your Python Data Science Workflows
Many of Python’s most popular data science libraries—including pandas, Polars, scikit-learn, and XGBoost—can now run much faster on GPUs with little to no code changes…Using libraries like NVIDIA cuDF, cuML, and cuGraph, you can keep your existing code and scale it to handle much larger workloads with ease. This post shares how to speed up seven drop-in replacements for popular Python libraries—complete with starter code to try yourself…A typology of literature reviews: Which review is that? A guide to review types
This quick reference tool provides information on a wide range of literature review types that are available for research synthesis for publication and research purposes…
Thing that destroys your reputation as a data engineer [Reddit]
Does anyone have experiences of things they did as a data engineer that they later regretted and wished they hadn’t done?…A Short Introduction to Optimal Transport and Wasserstein Distance
These notes provide a brief introduction to optimal transport theory, prioritizing intuition over mathematical rigor…Why Optimal Transport Theory?…A fundamental problem in statistics and machine learning is to come up with useful measures of “distance” between pairs of probability distributions…Why Scaling a Database Is Harder Than Scaling a Server
Let’s understand why scaling a database is much harder than scaling a server, with clear examples anyone can relate to…A short statistical reasoning test
Here are a few practical questions of my own invention which are easy to comprehend but very difficult to solve without statistical reasoning competence. They are provided in order of difficulty. The answers are at the end…Bioinformatics courses for beginners
Looking for a bioinformatics course that's perfect for beginners? Look no further! In this video, we'll introduce you some of the best bioinformatics computational biology courses available online. The courses are perfect for people who want to learn about the basics of bioinformatics, from computational methods to sequence analysis. After watching this video, you'll have everything you need to start your journey into bioinformatics!…Absence of evidence and confidence intervals
Here are some statistical interpretation anti-patterns:This study found no significant evidence that X had any effect. Therefore…
(Sub-pattern of the first: because of conservation of expected evidence, this is evidence that the true effect is zero.)
Study A found significant evidence for X, but study B didn’t. Therefore, they disagree or there’s heterogeneity.
There are a couple problems with claims like these…
Basic Feature Engineering with DuckDB
In this post, we show how to perform essential machine learning data preprocessing tasks, like missing value imputation, categorical encoding, and feature scaling, directly in DuckDB using SQL. This approach not only simplifies workflows, but also takes advantage of DuckDB’s high-performance, in-process execution engine for fast, efficient data preparation…
From Facts & Metrics to Media Machine Learning: Evolving the Data Engineering Function at Netflix
Traditional data engineering at Netflix focused on building structured tables for metrics, dashboards, and data science models. These tables were primarily structured text or numerical fields, ideal for business intelligence, analytics and statistical modeling…Enter Media ML Data Engineering — a new specialization at Netflix that bridges the gap between traditional data engineering and the unique demands of media-centric machine learning. This role sits at the intersection of data engineering, ML infrastructure, and media production…Control of Nonlinear Systems Using Denoising Diffusion Models
In this talk, we view classical control and planning problems from the point of view of sampling from a probability distribution. In contrast to classical sampling, the samples in this scenario have to respect constraints imposed by the control system. Denoising Diffusion Probabilistic Models (DDPMs) is a recently introduced method in machine learning for sampling based image generation…
Data Engineering in 2025 - Key Shifts in Pipelines, Storage, and Tooling [Reddit]
Some patterns I’ve noticed across multiple industries:Unified Batch + Streaming Architectures - Tools like Apache Flink and RisingWave are making it easier to blend historical batch data with real-time streams in a single workflow.
Data Contracts - More teams are introducing formal schema agreements between producers and consumers to reduce downstream breakages.
Iceberg/Delta Lake adoption surge - Open table formats are becoming the default for large-scale analytics, replacing siloed proprietary storage layers.
Cost-optimized pipelines - Teams are actively redesigning ETL to ELT, pushing more transformations into cloud warehouses to reduce compute spend.
Shift-left data quality - Data validation is moving earlier in the pipeline with tools like Great Expectations and Soda Core integrated right into ingestion steps…
Mistakes, we’ve drawn a few: Learning from our errors in data visualisation
At The Economist, we take data visualisation seriously. Every week we publish around 40 charts across print, the website and our apps. With every single one, we try our best to visualise the numbers accurately and in a way that best supports the story. But sometimes we get it wrong. We can do better in future if we learn from our mistakes — and other people may be able to learn from them, too. After a deep dive into our archive, I found several instructive examples…
.
Last Week's Newsletter's 3 Most Clicked Links
.
* Based on unique clicks.
** Find last week's issue #612 here.
Cutting Room Floor
Beyond Sensor Data: Foundation Models of Behavioral Data from Wearables Improve Health Predictions
tinytable: a small but powerful R package to draw beautiful tables in a variety of formats
Running OpenAI GPT OSS Locally with Docker Model Runner and R
.
Whenever you're ready, 2 ways we can help:
Looking to get a job? Check out our “Get A Data Science Job” Course
It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.Promote yourself/organization to ~68,500 subscribers by sponsoring this newsletter. 30-40% weekly open rate.
Thank you for joining us this week! :)
Stay Data Science-y!
All our best,
Hannah & Sebastian
You're currently a free subscriber to Data Science Weekly Newsletter. For the full experience, upgrade your subscription.


Комментариев нет:
Отправить комментарий
Примечание. Отправлять комментарии могут только участники этого блога.