Data Science Weekly - Issue 501

Curated news, articles and jobs related to Data Science

Jun 30

Issue #501
June 29 2023

Hello and thank you for tuning in to Issue #501.

Once a week we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.

***

Seeing this for the first time? Subscribe here:

***

Want to support us? Become a paid subscriber here.

***

If you don’t find this email useful, please unsubscribe here.

***

And now, let's dive into some interesting links from this week:

Editor's Picks

How to make your scientific data accessible, discoverable and useful
Nature asked data scientists about their best practices for publishing usable, high-quality data — here’s what they said…

The Many Ways that Digital Minds can Know
Detractors of LLMs describe them as blurry jpegs of the web, or stochastic parrots. Promoters of LLMs describe them as having sparks of AGI, or learning the generating function of multivariable calculus. These positions seem opposed to each other and are the subject of acrimonious debate. I do not think they are opposed, and I hope in this post to convince you of that. In particular:
1. LLMs do both of the things that their promoters and detractors say they do.
2. They do both of these at the same time on the same prompt.
3. It is very difficult from the outside to tell which they are doing.
4. Both of them are useful.
In this post I’m going to introduce some new terminology that I think will be useful to reason about them, and hopefully shed some of the connotations and baggage of prior terminology that are poorly suited for the phenomena that they are now tasked with describing…..

Naming things
There’s a reason we often joke that naming things is one of the two hard things in programming, but we often say it in kind of a hopeless, “Haha, this is crazy, what can we do about it” kind of way…Luckily, I recently came across a gem of a book in my Amazon recommendations, “Naming Things” by Tom Benner, that aims to address this. The book is super short, but I strongly recommend it for anyone looking to get more understanding behind the ghost knowledge that guides naming conventions in programming…Here are some of my favorite highlights…

A Message from this week's Sponsor:

Prodigy: a radically efficient data annotation tool for machine teaching

Prodigy is a scriptable and developer friendly data annotation tool made by the same developers who created spaCy. It’s designed for rapid iteration on your datasets and allows you to fully customize the annotation process. There’s also a new alpha version out with support for prompt engineering, LLM guided annotation and tools to specify annotator overlap and task routing.

You can get a personal license of Prodigy with a discount this month! Just use coupon code DSW-2023 at checkout.

Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org

Data Science Articles & Videos

People who use python for data science - what are the use cases for building your own classes? [Reddit Discussion]
I've been doing (fairly light) data science with Python for a couple of years now and have got by without building my own classes. I've been learning them recently though and am intrigued at how they might help with data science. I can see how useful they'd be for game design, but am struggling to think of DS applications. Are people writing their own? And if so, what for?..

Why ChatGPT And Copilot Are bad For Developers [Video]
In this video I will share my personal opinion on why GitHub Copilot and ChatGPT is bad for you as a programmer…

Data Delight: "Weave" Your Way to Interactive Exploration! 🕺💡
I'm a passionate advocate of tools that can elevate the Jupyter Notebook experience. Among the various tools my teams utilize, Nbdev has proven to be invaluable. Today, I'm excited to share another tool that has caught my attention from the Fully Connected Conference hosted by Weights & Biases last Wednesday, and that I'd like to introduce: Weave…Weave is a brand-new open-source toolkit designed for “performant, interactive data exploration” within your familiar environment. Per its authors from Weights & Biases: Our mission is to equip Machine Learning practitioners with the best tools to turn data into insights quickly and easily…
Boost your power with baseline covariates
This is the first post in a series on causal inference. Our ultimate goal is to learn how to analyze data from true experiments, such as RCT’s, with various likelihoods from the generalized linear model (GLM), and with techniques from the contemporary causal inference literature. We’ll do so both as frequentists and as Bayesians…
Bing Maps Global Building Footprints Released
Microsoft Maps has a dedicated Maps AI (artificial intelligence) team that has been taking advantage of Microsoft’s investments in deep learning, computer vision, and ML (machine learning). Applying all that cool tech to mapping has yielded many useful datasets and our latest worldwide dataset release includes a whopping 1.2B building footprints and 174M building height estimates from Bing Maps imagery between 2014 and 2023 including Maxar, Airbus, and IGN France imagery…

Understanding missing data mechanisms using causal DAGs
Using causal missingness DAGs to understand missing data and choose an appropriate analysis. Ever been confused by terms like MCAR, MAR, MNAR? Wondered if there's any way to deal with data being Missing Not At Random? This might help…
LLM Powered Autonomous Agents
Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver…

June 2023, A Stage Review of Instruction Tuning
Following the great success of ChatGPT, on February 24, 2023, the emergence of LLaMA heated up the direction of instruction tuning. On March 18, Alpaca demonstrated the potential of distilling smaller models from mature ones to become decent ChatBots, triggering a Cambrian explosion of llama-based models. However, just three months later, people began to recognize the various problems of training LLaMA with ChatGPT's data. This article reviews the development of LLaMA-based models over the past three months and discusses the next challenges of Instruction Tuning…

What are the major advantages of having deep understanding of ML algorithms?
Given the fact that the process to "find the best model" is just testing all models and their hyperparameters, what benefits do I have of having deep understanding of ML algorithms? I mean, even without knowledge of any algorithm, anyone can import all the algorithms, put in a pipeline and select the best…I think that having deep understanding of the algorithms can lead to a better intuition of what would work or not, but since the only way to prove it is testing, I can't see that much value in it…What am I missing?…

WizMap - Exploring and interpreting large embeddings in your browser!
Explore and interpret large embeddings in your browser with interactive visualization! 📍…

Passive Learning of Active Causal Strategies in Agents and Language Models
If you're interested in what LMs or agents can learn about causal strategies from passive data, a recording of a talk I gave at @ic_arl a few weeks ago is out now!..

awesome-quarto
A curated list of Quarto talks, tools, examples & articles!…

Jobs

Data Scientist at The LEGO Group

Are you passionate about data and would like to help The LEGO Group to discover deeper insights, make better predictions, or generate relevant product recommendations?

This is your chance to apply data science in a real business context to contribute to one of the world’s best-loved brands. Our team is responsible for the LEGO Builder app (Digital building instructions). Help us become data driven in our development, by finding patterns in the data and thus helping us get the insights to the usage and users affinity –thereby helping us to build an even more engaging and proven experience for the Builders of tomorrow.

Apply here

Want to post a job here? Email us for details --> team@datascienceweekly.org

Training & Resources

SQL Window Functions
The window function is an advanced feature of SQL that allows you to manipulate and perform advanced analytics without the need to write complex queries. In this article we will talk about Windows functions, the different types, and when to use the window function…
Learn the fundamentals of generative AI for real-world applications
In Generative AI with Large Language Models (LLMs), you’ll learn the fundamentals of how generative AI works, and how to deploy it in real-world applications…
The Animated Transformer
The Transformer is foundational to the recent advancements in large language models (LLMs). In this article, we will attempt to unravel some of its inner workings and hopefully gain some insight into how these models function…

Last Week's Newsletter's 3 Most Clicked Links

Why professors are so bad at giving advice

Prompt Engineering 201: Advanced methods and toolkits

People: The API User’s Guide

* Based on unique clicks.
** Find last week's issue #500 here.

Cutting Room Floor

Thanks for joining us this week :)

All our best,
Hannah & Sebastian

P.S.,
If you found this newsletter helpful, consider supporting us by becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)

You're currently a free subscriber to Data Science Weekly Newsletter. For the full experience, upgrade your subscription.

Upgrade to paid

Comment

Restack

GLOSSARY

четверг, 29 июня 2023 г.