A question that often arises here is: why choose XGBoost when you have LLMs? In fact, data scientists who focus on tabular data are deeply divided when choosing between XGBoost, LightGBM, and LLMs.
LLMs can classify tabular data effectively with minimal preprocessing, though at the cost of inference time. Approaches for applying LLMs to tabular data, such as prompt engineering, are emerging but still in the early stages of development.
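To make the prompt-based approach concrete, here is a minimal sketch of how a tabular record might be serialized into a classification prompt. The `serialize_row` and `build_prompt` helpers, the column names, and the commented-out `query_llm` call are illustrative assumptions, not part of any particular library.

```python
# A minimal sketch of prompt-based tabular classification.
# `query_llm` is a stand-in for whichever LLM client you use;
# it is assumed here, not a real library call.

def serialize_row(row: dict) -> str:
    """Turn one tabular record into a natural-language description."""
    return ", ".join(f"{col} is {val}" for col, val in row.items())

def build_prompt(row: dict, labels: list[str]) -> str:
    """Wrap a serialized record in a constrained classification prompt."""
    return (
        "Classify the following record.\n"
        f"Record: {serialize_row(row)}.\n"
        f"Answer with exactly one of: {', '.join(labels)}."
    )

row = {"age": 42, "income": 58000, "tenure_months": 13}
prompt = build_prompt(row, labels=["churn", "no_churn"])
print(prompt)
# prediction = query_llm(prompt)  # hypothetical LLM call
```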
Instead of relying solely on textual outputs, the focus is shifting towards using the internal embeddings generated by LLMs, known as latent structure embeddings. These embeddings can be integrated into traditional tabular models like XGBoost.
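A hedged sketch of that embedding route follows, assuming the sentence-transformers and xgboost packages are installed and using a small open encoder as a stand-in for an LLM; the model name, toy rows, and labels are illustrative only.

```python
# A minimal sketch: embed serialized rows with an LLM-style encoder,
# then train XGBoost on the embeddings. Data and labels are toy values.
import numpy as np
from sentence_transformers import SentenceTransformer
from xgboost import XGBClassifier

rows = [
    "age is 42, income is 58000, tenure_months is 13",
    "age is 29, income is 31000, tenure_months is 2",
    "age is 55, income is 90000, tenure_months is 48",
    "age is 33, income is 45000, tenure_months is 5",
]
labels = np.array([0, 1, 0, 1])  # toy churn labels

encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(rows)  # shape: (n_rows, embedding_dim)

clf = XGBClassifier(n_estimators=50, max_depth=3)
clf.fit(X, labels)
print(clf.predict(encoder.encode(
    ["age is 40, income is 52000, tenure_months is 9"]
)))
```

The appeal of this split is that the encoder supplies dense features while the gradient-boosted trees handle the decision boundary, so no fine-tuning of the LLM itself is required.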
While Transformers have revolutionised generative AI, their primary strengths remain in handling unstructured and sequential data, as well as tasks involving intricate patterns. This convergence of techniques is a promising step towards more versatile and efficient machine learning models.
Read the full story here.
Toolkit for Ethical AI
Developing and deploying AI ethically and responsibly is of utmost importance, and a range of toolkits is available to assist in this endeavour; the full story rounds up a few of them.
Read the full story here.
Law-breaker Llama 2
With each new development, Llama 2 keeps breaking laws, scaling laws at least, emerging as a unique model that people are using as a base for training new ones. The latest development comes in the form of TinyLlama.
A research assistant at the Singapore University of Technology and Design has initiated the training of TinyLlama, a 1.1-billion-parameter model inspired by Llama 2, with the goal of pre-training it on a massive dataset of 3 trillion tokens. This ambitious goal goes against the Chinchilla scaling law, which says that for compute-optimal training of a Transformer-based language model, the number of parameters and the number of training tokens should scale in approximately equal proportions.
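A quick back-of-the-envelope check makes the mismatch concrete, using the common reading of the Chinchilla result as roughly 20 training tokens per parameter (itself an approximation):

```python
# Rough check of TinyLlama's plan against the Chinchilla heuristic
# of ~20 training tokens per parameter.
params = 1.1e9            # TinyLlama parameter count
target_tokens = 3e12      # planned pre-training tokens
chinchilla_tokens = 20 * params

print(f"Chinchilla-optimal tokens: {chinchilla_tokens:.2e}")  # ~2.20e+10
print(f"Planned tokens:            {target_tokens:.2e}")      # 3.00e+12
print(f"Over-training factor:      {target_tokens / chinchilla_tokens:.0f}x")  # ~136x
```

In other words, TinyLlama is slated to see on the order of a hundred times more tokens than the Chinchilla-optimal budget for its size.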
Read the full story here.
OpenAI-backed Startups