Development of distributed machine learning training with a reward system

siddheshtv · April 8, 2024, 5:03pm

How severe does this issue affect your experience of using Ray?

Low: It annoys or frustrates me for a moment.

I’m currently working on a project to develop a library that offers distributed computing and rewards its worker nodes for the amount of computation they provide.

I’m struggling with how to use Ray Train to build an open-ended head node and worker node architecture, one where anyone can join as a worker and contribute computing power in return for a monetary reward. I have thought about it a bit but am still very unsure.

Here’s what I’m thinking of:

openchainlib

openchainlib aims to reduce machine learning training time by introducing decentralized worker nodes that get rewarded for the computational power they provide. The aim of this library is to contribute towards faster model training and machine learning development while keeping training costs low.

Why?

Training an ML model is extremely time-consuming
Using cloud services can get too costly over time

Definitions

openchainlib works on a client-worker architecture.

Client node - A client wants to train a machine learning model but does not have enough resources to get the training done faster or cheaper.

Worker node - A worker is someone who willingly contributes computational power in return for a reward.

Job - A job is something that requires compute power. In our case, this is the machine learning pipeline.

Block - A block consists of n jobs. Each block has its own unique reward amount, which depends on factors that include the number of jobs, the expected compute power required, the time taken to solve the block, and more.
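To make the Job and Block definitions more concrete, here is a minimal sketch of how they might be modelled. The field names, the compute unit, and the linear reward formula with its weights are all assumptions made for illustration; the actual reward rule is still undecided.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    """One unit of work, e.g. a single ML training pipeline run."""
    job_id: str
    expected_compute: float  # hypothetical unit, e.g. estimated GPU-hours


@dataclass
class Block:
    """A group of n jobs that is rewarded as a whole."""
    jobs: List[Job] = field(default_factory=list)
    solve_time_hours: float = 0.0  # time taken to solve the block


def block_reward(block: Block,
                 per_job: float = 1.0,
                 per_compute: float = 2.0,
                 per_hour: float = 0.5) -> float:
    """Hypothetical linear reward in OhC; the weights are placeholders."""
    total_compute = sum(job.expected_compute for job in block.jobs)
    return (per_job * len(block.jobs)
            + per_compute * total_compute
            + per_hour * block.solve_time_hours)


block = Block(jobs=[Job("a", 3.0), Job("b", 1.0)], solve_time_hours=2.0)
reward = block_reward(block)
print(reward)
```

One open design question this exposes: paying more for a longer solve time could reward slow workers, so the time term may need a different sign or a cap.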

OhChain! coin (OhC) - The fundamental currency of the openchainlib ecosystem. Client nodes that want to get a job done need to pay for it in OhChain! coins.

Gas - Gas is a fee structure introduced in order to maintain validity, governance, and security in the openchainlib network. Gas is a reward that is paid to Validator nodes in return for verifying the authenticity of each block.

Proof-of-History - openchainlib works on a Proof-of-History mechanism. The log of all blocks is verified by validator nodes. If there are “x” validator nodes in the network, they must all agree on the validity of a block by comparing it with the copy that each validator node has stored.
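The comparison step described above could be sketched roughly as follows. This only illustrates the "every validator's stored copy must match" idea using SHA-256 over a canonical JSON encoding; the block layout and the function names are hypothetical, and a real network would also need signatures and a way to handle offline validators.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    """SHA-256 over a canonical (sorted-key) JSON encoding of the block."""
    encoded = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def all_validators_agree(candidate: dict, validator_copies: list) -> bool:
    """True only if every validator's stored copy hashes to the candidate's hash."""
    expected = block_hash(candidate)
    # An empty validator set never counts as agreement.
    return bool(validator_copies) and all(
        block_hash(copy) == expected for copy in validator_copies
    )


candidate = {"index": 1, "jobs": ["job-a", "job-b"], "reward": 11.0}
honest = [dict(candidate) for _ in range(3)]
tampered = honest[:2] + [{**candidate, "reward": 999.0}]

agree_honest = all_validators_agree(candidate, honest)
agree_tampered = all_validators_agree(candidate, tampered)
print(agree_honest, agree_tampered)
```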

Validator node - A validator node is responsible for verifying the hash values of each block. One cannot become a validator node unless they hold a stake of more than 39900 in the network. There can only be 20 validator nodes in the entire openchainlib ecosystem.
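The two validator rules above (a stake strictly greater than 39900 and at most 20 validators) can be written as a small eligibility check. The function name and signature are hypothetical, chosen only for this sketch.

```python
MIN_STAKE = 39_900    # from the definition above: stake must exceed this
MAX_VALIDATORS = 20   # from the definition above: network-wide cap


def can_become_validator(stake: float, current_validators: int) -> bool:
    """Check the stake threshold and that a validator slot is still free."""
    return stake > MIN_STAKE and current_validators < MAX_VALIDATORS


print(can_become_validator(40_000, 5))   # enough stake, slots available
print(can_become_validator(39_900, 5))   # stake must be strictly greater
print(can_become_validator(50_000, 20))  # all 20 slots already taken
```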

I’d appreciate suggestions on how to get started with the project and which components I should develop first.
