How to Build, Distribute and Configure a D-thinker

Introduction

Data Thinker is a scalable data processing system. Such a system is also called a D-thinker or, simply, a thinker. A D-thinker comprises a number of computers: it can be as small as several processes on one computer or as large as tens of thousands of high-end servers. This document explains how to build and configure a D-thinker yourself on your own PC or cluster. D-thinker can run on most modern 64-bit Linux systems; 64-bit Fedora 12, 19 and 22 are among the known-good, extensively tested platforms. Alternatively, you may use pre-configured D-thinker environments on ThinkBox, the plug-and-play big data appliance, or on AWS EC2 instances launched from a pre-configured AMI.

Get the D-thinker system components

Set up Cod so that you can obtain the dt software

Cod is a tool that helps retrieve and maintain source and binary files from multiple repositories. To obtain the dt software, you need to set up Cod following the steps here.

Retrieve D-thinker components using Cod

The D-thinker components are in the 'dt' code tree. To get the dt code tree, you need to clone it using git if you have not already done so.

To get dt, run

cod read dt

You also need to retrieve three other repositories: bin, dt-common and utilib.

cod read bin
cod read dt-common
cod read utilib

How to deploy

You can deploy the D-thinker components to the working node, that is, the node on which you cloned the repositories and on which the tools/packages listed below are installed. The working node will be the portal of the D-thinker.

Prerequisite

These tools/packages need to be installed to build a D-thinker:

make >= 3.81
gcc >= 4.4.4
glibc-static >= 2.11
gcc-c++ >= 4.4.4
libstdc++-static >= 4.4.4
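
The version requirements above can be checked before building. The following is a minimal sketch, not part of the dt tools: the helper names version_ge and check_tool are made up for this example, it relies on the -V (version sort) option of GNU sort, and it assumes each tool prints its version on the first line of its --version output. The static libraries are distribution packages rather than commands; on an RPM-based system such as Fedora you could query them with rpm -q glibc-static libstdc++-static.

```shell
# version_ge A B: succeed when dotted version A is greater than or equal to B.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_tool NAME MINIMUM: report whether NAME is on PATH and new enough.
# Assumes the first line of `NAME --version` contains the version number.
check_tool() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "$1: missing"
        return 1
    fi
    found=$("$1" --version | head -n 1 | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1)
    if version_ge "$found" "$2"; then
        echo "$1: $found (ok)"
    else
        echo "$1: $found (need >= $2)"
        return 1
    fi
}

# Example:
#   check_tool make 3.81
#   check_tool gcc 4.4.4
#   check_tool g++ 4.4.4
```

The comparison is done on the whole dotted version, so 4.10.0 is correctly treated as newer than 4.4.4.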

Deploy

An easy way to deploy a new basic D-thinker is to use the "make install" utility, which builds the components and distributes them to ~/think/ on the portal. In the root directory of the cloned dt repository, run

make install

This builds and installs the D-thinker components in ~/think on the portal, which is localhost (127.0.0.1). After installation, the "make install" utility runs a few tests to verify the correctness of the system.

You can fine-tune the behavior of 'make install' by specifying options as described in http://tab.d-thinker.org/showthread.php?tid=5046 .

You may try the deployed D-thinker by logging on to the portal using SSH (or, if your working node is the portal, by logging out and logging on again) and running this "Hello world" program, saved as helloworld.puc:

#include <stdio.h>
void main ()
{
    put8('H','e','l','l','o',' ','w','o');
    writeln4('r','l','d','!');
    commit;
}

You can compile and run the program as follows (the dt run command is described in detail later):

$ dcc helloworld.puc -o helloworld.bin
$ dt run helloworld.bin

It will print "Hello world!" to STDOUT if everything runs successfully.

How to configure

You have now deployed the D-thinker binaries/scripts to the portal and configured a working single-node D-thinker. You can use SSH to log on to the portal at <portal IP> and configure the D-thinker.

Configure conf/set-env.sh

The configuration of the D-thinker is in the conf directory in ~/think (or your specific "think base", the directory where the D-thinker is installed, specified in the environment variable $think_base). If you did not use the deploy script with the --newme option to set up the D-thinker, the conf directory may not exist yet, and you need to create it manually first.

If the file conf/set-env.sh does not exist, you can make a copy of it from the template in bin/set-env-template.sh.

Then, you can configure set-env.sh with information of the D-thinker following the instructions in the comments of the template file.
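
The two manual steps above (create conf, then copy the template) can be scripted so that an existing set-env.sh is never overwritten. This is only a sketch: the function name init_conf is hypothetical, and it assumes that both bin/set-env-template.sh and conf live directly under the think base.

```shell
# init_conf THINK_BASE: create conf/ and seed conf/set-env.sh from the template.
# An existing conf/set-env.sh is left untouched.
# Assumption: the template is at THINK_BASE/bin/set-env-template.sh.
init_conf() {
    base=$1
    mkdir -p "$base/conf" || return 1
    if [ ! -e "$base/conf/set-env.sh" ]; then
        cp "$base/bin/set-env-template.sh" "$base/conf/set-env.sh"
    fi
}

# Example:
#   init_conf "${think_base:-$HOME/think}"
```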

Configure the other files in the conf directory

You need to create these files under the conf directory: memhome, scheduler, vpcs.N, key, regions (optional) and nrcs.N (optional). Here, N is the number of VPCs in the D-thinker.

memhome

memhome contains the memory home's IP.

The file contains one line which is the IP of the home.

scheduler

scheduler contains the scheduler/portal's IP.

The file contains one line which is the IP of the scheduler/portal.
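
Since memhome and scheduler are both one-line files, a small helper can write them and catch obvious typos. The sketch below is illustrative only: write_ip_file is a made-up name, the addresses in the example are placeholders, and the pattern merely checks for a dotted-quad IPv4 shape (it does not verify that each number is at most 255).

```shell
# write_ip_file FILE IP: write a one-line file holding an IPv4 address.
# The pattern only checks the dotted-quad shape, not the numeric range.
write_ip_file() {
    if ! printf '%s\n' "$2" | grep -qE '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
        echo "not an IPv4 address: $2" >&2
        return 1
    fi
    printf '%s\n' "$2" > "$1"
}

# Example (placeholder addresses, replace with your own):
#   write_ip_file "$think_base/conf/memhome"   10.1.0.10
#   write_ip_file "$think_base/conf/scheduler" 10.1.0.14
```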

vpcs.N

N is the number of VPCs. You can have several vpcs.N files for different values of N. vpcs.N contains N specifications, one for each of the N VPCs that the D-thinker has. For example, vpcs.10 contains the specification of a D-thinker with 10 VPCs. The number of VPCs in a D-thinker is specified in set-env.sh or by command-line parameters introduced later.

The file vpcs.N contains N lines. Each line's format is:

in_{space}_{ip}_{slot_id}

Here, {space} is the space of the VPC in slot {slot_id} on {ip}. The '_' character separates the fields; instead of '_', you may also use a single space or ':' as the separator.

If you do not know which {space} to choose at this step yet, you can use SPACE(0) for all VPCs and change them when you know how to choose.

An example vpcs.10:

in_SPACE(0)_10.1.0.8_0
in_SPACE(0)_10.1.0.8_1
in_SPACE(0)_10.1.0.11_0
in_SPACE(0)_10.1.0.11_1
in_SPACE(10)_10.1.0.3_0
in_SPACE(10)_10.1.0.4_0
in_SPACE(10)_10.1.0.5_0
in_SPACE(10)_10.1.0.6_0
in_SPACE(10)_10.1.0.7_0
in_SPACE(10)_10.1.0.9_0
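
A file like the example above can be generated rather than typed by hand. The following sketch uses two hypothetical helpers: vpcs_lines prints the lines for a group of IPs that share a space, and check_vpcs confirms that a file named vpcs.N really has N lines. It assumes slot ids start at 0 and are consecutive on each IP, as in the example.

```shell
# vpcs_lines SPACE SLOTS IP...: print one vpcs line per slot on every IP.
# Assumption: slot ids on an IP are 0, 1, ..., SLOTS-1.
vpcs_lines() {
    space=$1
    slots=$2
    shift 2
    for ip in "$@"; do
        slot=0
        while [ "$slot" -lt "$slots" ]; do
            printf 'in_%s_%s_%s\n' "$space" "$ip" "$slot"
            slot=$((slot + 1))
        done
    done
}

# check_vpcs FILE: succeed when a file named vpcs.N holds exactly N non-empty lines.
check_vpcs() {
    want=${1##*.}
    have=$(awk 'NF { n++ } END { print n + 0 }' "$1")
    [ "$want" -eq "$have" ]
}

# Example: rebuild the vpcs.10 shown above, then verify its length.
#   { vpcs_lines 'SPACE(0)'  2 10.1.0.8 10.1.0.11
#     vpcs_lines 'SPACE(10)' 1 10.1.0.3 10.1.0.4 10.1.0.5 10.1.0.6 10.1.0.7 10.1.0.9
#   } > conf/vpcs.10
#   check_vpcs conf/vpcs.10 && echo "vpcs.10 has 10 lines"
```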

key

key contains the private key used to ssh to VPC nodes.

You can copy the $think_user's key .ssh/id_rsa to $conf/key, and make sure that the $think_user can password-less ssh to all nodes including those for the home and all VPCs. You may check "Password-less SSH login" for some instructions on password-less ssh login.
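
The copy and the login check can be sketched as below. Both function names are hypothetical. The ssh options used (-i, BatchMode, ConnectTimeout) are standard OpenSSH options; with BatchMode=yes, ssh fails instead of prompting, so a host that still needs a password is reported. A host whose host key has not been accepted yet may be reported as well.

```shell
# install_key PRIVATE_KEY CONF_DIR: copy the key to CONF_DIR/key and make it
# readable by its owner only (ssh rejects private keys with looser permissions).
install_key() {
    cp "$1" "$2/key" && chmod 600 "$2/key"
}

# check_ssh KEY HOST...: report hosts that refuse non-interactive login.
check_ssh() {
    key=$1
    shift
    failed=0
    for host in "$@"; do
        if ! ssh -i "$key" -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "password-less ssh failed: $host" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Example (placeholder paths and addresses):
#   install_key ~/.ssh/id_rsa "$think_base/conf"
#   check_ssh "$think_base/conf/key" 10.1.0.3 10.1.0.4 10.1.0.8
```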

regions (optional)

The file regions contains the IP-to-region mapping. Regions are used to place IPs (hosts) in different fault domains. A fault domain is a set of hosts whose faults are naturally correlated; for example, a fault domain can be a rack or a datacenter. When a network problem partitions a rack from the rest of the network, all IPs (servers) on that rack are affected and have correlated faults. If there is an electricity outage in an area, all IPs (servers) in a datacenter in that area are affected and have correlated faults. Hence, replicas of data are best directed to different fault domains (regions). Regions can have different properties, such as failure mode, and by configuring regions, the system can achieve high availability when certain failures occur. In a thinker, each IP belongs to one region.

If there is no regions file for a thinker, each IP is in one region.

The format of the regions file is as follows.

{M}
{region_1} {N_1} {ip_11} {ip_12} ... {ip_1{N_1}}
...
{region_M} {N_M} {ip_M1} {ip_M2} ... {ip_M{N_M}}

Here, {M} is the number of lines/regions defined in the configuration file. {region_x} is the region number, e.g. 0, 1 and 2. {N_x} is the number of IPs in {region_x}. {ip_xy} is the y-th IP in {region_x}.

An example regions file:

3
0 6 10.1.0.3 10.1.0.6 10.1.0.8 10.1.0.10 10.1.0.11 10.1.0.14
1 2 10.1.0.4 10.1.0.7
2 2 10.1.0.5 10.1.0.9
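
Because the format carries two kinds of counts ({M} and each {N_x}), it is easy for them to drift out of sync when the file is edited. The following awk-based check is a sketch, with check_regions being a hypothetical name: it compares both counts with what is actually in the file and also flags an IP listed in more than one region, since each IP belongs to one region.

```shell
# check_regions FILE: verify the header count {M}, every per-line count {N_x},
# and that no IP is listed twice. Prints one message per problem found.
check_regions() {
    awk '
        NR == 1 { want = $1; next }          # first line: number of regions
        NF == 0 { next }                     # ignore blank lines
        {
            lines++
            if ($2 != NF - 2) {              # fields after id and count are IPs
                printf "line %d: expected %d IPs, found %d\n", NR, $2, NF - 2
                bad = 1
            }
            for (i = 3; i <= NF; i++) {
                if ($i in seen) {
                    printf "line %d: %s already assigned to a region\n", NR, $i
                    bad = 1
                }
                seen[$i] = 1
            }
        }
        END {
            if (lines != want) {
                printf "header says %d regions, found %d\n", want, lines
                bad = 1
            }
            exit bad
        }
    ' "$1"
}

# Example:
#   check_regions "$think_base/conf/regions" && echo "regions file is consistent"
```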

nrcs.N (optional)

The nrcs.N file contains the NRCs' IPs, which are all the IPs of your VPCs. An NRC controls the resources of a physical node used by the D-thinker, and one or more VPCs may run on the node controlled by the NRC. Each line contains one IP, and there should be no duplicate IPs.

The nrcs.10 for the above vpcs.10 is:

10.1.0.8
10.1.0.11
10.1.0.3
10.1.0.4
10.1.0.5
10.1.0.6
10.1.0.7
10.1.0.9

Note that there are fewer than 10 lines in nrcs.10 in this example because there are VPCs on the same node with the same IP.
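
Since nrcs.N is just the list of distinct IPs in vpcs.N, it can be derived instead of maintained by hand. The one-line awk sketch below (nrcs_from_vpcs is a hypothetical name) accepts '_', ':' or a single space as the field separator and keeps the IPs in order of first appearance. It assumes the {space} field itself contains none of those separator characters.

```shell
# nrcs_from_vpcs FILE: print each IP in a vpcs.N file once, in order of first use.
# With "_", ":" or " " as separator, the IP is the third field of each line.
nrcs_from_vpcs() {
    awk -F'[_: ]' 'NF >= 4 && !seen[$3]++ { print $3 }' "$1"
}

# Example:
#   nrcs_from_vpcs conf/vpcs.10 > conf/nrcs.10
```

Applied to the example vpcs.10, this yields the eight lines of the nrcs.10 shown above.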

User-environment configuration

Your D-thinker needs certain environment variables to be set in your shell. The deploy or make install tools can automatically add the commands to your ~/.bashrc during the deployment process. If you choose 'N' during the deployment, you may configure your ~/.bashrc manually as follows.

Here, we assume that your D-thinker binaries/scripts are distributed to ~/think. If you distribute your D-thinker components to a different directory, replace ~/think/ in the following with your own directory path.

First, add this line to your ~/.bashrc (if you are using Bash):

export think_base=$HOME/think

Then, add ~/think/ and ~/think/bin/ to your $PATH.

export PATH=$PATH:$HOME/think:$HOME/think/bin

Here are the lines that you need to add to your ~/.bashrc in total:

export think_base=$HOME/think
export PATH=$PATH:$HOME/think:$HOME/think/bin

The next time you log in, the configuration will be included in your Bash environment automatically. To make it take effect in your current Bash session, run . ~/.bashrc.
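
If you script this step, it helps to append each line only when it is not already present, so that running the script again does not duplicate entries. A minimal sketch follows; add_line_once is a made-up helper name, not part of the dt tools.

```shell
# add_line_once FILE LINE: append LINE to FILE unless an identical line exists.
# grep -x matches whole lines only and -F treats LINE as a fixed string.
add_line_once() {
    touch "$1"
    grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

# Example (single quotes keep $HOME and $PATH unexpanded in the file):
#   add_line_once ~/.bashrc 'export think_base=$HOME/think'
#   add_line_once ~/.bashrc 'export PATH=$PATH:$HOME/think:$HOME/think/bin'
```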

How to run a program on a thinker

Under a normal user account, run

$ dt run [-n N] program-file.bin [STDIN_file]

Items in [] are optional. Here, N is the number of VPC nodes that you want to use, STDIN_file is the file that contains the STDIN and program-file.bin is the D-CISC binary file that you want to run.

Answer yes if you see messages like:

The authenticity of host '10.0.3.12 (10.0.3.12)' can't be established.
RSA key fingerprint is 9e:e0:f2:57:41:45:46:d1:f4:2b:c4:f6:1a:58:fc:b8.
Are you sure you want to continue connecting (yes/no)?

Correctness test

You can test whether your D-thinker is configured correctly using the correctness tests under $think_base/apps/tests.

You can start the test by:

$ make test

Advanced topics

The following topics are for advanced users only and may not be needed for normal configurations of D-thinker.

How to run more than one thinker on the same set of nodes

To configure two thinkers, A and B, to run on the same set of nodes, set them up as follows.

  • A and B use different nodes as their portals.

  • A and B use different nodes as their homes.

  • A and B avoid using the same VPC slots on the same node.

  • NRCs will not work correctly. Avoid using NRC functions.
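
The slot rule in the list above can be checked mechanically by comparing the vpcs files of the two thinkers. The sketch below uses a hypothetical helper, shared_slots, that prints every (IP, slot) pair used by both files; no output means there is no conflict. It assumes the vpcs line format described earlier, with '_', ':' or a single space as the separator.

```shell
# shared_slots FILE_A FILE_B: print every "ip slot" pair present in both files.
shared_slots() {
    awk -F'[_: ]' '
        NF < 4 { next }                            # skip blank or malformed lines
        NR == FNR { used[$3 " " $4] = 1; next }    # first file: remember its slots
        ($3 " " $4) in used { print $3, $4 }       # second file: report overlaps
    ' "$1" "$2"
}

# Example (placeholder paths for the two think bases):
#   shared_slots ~/thinkA/conf/vpcs.10 ~/thinkB/conf/vpcs.10
```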

System-wide installation

If you need a system-wide installation of the D-thinker system shared by multiple normal users, continue with the following steps.

User-environment configuration

If you have root or sudo privileges, you can configure the global thinker by:

$ sudo $think_base/bin/configure-global-thinker.sh

If you installed the D-thinker in your home directory, you need to open up the home directory's permissions (e.g., with sudo privileges, set its mode to 755 and add the dt users to its group) so that other users can access the executables in the think base directory.