|
1
|
- Andy Kowalski
- November 5, 2002
|
|
2
|
- Hardware
- Network
- Servers (JASMine)
- Farm Nodes
- Software
- LSF
- Configuration
- Priorities
- Queues
|
|
3
|
- Software cont.
- JobServer
- User Commands
- jsub - Command File Options
- Sample Jobs
- Life of a Job
- Submission
- Dependencies
- jcache (pre-staging data files)
- Execution
- Completion
|
|
4
|
|
|
5
|
- Foundry BigIron 8000
- 64 Gigabit Ethernet ports
- Foundry BigIron 15000
- 120 Gigabit Ethernet ports
- Cisco 2900 XL and 2950 Switches
- Gigabit Ethernet uplinks
- 24 100 Mbit/s Ethernet ports
- Servers Connected via Gigabit Ethernet
- Farm Nodes Connected via 100 Mbit/s Ethernet
|
|
6
|
- 2 StorageTek Powderhorn Tape Libraries
- 6000 tape cartridge capacity each
- 8 Redwood tape drives (EOL 1/03)
- 10 9840 tape drives
- 15 9940A tape drives
- 5 9940B tape drives (Testing)
- 720TB Capacity with 9940A tapes
- 2.4PB Capacity with 9940B tapes
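The two capacity figures follow from the cartridge counts. A quick check, assuming native (uncompressed) capacities of 60 GB per 9940A cartridge and 200 GB per 9940B cartridge; those per-cartridge figures are not stated on the slide:

```python
# Capacity check for the two Powderhorn libraries.
# Assumption (not on the slide): 60 GB per 9940A cartridge and
# 200 GB per 9940B cartridge, native, with 1 TB = 1000 GB.
LIBRARIES = 2
SLOTS_PER_LIBRARY = 6000
GB_PER_CARTRIDGE = {"9940A": 60, "9940B": 200}


def total_capacity_tb(drive_type: str) -> float:
    """Total capacity in TB if every slot holds a cartridge of this type."""
    return LIBRARIES * SLOTS_PER_LIBRARY * GB_PER_CARTRIDGE[drive_type] / 1000


for drive_type in GB_PER_CARTRIDGE:
    print(f"{drive_type}: {total_capacity_tb(drive_type):.0f} TB")
```

Under these assumptions the 9940A figure comes out at 720 TB and the 9940B figure at 2400 TB (2.4 PB), matching the slide.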
|
|
7
|
- 12 Data Movers
- Sun E4000 - 200GB stage area
- Sun E3000 - 220GB stage area
- 10 Linux PCs - 130GB to 400GB stage area
|
|
8
|
- Disk Caches of Files already on Tape
- 18 Linux NFS File Servers
- ~14TB of usable disk space
- Mixture of RAID-0 and RAID-5
- Mixture of SCSI and IDE based systems
|
|
9
|
- Used to Pre-Stage Files from Tape for the Farm Jobs
- 4 Linux NFS File Servers
- ~1.5TB of usable disk space
- RAID-0
- Older SCSI based systems
|
|
10
|
- Store Data/Results from the Farm
- Not Backed Up
- 4 Symbios File Servers
- Being decommissioned now (5 years old)
- HallA and HallC
- 8 Linux NFS File Servers
- ~12TB of usable disk space
- HallA 4TB, HallC 2TB, CLAS 4TB
- RAID-5 only
- Mixture of SCSI and IDE based systems
|
|
11
|
- Work and Cache File Server Hardware is Now the Same
- File Servers May Not be Separated by Their Function in the Future
- 6 Additional Servers Arriving Soon
- ~12TB of usable disk space
- Can be used as cache or work space
|
|
12
|
- Linux Only
- Farm (14,304 SPECint95):
- 20 - Dual 450MHz - 688 SPECint95
- 25 - Dual 500MHz - 1030 SPECint95
- 50 - Dual 750MHz - 3560 SPECint95
- 60 - Dual 1GHz - 5526 SPECint95
- 20 - Dual 1.8GHz (Xeon) - 3500 SPECint95
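The farm total can be reproduced from the per-group figures. A small sketch that sums the listed groups and derives a per-node figure (the per-node division is an illustration, not something stated on the slide):

```python
# (node count, description, aggregate SPECint95) as listed on the slide.
FARM = [
    (20, "Dual 450MHz", 688),
    (25, "Dual 500MHz", 1030),
    (50, "Dual 750MHz", 3560),
    (60, "Dual 1GHz", 5526),
    (20, "Dual 1.8GHz (Xeon)", 3500),
]

total_nodes = sum(count for count, _, _ in FARM)
total_spec = sum(spec for _, _, spec in FARM)
print(f"{total_nodes} nodes, {total_spec} SPECint95 in total")

# Per-node rating for each group: aggregate divided by node count.
for count, description, spec in FARM:
    print(f"{description}: {spec / count:.1f} SPECint95 per node")
```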
|
|
13
|
- Memory
- Disk Space
- Each Node Runs 3 Jobs
- Each job is expected to use no more than 5GB
|
|
14
|
- LSF (Load Sharing Facility)
- Commercial product from Platform Computing Corp.
- Manages job queues and schedules job execution
- Configuration
- A total of 3 running jobs per node
- A limit of 2 jobs per user per node
- Fair Share scheduling
- Group/User Priority
- user_share / (0.01 + cpu_time * CPU_Factor + run_time * Run_Factor + run_jobs * Job_Factor)
- Job_Factor is set to 0.30
- CPU_Factor and Run_Factor are set to 0
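A minimal sketch of the priority expression exactly as written on this slide, with the farm's factor settings as defaults. The function and argument names are illustrative, not LSF identifiers:

```python
def fairshare_priority(user_share, run_jobs, cpu_time=0.0, run_time=0.0,
                       cpu_factor=0.0, run_factor=0.0, job_factor=0.30):
    """Priority as written on the slide; defaults are the farm's settings."""
    return user_share / (0.01
                         + cpu_time * cpu_factor
                         + run_time * run_factor
                         + run_jobs * job_factor)


# With CPU_Factor and Run_Factor at 0, only the share and the number of
# running jobs matter: a large share with many running jobs can rank below
# a small share with few running jobs.
print(fairshare_priority(user_share=80, run_jobs=10))
print(fairshare_priority(user_share=20, run_jobs=2))
```

Because the CPU and run-time factors are zero, accumulated usage has no effect here; priority drops only as a user or group gets more jobs running.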
|
|
15
|
- Job Queues
- idle
- FIFO scheduling
- Dispatched to idle farm nodes
- Jobs are preemptable
- low_priority
- Round Robin scheduling
- Similar to the idle queue but the jobs are not preemptable
- CPU limit of 48 hours (2880 minutes) on a 750MHz farm node
- Must include the TIME keyword in the jsub command file
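The CPU limit is quoted relative to a 750MHz farm node, so the equivalent limit on a faster or slower node differs. A rough sketch of that scaling, assuming node speed is proportional to the per-node SPECint95 figures from the farm node slide; LSF itself uses administrator-configured CPU factors, which these slides do not list:

```python
# Per-node SPECint95 (aggregate / node count) from the farm node slide.
# Assumption: relative speed is proportional to this figure. The real
# scaling depends on the CPU factors configured in LSF.
PER_NODE_SPEC = {
    "450MHz": 688 / 20,
    "500MHz": 1030 / 25,
    "750MHz": 3560 / 50,
    "1GHz": 5526 / 60,
    "1.8GHz": 3500 / 20,
}
REFERENCE_NODE = "750MHz"


def scaled_limit_minutes(limit_on_reference, node_type):
    """Estimate the CPU limit on node_type for a limit quoted on the reference node."""
    ratio = PER_NODE_SPEC[REFERENCE_NODE] / PER_NODE_SPEC[node_type]
    return limit_on_reference * ratio


for node_type in PER_NODE_SPEC:
    print(f"{node_type}: about {scaled_limit_minutes(2880, node_type):.0f} minutes")
```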
|
|
16
|
- Job Queues cont.
- production
- Fair share scheduling
- 80% CLAS, 20% HallA/C
- 70% CLAS production users, 30% other CLAS users
- Round Robin between Halls A and C
- priority
- FIFO scheduling
- CPU limit of 30 minutes on a 750MHz farm node
- Must include the TIME keyword in the jsub command file
- jcache
- Used by the system
- Pre-stage files from tape to the farm cache disks
|
|
17
|
- JLab Developed Front End to LSF
- Written in Java
- Works around LSF license limitations
- Creates job dependencies when pre-staging files from tape
- Creates job clusters
- Many jobs created from 1 jsub
|
|
18
|
- jobstat
- Lists status information about farm jobs
- jobstat [-h] [-w | -l] [-a] [-d] [-p] [-s] [-r] [-u username | -u all] [-J jobname] [jobId ...]
- -w wide format
- -l long format - displays full details for each job
- -a show information about all jobs
- -d show recently finished jobs only
- -p show only pending jobs - and the reasons why
- -s show only suspended jobs
- -r show only running jobs
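For scripted monitoring, the options above can be assembled programmatically. A minimal sketch that uses only the flags listed on this slide; the helper names and the `only` argument are hypothetical, and `run_jobstat` works only on a host where jobstat is installed:

```python
import subprocess

# Map a readable filter name to the jobstat flag documented on the slide.
FILTER_FLAGS = {"pending": "-p", "suspended": "-s", "running": "-r", "finished": "-d"}


def jobstat_argv(user=None, only=None, long_format=False, job_ids=()):
    """Build a jobstat command line as a list of arguments."""
    argv = ["jobstat"]
    if long_format:
        argv.append("-l")
    if user is not None:
        argv += ["-u", user]
    if only is not None:
        argv.append(FILTER_FLAGS[only])
    argv += [str(job_id) for job_id in job_ids]
    return argv


def run_jobstat(**kwargs):
    """Run jobstat and return its standard output as text."""
    result = subprocess.run(jobstat_argv(**kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout


print(jobstat_argv(user="clase5", only="running"))
```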
|
|
19
|
- ifarms1> jobstat -u clase5 -r
- JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
- 303901 clase5 RUN production farms0-old. farml115.jl *24038.A24 Oct 15 11:13
- 304520 clase5 RUN production farms0-old. farml109.jl *24035.A08 Oct 15 21:40
- 304521 clase5 RUN production farms0-old. farml109.jl *24035.A09 Oct 15 21:40
- A State/Status of UNKWN means the LSF server cannot communicate with the farm node and is not getting status updates. Since the farm node cannot be contacted, the job cannot be killed. This is almost always because the farm node has crashed.
- ifarms1> jobstat -u clase6 -p 332914
- JOBID USER STAT QUEUE FROM_HOST JOB_NAME SUBMIT_TIME
- 332914 clase6 PEND production farms0-old.jla Run31972 Nov 4 15:56
- Job dependency condition not satisfied;
- ifarms1> jobstat -u clase6 -p 332282
- JOBID USER STAT QUEUE FROM_HOST JOB_NAME SUBMIT_TIME
- 332282 clase6 PEND production farms0-old.jla Run31953 Nov 4 10:43
- Load information unavailable: 15 hosts;
- Unable to reach slave batch server: 3 hosts;
- Job slot limit reached: 142 hosts;
|
|
20
|
- ifarms1> jobstat -u clase5 -l 303901
- Job <303901>, Job Name <e5cook_clas_024038.A24>, User <clase5>, Project <clas>, Mail <clase5@jlab.org>, Status <RUN>, Queue <production>, Command <# LSF script - generated by JOBS Oct 15 2002 11:13:10;#!/bin/csh;#BSUB -J e5cook_clas_024038.A24;#BSUB -P clas;#BSUB -R "linux";#BSUB -c 100000;#BSUB -u clase5@jlab.org;setenv JOB_ID $LSB_JOBID;/apps/bin/rfcp mss1.jlab.org:/cache/ms>
- Tue Oct 15 11:13:11: Submitted from host <farms0-old.jlab.org>, CWD </tmp>, Requested Resources <linux>;
- CPULIMIT
- 100000.0 min of farml115
- Tue Oct 15 11:18:25: Started on <farml115.jlab.org>, Execution Home </home/clase5>, Execution CWD </tmp>;
- Mon Nov 4 15:40:52: Resource usage collected.
- MEM: 372 Kbytes; SWAP: 1 Mbytes
- PGID: 4129; PIDs: 4143
- SCHEDULING PARAMETERS:
- r15s r1m r15m ut pg io ls it tmp swp mem
- loadSched - - - - - - - - - - -
- loadStop - - - - - - - - - - -
|
|
21
|
- jkill
- Kill queued and executing batch jobs
- jkill [ -s signal ] jobid1 [ jobid2 ] ... [ jobidN ]
- -s signal the signal used to kill the job
- jobid the id of the job as seen with jobstat
|
|
22
|
- farmqueues
- Displays information about queues
- farmqueues [-hwlr] [-m hostname] [-u username] [queuename ...]
- -w wide format
- -l long format - displays full details for each queue
- -r if fairshare is defined for the queue, displays recursively the share account tree
- -m hostname displays the queues that can run jobs on the specified host or host group
- -u username displays the queues that can accept jobs from the specified user
|
|
23
|
- ifarms1> farmqueues
- QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
- priority 100 Open:Active - - - - 1 1 0 0
- jcache 100 Open:Active - - - - 25 1 24 0
- FARM_TEST 100 Open:Active - - 1 - 0 0 0 0
- production 80 Open:Active - - - - 939 516 423 0
- clas_g6c 80 Open:Active - - - 2 0 0 0 0
- low_priority 5 Open:Active - - 1 - 2 1 1 0
- idle 1 Open:Active - - 1 - 280 42 16 222
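The queue listing is whitespace-separated, so it is straightforward to process in a script. A sketch that parses the sample output above (copied into a string, one queue per line) and checks that NJOBS equals PEND + RUN + SUSP for every queue; the parser is illustrative and assumes the column layout shown here:

```python
SAMPLE = """\
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
priority 100 Open:Active - - - - 1 1 0 0
jcache 100 Open:Active - - - - 25 1 24 0
FARM_TEST 100 Open:Active - - 1 - 0 0 0 0
production 80 Open:Active - - - - 939 516 423 0
clas_g6c 80 Open:Active - - - 2 0 0 0 0
low_priority 5 Open:Active - - 1 - 2 1 1 0
idle 1 Open:Active - - 1 - 280 42 16 222
"""

NUMERIC_COLUMNS = ("PRIO", "NJOBS", "PEND", "RUN", "SUSP")


def parse_farmqueues(text):
    """Turn farmqueues output into a list of dicts keyed by column name."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        for column in NUMERIC_COLUMNS:
            row[column] = int(row[column])
        rows.append(row)
    return rows


queues = parse_farmqueues(SAMPLE)
for queue in queues:
    # Every job in a queue is pending, running, or suspended.
    assert queue["NJOBS"] == queue["PEND"] + queue["RUN"] + queue["SUSP"]

total_running = sum(queue["RUN"] for queue in queues)
print(f"{total_running} jobs running across {len(queues)} queues")
```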
|
|
24
|
- ifarms1> farmqueues -r production
- QUEUE: production
- -- Production batch jobs - the majority of the work load. This is the default queue.
- PARAMETERS/STATISTICS
- PRIO NICE STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SSUSP USUSP RSV
- 80 5 Open:Active - - - - 935 513 422 0 0 0
- SCHEDULING PARAMETERS
- r15s r1m r15m ut pg io ls it tmp swp mem
- loadSched - - - - - - - - - - -
- loadStop - - - - - - - - - - -
- SCHEDULING POLICIES: FAIRSHARE
- USER_SHARES: [clas_grp, 80] [hall_grp, 20] [others, 1]
- SHARE_INFO_FOR: production/
- USER/GROUP SHARES PRIORITY STARTED RESERVED CPU_TIME RUN_TIME
- others 1 3.333 0 0 0.0 0
- hall_grp 20 1.333 49 0 1202784.5 48385432
- clas_grp 80 0.713 373 0 92687192.0 213897814
|
|
25
|
- SHARE_INFO_FOR: production/hall_grp/
- USER/GROUP SHARES PRIORITY STARTED RESERVED CPU_TIME RUN_TIME
- hallb_grp 1 0.278 11 0 1114696.2 1497154
- hallc_grp 1 0.167 19 0 71424.9 19528535
- halla_grp 1 0.167 19 0 16663.3 27359743
- SHARE_INFO_FOR: production/clas_grp/
- USER/GROUP SHARES PRIORITY STARTED RESERVED CPU_TIME RUN_TIME
- clas_users 30 1.639 60 0 78108512.0 129045653
- clas_prod 70 0.743 313 0 14578609.0 84852161
- SHARE_INFO_FOR: production/clas_grp/clas_users/
- USER/GROUP SHARES PRIORITY STARTED RESERVED CPU_TIME RUN_TIME
- luminita 1 3.333 0 0 0.0 0
- bellis 1 0.175 18 0 8380542.5 19749789
- rakhsha 1 0.119 27 0 64043392.0 90224603
- SHARE_INFO_FOR: production/clas_grp/clas_prod/
- USER/GROUP SHARES PRIORITY STARTED RESERVED CPU_TIME RUN_TIME
- clase2 1 0.064 51 0 560487.1 2306963
- clase1-6 1 0.037 89 0 8942625.0 31715359
- clase6 1 0.020 168 0 5008346.0 42207491
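The PRIORITY column in these listings appears consistent with SHARES / ((1 + STARTED) * Job_Factor) using Job_Factor = 0.30. This differs slightly from the expression on the configuration slide (no 0.01 term, and one added to the job count). It is an observation from the numbers shown, not a statement about LSF internals. A quick check against a few rows:

```python
JOB_FACTOR = 0.30

# (name, SHARES, PRIORITY as shown, STARTED) copied from the listings above.
ROWS = [
    ("others", 1, 3.333, 0),
    ("hall_grp", 20, 1.333, 49),
    ("clas_grp", 80, 0.713, 373),
    ("clas_users", 30, 1.639, 60),
    ("clas_prod", 70, 0.743, 313),
    ("clase6", 1, 0.020, 168),
]


def observed_priority(shares, started):
    """Expression that reproduces the PRIORITY column in the sample output."""
    return shares / ((1 + started) * JOB_FACTOR)


for name, shares, shown, started in ROWS:
    computed = observed_priority(shares, started)
    print(f"{name:<11} shown={shown:.3f} computed={computed:.3f}")
```

Since CPU_Factor and Run_Factor are 0, the CPU_TIME and RUN_TIME columns do not enter the calculation.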
|
|
26
|
- farmhosts
- Displays hosts and their static and dynamic resources
- farmhosts [-w | -l] [-R res_req] [host_name]
- -w wide format
- -l long format - displays full details for each host
- -R res_req displays information about hosts that satisfy the resource requirement
|
|
27
|
- ifarms1> farmhosts -w
- HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
- farml1.jlab.org closed_Full 3 4 5 4 1 0 0
- farml101.jlab.org closed_Full 2 3 5 3 2 0 0
- farml102.jlab.org closed_Full 2 3 3 3 0 0 0
- farml103.jlab.org closed_Full 2 3 5 3 2 0 0
- farml104.jlab.org closed_Full 2 3 3 3 0 0 0
- farml109.jlab.org closed_LIM 2 3 3 3 0 0 0
- farml148.jlab.org closed_Adm 2 3 4 3 1 0 0
- farml149.jlab.org closed_Adm 2 3 4 2 2 0 0
- farml15.jlab.org closed_Full 3 4 7 4 3 0 0
- farml150.jlab.org closed_Adm 2 3 3 3 0 0 0
- farml147.jlab.org unavail 2 3 1 1 0 0 0
- farml151.jlab.org unreach 2 3 2 2 0 0 0
|
|
28
|
- jsub
- Batch job submission command
- Takes a command file as an argument
- The command file describes the job(s)
- Command File Keywords
- http://cc.jlab.org/docs/scicomp/how-to/keywords.html
|
|
29
|
|
|
30
|
|
|
31
|
|
|
32
|
|
|
33
|
|
|
34
|
|
|
35
|
- Submission
- Dependencies
- jcache jobs will be created if the input files are stub files (files on tape).
- Sorry, no easy way currently exists to map jcache jobs to their matching farm jobs.
- The job or jobs will not start until the jcache job exits.
- A jcache job is created for every 10 input files
- Available farm nodes
- 3 jobs per node
- 2 jobs per node per user
|
|
36
|
- Execution
- Monitor CPU usage with jobstat
- Default shell is /bin/csh
- Environment Variables
- WORKDIR
- JOB_ID
- JOB_SEQUENCE
- Completion
- LSF emails the user after each job is completed
- The email contains STDOUT messages
- All users in the MAIL list get a summary email from the JobServer once the jobs are done.
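A job can read the environment variables listed above at run time. A minimal sketch of a script that a job might run to record them; the exact meaning and format of each value is not described on the slide, so they are treated as opaque strings:

```python
import os

JOB_VARIABLES = ("WORKDIR", "JOB_ID", "JOB_SEQUENCE")


def job_context(environ=os.environ):
    """Collect the farm job variables; a missing variable maps to None."""
    return {name: environ.get(name) for name in JOB_VARIABLES}


if __name__ == "__main__":
    for name, value in job_context().items():
        print(f"{name}={value if value is not None else '(not set)'}")
```

Run outside the farm, all three print as "(not set)", which makes the script safe to test interactively before submitting.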
|