Wayne State University

High Performance Computing Services

Troubleshooting

Q: 'ERROR MESSAGE'

A: 'EXPLANATION'

COA: 'RECOMMENDED COURSE OF ACTION'

Q: I get this error when I try to log on to the Grid:

Connecting to grid.wayne.edu...
ssh_exchange_identification: Connection closed by remote host
Couldn't read packet: Connection reset by peer

A: This error was reported while the Grid was experiencing network problems due to a hardware failure.

COA: Try again in fifteen minutes. If you still receive the error, email michael@wayne.edu to report the problem.

Q: My graphical application will not connect. What could be wrong?

A: This could be a firewall problem.

COA: If your computer is behind a firewall, make sure that the firewall allows X11 traffic to reach it. If you are an Engineering student at Wayne State University, apply for a public IP address and firewall access for X11 traffic by filling out the Firewall Request form available here: http://ats.eng.wayne.edu/

Q: My connection key has changed. Should I be worried?

A: No, do not worry. Your computer keeps a cached copy of the Grid's connection key from the last time it connected. When parts of the Grid are upgraded, the key may change. Because the new key on the Grid no longer matches the copy cached on your computer, your client warns you of a possible security hazard. Save the new key and the message will go away. You may see this message again in the future, but it should always coincide with an upgrade to the Grid.

COA: Save the key and continue.

Q: I can ssh into grid.isc.wayne.edu but not into grid.wayne.edu. I get this error:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The RSA host key for grid.wayne.edu has changed, and the key for the
according IP address 141.217.32.125 is unknown. This could either mean that
DNS SPOOFING is happening or the IP address for the host and its host key
have changed at the same time.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
4c:6f:d4:52:3f:9d:30:59:28:07:00:d6:94:a0:37:4e.
Please contact your system administrator.
Add correct host key
in /export/home1/YOURUSERNAME/.ssh/known_hosts to get rid of this message.
Offending key in /export/home/abhishek/.ssh/known_hosts:18
RSA host key for grid.wayne.edu has changed and you have requested strict
checking.
Host key verification failed.

What should I do?

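A: Your SSH client has saved an old host key for grid.wayne.edu that no longer matches the key the Grid presents. This is a configuration issue with your client software and not a problem with the Grid.

COA: Remove the outdated entry for grid.wayne.edu from /export/home/YOURUSERNAME/.ssh/known_hosts, then connect again and accept the new key. The error message gives the line number of the entry after the colon (18 in the example above). Where OpenSSH is installed you can also type: ssh-keygen -R grid.wayne.edu. Deleting the whole directory /export/home/YOURUSERNAME/.ssh/ also gets rid of the old keys, but it removes every other key and SSH setting stored there as well. Turning off strict checking in your SSH configuration, as the error message indicates, silences the message at the cost of weaker protection against the attack it warns about. If you still cannot connect, either contact your local system administrator and ask them to fix the problem, or try connecting from a PC using the software referenced on the Grid website.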
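
A narrower fix than clearing every saved key is to delete only the entry that the error message identifies. The following is a minimal sketch, assuming a POSIX shell and a sed that accepts -i with a backup suffix. The function name drop_known_hosts_line is made up for illustration. It removes one numbered line from a known_hosts file and keeps a backup copy beside it.

```shell
# Hypothetical helper: delete line number $2 from the known_hosts file $1.
# A copy of the original file is kept next to it with a .bak suffix.
drop_known_hosts_line() {
    file=$1
    line=$2
    # Refuse anything that is not a plain run of digits.
    case "$line" in
        ''|*[!0-9]*) echo "not a line number: $line" >&2; return 1 ;;
    esac
    sed -i.bak "${line}d" "$file"
}

# Example for the message above ("Offending key in .../known_hosts:18"):
#   drop_known_hosts_line "$HOME/.ssh/known_hosts" 18
```

After the entry is gone, the next connection asks you to accept the current key, just as a first-time connection does.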

Q: My OpenMP code will only run on 1 processor with 1 thread. How do I run OpenMP code on 4 nodes with 2 threads per node?

A: OpenMP is a shared-memory model: its threads run within a single node and cannot communicate between nodes. MPI handles communication between nodes. An OpenMP-only program submitted as a multi-node job will fail in PBS.

COA: Use MPI to run a program on multiple nodes and use OpenMP to spawn multiple threads within each node.
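
For the 4-node, 2-thread case in the question, a hybrid job script could look roughly like the sketch below. It assumes PBS Professional "select" syntax, an MPI launcher called mpirun, and a program built with both MPI and OpenMP. The name hybrid_program is a placeholder. Check the PBS tutorial for the exact resource request the Grid expects.

```shell
#!/bin/bash
# Sketch of a hybrid MPI + OpenMP job script (assumed PBS Professional syntax).
# Request 4 chunks placed on separate nodes, each with 2 cores,
# 1 MPI rank, and 2 OpenMP threads.
#PBS -l select=4:ncpus=2:mpiprocs=1:ompthreads=2
#PBS -l place=scatter

# PBS sets PBS_O_WORKDIR to the directory the job was submitted from.
cd "${PBS_O_WORKDIR:-.}"

# Each MPI rank starts this many OpenMP threads.
export OMP_NUM_THREADS=2

# Launch one rank per node. hybrid_program is a placeholder name and
# mpirun is assumed to be the MPI launcher available on the system.
if [ -x ./hybrid_program ]; then
    mpirun -np 4 ./hybrid_program
else
    echo "hybrid_program not found in $(pwd)" >&2
fi
```

MPI carries the communication between the four nodes, while OpenMP only manages the two threads inside each rank.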

Q: I received the following error after submitting a job:

/var/spool/PBS/mom_priv/jobs/XXXXX.pbs.SC: line 13: /tmp/pbs.XXXXX.pbs/hello: cannot execute binary file

A: You may be trying to run 64-bit code on a 32-bit node.

COA: Make sure that your executable matches the architecture of the node it runs on. Use the file command. Type: file codename (where codename is the name of the file to be run). This will tell you whether the file is 32-bit or 64-bit. Then refer to the diagram on the PBS TUTORIAL page to check the architecture of the queues.
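
The same check can be scripted without parsing the text that file prints. The sketch below is an alternative to the file command, not a replacement for it: it reads the class byte of the ELF header directly, assuming the executable is an ELF binary (the usual format on Linux) and that od is available. The function name elf_bits is hypothetical.

```shell
# Print 32 or 64 for an ELF executable, or "unknown" for anything else.
# The fifth byte of an ELF file (offset 4) is its class: 1 = 32-bit, 2 = 64-bit.
elf_bits() {
    # The first four bytes of every ELF file are 0x7f 'E' 'L' 'F'.
    magic=$(od -An -tx1 -N4 "$1" 2>/dev/null | tr -d ' \n')
    [ "$magic" = "7f454c46" ] || { echo unknown; return; }
    class=$(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n')
    case "$class" in
        1) echo 32 ;;
        2) echo 64 ;;
        *) echo unknown ;;
    esac
}

# Hardware name of the machine this runs on, for comparison with the binary.
uname -m

# Example: elf_bits ./hello
```

If elf_bits reports 64 for a job that lands on a 32-bit queue, rebuild the program for 32-bit or submit it to a 64-bit queue.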

Q: I get this error when I try to copy, modify, move, or remove files:

[zz5555@Commodore ~]$ cp cp_source cp_test
cp: closing `cp_test': Disk quota exceeded

[zz5555@Commodore ~]$ rm cp_test
rm: cannot remove `cp_test': Disk quota exceeded

A: This error occurs when you have used up your allotment of requested disk space. To see your current usage, refer to HOW CAN I CHECK MY DISK QUOTA?

COA: Request more disk space by visiting the Grid Account Application page and increasing your requested disk space amount. Once your quota has been increased, erase unnecessary files and make a habit of monitoring your disk space usage. You may visit the DISK QUOTA TUTORIAL if you need more information.
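
To decide which files to erase, it helps to see where the space is going. The sketch below is one way to do that with the standard du, sort, and head commands. The function name largest_entries is hypothetical, and the sizes it prints are disk usage in kilobytes, not your quota itself.

```shell
# List the entries directly under directory $1 (default: your home directory),
# largest first, with sizes in kilobytes. $2 limits the number of lines shown
# (default: 10). Hidden files and directories are not included.
largest_entries() {
    dir=${1:-$HOME}
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Example: show the five largest items in your home directory.
#   largest_entries "$HOME" 5
```

Running this from time to time is a simple way to build the monitoring habit recommended above.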

Q: I get this error in the error file when I run my job script:

/var/spool/PBS/mom_priv/jobs/236610.pbs.SC: line 1: ÿþ#: command not found

/var/spool/PBS/mom_priv/jobs/195447.vpbs1.SC: /bin/bash^M: bad interpreter: No such file or directory

A: It is likely that you wrote your job script in WordPad or Microsoft Word and saved it as Unicode. The script then contains extra special characters that bash cannot interpret: ÿþ at the start of the file is a Unicode byte order mark, and ^M is the carriage return that Windows editors place at the end of every line. Copying and pasting the script from those word processors can introduce the same characters.

COA: Check whether your script has extra characters. Type: vi -b jobscriptname. If it does, rewrite your job script directly in nano or vim. See the tutorials here: nano and vim. If you choose to write it elsewhere and use ftp to bring it to the Grid, make sure that the encoding is not Unicode, and avoid Microsoft Word and WordPad. You can also use the dos2unix command to convert Windows line endings. Type: dos2unix jobscriptname. This removes the ^M characters.
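
The check can also be done without opening an editor. The sketch below assumes a POSIX shell with od, grep, and tr. The function names script_problem and strip_cr are hypothetical. script_problem reports whether a file starts with a UTF-16 byte order mark or contains carriage returns, and strip_cr removes the carriage returns in the same spirit as dos2unix.

```shell
# Report "utf16" if the file starts with a UTF-16 byte order mark,
# "crlf" if it contains carriage returns, and "ok" otherwise.
script_problem() {
    first2=$(od -An -tx1 -N2 "$1" | tr -d ' \n')
    case "$first2" in
        fffe|feff) echo utf16; return ;;
    esac
    if grep -q "$(printf '\r')" "$1"; then
        echo crlf
    else
        echo ok
    fi
}

# Remove every carriage return from the file, rewriting it in place.
strip_cr() {
    tr -d '\r' < "$1" > "$1.unix" && mv "$1.unix" "$1"
}

# Example:
#   script_problem jobscriptname
#   strip_cr jobscriptname
```

A file reported as utf16 is best retyped or re-saved as plain text, since removing carriage returns alone does not change its encoding.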

Q: My process was killed and/or has an Exit Code of 137.

A: More than likely the job used more memory than you requested. Exit code 137 is 128 + 9, which means the process was terminated by signal 9 (SIGKILL).

COA: You can check to make sure this is the case by typing the following command:

qstat -xf $PBS_JOBID

where $PBS_JOBID is the job number for the job that died. Then compare the resources_used.vmem value to the Resource_List.mem value. If the two values are the same, your job used all of the memory available to it.

The fix is to request more memory. The default amount of memory assigned to a job is 1 GB. To learn how to submit jobs correctly, see How to Run Jobs on the Grid (Step-by-Step) in the tutorial section of the website.
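
Both steps of this diagnosis can be scripted. The sketch below assumes a POSIX shell and assumes that qstat prints the two fields as resources_used.vmem and Resource_List.mem. The function names signal_from_exit and memory_fields are hypothetical, and 123456 stands in for a real job number.

```shell
# Convert an exit status above 128 into the number of the signal that ended
# the process. Prints 0 when the status does not indicate a signal.
signal_from_exit() {
    if [ "$1" -gt 128 ]; then
        echo $(( $1 - 128 ))
    else
        echo 0
    fi
}

# Keep only the two memory lines from qstat output read on standard input.
memory_fields() {
    grep -E 'resources_used\.vmem|Resource_List\.mem'
}

signal_from_exit 137    # prints 9, the number of SIGKILL

# Example with a placeholder job number:
#   qstat -xf 123456 | memory_fields
```

Seeing the two values side by side makes it easier to choose a larger memory request for the next submission.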

Computer Center Services * 5925 Woodward Avenue Detroit, Michigan 48202