MPI cluster error

Asked by Amin Aboubrahim on 2017-11-10

Hi Olivier,

I have made the necessary changes to the class OneCore to match my cluster. I set the mpirun command to:
mpirun -n 72 --hostfile ./myhostfile
where myhostfile contains the nodes. When I tried to generate the gridpack, I got this message:

Plugin PLUGIN.mpi_plugin has marked as NOT being validated with this version.
It has been validated for the last time with version: 2.5.3

followed by:

INFO: cluster handling will be done with PLUGIN: mpi_plugin
####################### class OneCore
INFO: cluster handling will be done with PLUGIN: mpi_plugin
####################### class OneCore
generate_events -f
Generating gridpack with run name run_01
survey run_01 --accuracy=0.01 --points=2000 --iterations=8 --gridpack=.true.
INFO: compile directory
compile Source Directory
Using random number seed offset = 21
INFO: Running Survey
Creating Jobs
Working on SubProcesses
INFO: P1_gg_ttxg
/GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/testrun/SubProcesses/survey.sh
mpirun -n 72 --hostfile ./myhostfile
INFO: P1_gq_ttxq
rm: cannot remove ‘results.dat’: No such file or directory
ERROR DETECTED
INFO: P1_qq_ttxg
rm: cannot remove ‘results.dat’: No such file or directory
ERROR DETECTED
INFO: P0_gg_ttx
rm: cannot remove ‘results.dat’: No such file or directory
ERROR DETECTED
INFO: P0_qq_ttx
rm: cannot remove ‘results.dat’: No such file or directory
ERROR DETECTED
INFO: Idle: 0, Running: 1, Completed: 4 [ current time: 16h39 ]
rm: cannot remove ‘results.dat’: No such file or directory
ERROR DETECTED
INFO: Idle: 0, Running: 0, Completed: 5 [ 0.12s ]
INFO: Idle: 0, Running: 0, Completed: 5 [ 0.12s ]
Error when reading /GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/testrun/SubProcesses/P1_gg_ttxg/G1/results.dat
Error when reading /GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/testrun/SubProcesses/P1_gq_ttxq/G1/results.dat
Error when reading /GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/testrun/SubProcesses/P1_qq_ttxg/G1/results.dat
Error when reading /GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/testrun/SubProcesses/P0_gg_ttx/G1/results.dat
Error when reading /GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/testrun/SubProcesses/P0_qq_ttx/G1/results.dat
Command "generate_events -f" interrupted with error:
ValueError : need more than 5 values to unpack
Please report this bug on https://bugs.launchpad.net/mg5amcnlo
More information is found in '/GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/testrun/run_01_tag_1_debug.log'.
Please attach this file to your report.

If necessary, I can email you the log file of the run.
It might be a problem with the cluster, so I just want to make sure it is not coming from MadGraph, given the message at the beginning about the validation of the plugin.

Thank you very much.
Amin

Question information

Language: English
Status: Answered
For: MadGraph5_aMC@NLO
Assignee: No assignee
Last query: 2017-11-10
Last reply: 2017-11-10

Hi,

> Plugin PLUGIN.mpi_plugin has marked as NOT being validated with this version.
> It has been validated for the last time with version: 2.5.3

I would not worry too much about this.

What is the content of the log file in the following directory:

> /GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/testrun/SubProcesses/P1_gg_ttxg/G1/

This should give you the hint of what the problem is.

Cheers,

Olivier
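The check suggested above (looking at what the job directories actually contain) can be sketched in a few lines of Python. This is only an illustration: the helper names describe_directory and find_empty_job_dirs are hypothetical and not part of MadGraph, and the only assumption about the layout is the one visible in the log, namely that each P* directory under SubProcesses holds job directories whose names start with "G".

```python
import os


def describe_directory(path):
    """Return a sorted list of (file name, size in bytes) for the files in `path`."""
    entries = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            entries.append((name, os.path.getsize(full)))
    return entries


def find_empty_job_dirs(subproc_dir):
    """Return the names of the G* subdirectories of `subproc_dir` that hold no files.

    Hypothetical helper: it only assumes that job directories start with "G".
    """
    empty = []
    for name in sorted(os.listdir(subproc_dir)):
        full = os.path.join(subproc_dir, name)
        if name.startswith("G") and os.path.isdir(full):
            if not describe_directory(full):
                empty.append(name)
    return empty
```

Calling find_empty_job_dirs on one of the P* directories would list the job directories in which the survey wrote nothing at all, which points at the submission step rather than at the integration itself.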

Amin Aboubrahim (amin83) said : #2

Hi,

This directory is empty, as are all the other G directories.

This is part of the content of run_01_tag_1_debug.log:

generate_events -f
Traceback (most recent call last):
  File "/GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/testrun/bin/internal/extended_cmd.py", line 1438, in onecmd
    return self.onecmd_orig(line, **opt)
  File "/GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/testrun/bin/internal/extended_cmd.py", line 1392, in onecmd_orig
    return func(arg, **opt)
  File "/GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/testrun/bin/internal/madevent_interface.py", line 2111, in do_generate_events
    postcmd=False)
  File "/GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/testrun/bin/internal/extended_cmd.py", line 1465, in exec_cmd
    stop = Cmd.onecmd_orig(current_interface, line, **opt)
  File "/GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/testrun/bin/internal/extended_cmd.py", line 1392, in onecmd_orig
    return func(arg, **opt)
  File "/GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/testrun/bin/internal/madevent_interface.py", line 2964, in do_survey
    cross, error = self.make_make_all_html_results()
  File "/GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/testrun/bin/internal/common_run_interface.py", line 672, in make_make_all_html_results
    return sum_html.make_all_html_results(self, folder_names, jobs)
  File "/GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/testrun/bin/internal/sum_html.py", line 742, in make_all_html_results
    Presults = collect_result(cmd, folder_names=folder_names, jobs=jobs)
  File "/GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/testrun/bin/internal/sum_html.py", line 710, in collect_result
    P_comb.add_results(name, pjoin(P_path,name,'results.dat'), mfactor)
  File "/GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/testrun/bin/internal/sum_html.py", line 412, in add_results
    oneresult.read_results(filepath)
  File "/GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/testrun/bin/internal/sum_html.py", line 306, in read_results
    self.xsec = data[:10]
ValueError: need more than 5 values to unpack

Thank you,
Amin

Hi,

Then you should check this script:
/GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/testrun/SubProcesses/survey.sh
to see if you can spot anything wrong with the way it submits jobs on your MPI machine.

If you don't, then you need to print the arguments with which that script is called
(by printing *args and **opts in the OneCore class),

and then run that script manually. If it crashes without producing anything useful, you have to edit the script to remove all log redirection to /dev/null or to a file.

Cheers,

Olivier
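The two debugging steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the plugin's actual code: log_call and run_visible are hypothetical names, the decorator would have to be applied by hand to whichever method of the OneCore class launches the job, and the sketch uses Python 3's subprocess.run, whereas the MadGraph release in this thread may run under an older Python.

```python
import functools
import subprocess


def log_call(func):
    """Print the positional and keyword arguments of every call to `func`."""
    @functools.wraps(func)
    def wrapper(*args, **opts):
        print("calling %s with args=%r opts=%r" % (func.__name__, args, opts))
        return func(*args, **opts)
    return wrapper


def run_visible(command, cwd=None):
    """Run `command` (a list of strings) and return (exit code, stdout, stderr).

    Nothing is sent to /dev/null, so any error message from the script
    or from mpirun is available to the caller.
    """
    proc = subprocess.run(
        command,
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout, proc.stderr
```

Once the printed arguments are known, the same command line can be passed to run_visible as a list, with cwd set to the directory the job would normally run in, and the returned stderr inspected for the real failure.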

Amin Aboubrahim (amin83) said : #4

Thank you Olivier.
I will follow your tips to try and make it work.

All the best,
Amin
