long run on cluster

Asked by lorenzo marafatto

Hi all,
I have a problem running a long process.
I use MadGraph, and running the simulation takes a day or more.
I set ServerAliveInterval 30 in .ssh\config,
but that is not enough: I sometimes get a broken pipe and then have to restart from the beginning.
What can I do?
Many thanks

Question information

Language:
English
Status:
Solved
For:
MadGraph5_aMC@NLO
Assignee:
No assignee
Solved by:
Olivier Mattelaer
Best Olivier Mattelaer (olivier-mattelaer) said :
#1

Hi,

I typically use one of the following solutions:
1) use "screen" to avoid this issue
2) submit the main job on the cluster itself, so that one core is dedicated to controlling the rest of the application
(in that case you have to request a single core for your main executable and make sure that your cluster allows compute nodes to submit jobs; typically this is allowed)
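
Solution 2 might look roughly like the sketch below. It assumes the cluster uses SLURM, which this thread does not confirm; other schedulers use different directives. The helper name write_main_job, the job name, the 48-hour time limit and the log file name are all placeholders chosen for illustration.

```shell
#!/bin/sh
# write_main_job JOBFILE COMMAND [ARGS...]
# Hypothetical helper: writes a SLURM job script that runs COMMAND
# (the controlling process) on a single core.
# Note: arguments are joined with spaces, so quoting inside them is not preserved.
write_main_job() {
    jobfile=$1
    shift
    cat > "$jobfile" <<EOF
#!/bin/sh
#SBATCH --job-name=mg5_main
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=48:00:00
#SBATCH --output=mg5_main_%j.log
$*
EOF
}
```

For example, "write_main_job mg5_main.sh ./bin/mg5_aMC my_commands.txt" followed by "sbatch mg5_main.sh" would submit the controlling process as a one-core job (my_commands.txt is an assumed command file). Because the job runs on the cluster and not inside the ssh session, a broken pipe on the login connection should not affect it. As noted above, this only works if the compute nodes are allowed to submit jobs.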

Other solutions:
1) use the nohup method
2) use tmux
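
As a minimal sketch of the nohup method: the function below starts a command in the background so that it ignores the hangup signal sent when the ssh connection drops, and sends all output to a log file. The function name run_detached and the file names in the usage comment are made up for illustration; the MadGraph command line shown there is an assumption about how the run is started.

```shell
#!/bin/sh
# run_detached LOGFILE COMMAND [ARGS...]
# Starts COMMAND in the background, immune to SIGHUP, and prints its PID.
run_detached() {
    log=$1
    shift
    # Read stdin from /dev/null and send stdout/stderr to the log,
    # so that nothing stays tied to the terminal of the ssh session.
    nohup "$@" > "$log" 2>&1 < /dev/null &
    echo "$!"
}

# Hypothetical usage (my_commands.txt is an assumed MadGraph command file):
#   pid=$(run_detached mg5_run.log ./bin/mg5_aMC my_commands.txt)
#   tail -f mg5_run.log    # follow progress; safe to interrupt or disconnect
```

With screen or tmux the idea is the same, except that the session can be reattached later: "screen -S mg5" (detach with Ctrl-a d, reattach with "screen -r mg5") or "tmux new -s mg5" (detach with Ctrl-b d, reattach with "tmux attach -t mg5"). The session name mg5 is arbitrary.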

Cheers,

Olivier

lorenzo marafatto (lmaraf) said :
#2

Thanks Olivier Mattelaer, that solved my question.

lorenzo marafatto (lmaraf) said (last edit ):
#3
lorenzo marafatto (lmaraf) said :
#4

I tried all of these approaches, but the last process was also stopped and screen was disconnected, so I have no result...
Any suggestion on how I could run a long process via ssh?
Thanks so much!
Lorenzo

On Tuesday, 17 January 2023 at 07:15:50 CET, lorenzo marafatto <email address hidden> wrote:

Your question #704421 on MadGraph5_aMC@NLO changed:
https://answers.launchpad.net/mg5amcnlo/+question/704421

lorenzo marafatto posted a new comment:
Hi,
I tried all of these approaches, but the program still closes after some time.
Maybe I could run it in batch mode somehow?
Many thanks again
Lorenzo Marafatto

On Monday, 16 January 2023 at 15:50:37 CET, Olivier Mattelaer <email address hidden> wrote:

Your question #704421 on MadGraph5_aMC@NLO changed:
https://answers.launchpad.net/mg5amcnlo/+question/704421

    Status: Open => Answered
