Producing MicroAODs for data

General instructions on microAOD production can be found in the repository README and in the flashgg documentation. This page focuses on the production of microAODs for data.

  1. Environment

    cmsenv
    CAMPAIGN=EXOSpring16_v1
    PART=<incremental number>
    MY_GRID_USER=`whoami` # or your grid user name, if it differs from your unix account name
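    ## optional sanity check (requires CRAB, set up in step 4):
    ## crab checkusername   # prints the grid user name known to CRAB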
    
  2. Make a snapshot of the existing dataset catalog.

    TMP_CATALOG=${CAMPAIGN}_p${PART}_temp
    fggManageSamples.py -m diphotons -C ${TMP_CATALOG}  catimport diphotons:${CAMPAIGN} \*Run2016\*
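    
    ## optional: the snapshot is a plain datasets.json under the catalog
    ## directory (path assumed from the layout used in step 12); a quick
    ## sanity check is to count the imported datasets
    python -c "import json; print(len(json.load(open('${CMSSW_BASE}/src/diphotons/MetaData/data/${TMP_CATALOG}/datasets.json'))))"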
    
  3. Prepare work area in flashgg

    cd ${CMSSW_BASE}/src/flashgg/MetaData/work
    ln -sf  ${CMSSW_BASE}/src/diphotons/MetaData/work/analysis_microAOD.py .
    # allow running on datasets not yet marked VALID in DBS
    cat crabConfig_TEMPLATE.py > mycrabConfig_TEMPLATE.py
    cat >> mycrabConfig_TEMPLATE.py << EOF
    
    config.Data.allowNonValidInputDataset=True
    EOF
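    
    For reference, allowNonValidInputDataset is a standard CRAB3 option; the line appended above is equivalent to setting it on a config built with the CRAB client API. A minimal sketch (the input dataset is just a placeholder):
    
    # minimal CRAB3 config sketch, not the full template
    from CRABClient.UserUtilities import config
    config = config()
    config.Data.inputDataset = '/SinglePhoton/Run2016B-PromptReco-v2/MINIAOD'  # placeholder
    config.Data.allowNonValidInputDataset = True  # accept datasets not yet marked VALID in DBS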
    
  4. Set-up crab

    source /cvmfs/cms.cern.ch/crab3/crab.sh
    voms-proxy-init --voms cms --valid 168:00
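    
    ## verify that the proxy was created and check its remaining lifetime
    voms-proxy-info --all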
    
  5. Prepare a target JSON file for the processing.

    cd ${CMSSW_BASE}/src/flashgg/MetaData/work
    ./fggCookJson.py --field 3.8T --dqm-folder /afs/cern.ch/cms/CAF/CMSCOMM/COMM_DQM/certification/Collisions16/13TeV/ --bunch-space ''
    

This will produce files called myjson_DCSONLY3.8T-<RUNNUM>.txt and myjson_3.8T__<RUNNUM>.txt. The first contains the latest part of the DCS-only JSON, i.e. the runs coming after the last certified run number. The second is the logical OR (union) of the latest certification JSON and the first file.
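
The same union can be reproduced or inspected with the standard CMSSW compareJSON.py utility; the certification JSON name below is a placeholder:

    compareJSON.py --or <certification JSON> myjson_DCSONLY3.8T-<RUNNUM>.txt myjson_3.8T__<RUNNUM>.txt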

  6. Find out the list of lumi sections to be processed for each dataset.

    ## remove output of any previous iteration
    rm -f all_missing.json
    
    ./fggRollingDataset.py --target myjson_3.8T__\*.txt --dataset <dataset_1> --catalog diphotons:${TMP_CATALOG}
    ./fggRollingDataset.py --target myjson_3.8T__\*.txt --dataset <dataset_2> --catalog diphotons:${TMP_CATALOG}
    ...
    

    This will create a folder for each dataset, containing three files: target.json, processed.json and missing.json. The first contains the subset of the target JSON covered by the dataset, determined by restricting the target JSON to the range between the minimum and maximum run numbers in the dataset.
    The file all_missing.json is meant to be loaded by prepareCrabJobs.py and contains the list of datasets to be processed, each with its corresponding missing.json lumi mask.
    Note: the primary datasets to be processed are /SinglePhoton, /SingleElectron and /DoubleEG. Please double-check the list of secondary datasets in DAS (a query sketch is given below); currently it is Run2016B-PromptReco-v1 and Run2016B-PromptReco-v2.
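    
    A loop sketch for the per-dataset invocations, with illustrative dataset names (cross-check them in DAS first, e.g. with das_client.py):
    
    ## list matching datasets in DAS (query string is an example)
    das_client.py --query="dataset=/SinglePhoton/Run2016B-PromptReco-v*/MINIAOD" --limit=0
    
    ## run fggRollingDataset.py over all datasets (example list)
    for ds in /SinglePhoton/Run2016B-PromptReco-v2/MINIAOD \
              /SingleElectron/Run2016B-PromptReco-v2/MINIAOD \
              /DoubleEG/Run2016B-PromptReco-v2/MINIAOD; do
        ./fggRollingDataset.py --target myjson_3.8T__\*.txt --dataset ${ds} --catalog diphotons:${TMP_CATALOG}
    done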

  7. Edit all_missing.json, adding empty signal and background dataset lists "sig" : [], "bkg" : [] (see the sketch below).
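    
    After editing, the file should look schematically as below; the "data" key is an assumption about what fggRollingDataset.py writes, and the entries are elided:
    
    {
        "data" : [ "... entries written by fggRollingDataset.py ..." ],
        "sig" : [],
        "bkg" : []
    }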

  8. Prepare crab configurations

    ./prepareCrabJobs.py -L 10 -C ${CAMPAIGN}_p${PART} -s all_missing.json -p analysis_microAOD.py
    
  9. Launch production

    cd ${CAMPAIGN}_p${PART}
    parallel --ungroup 'crab sub {} | tee {}.log' ::: *.py # or explicit list of configs to run
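    
    ## if GNU parallel is not available, a plain shell loop does the same thing, just serially
    for cfg in *.py; do crab submit ${cfg} | tee ${cfg}.log; done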
    
  10. Monitor production and update the catalog, preparing a script to be run continuously in screen (a screen invocation sketch is given at the end of this step).

    echo ${CAMPAIGN}_p${PART}/crab_*/ > running_tasks.txt
    cat > mon.sh << EOF
    #!/bin/bash
    
    # import files from DBS
    fggManageSamples.py -m diphotons -C ${TMP_CATALOG} import '/*/*${MY_GRID_USER}*${CAMPAIGN}_p${PART}*/USER'
    
    # run check jobs
    fggManageSamples.py -m diphotons -C ${TMP_CATALOG} check -q 8nm
    
    # resubmit possibly failed jobs
    cat running_tasks.txt |  tr ' ' '\n' | parallel -j 6 'crab resubmit {}'
    EOF
    
    chmod 755 mon.sh
    
    while true; do ./mon.sh; sleep 360; done
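    
    ## alternatively, run the loop unattended in a detached screen session
    ## (session name is just an example)
    screen -dmS fggmon bash -c 'while true; do ./mon.sh; sleep 360; done'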
    
  11. Once production is over, import the new datasets into the catalog. Be aware that p3 has duplicates, so fggManageSamples.py will complain about them; answer "yes" to all requests in order to keep all the files.

    fggManageSamples.py -m diphotons -C ${CAMPAIGN}  catimport diphotons:${TMP_CATALOG} \*Run2016\*
    fggManageSamples.py -m diphotons -C ${CAMPAIGN}  check
    fggManageSamples.py -m diphotons -C ${CAMPAIGN} overlap /<dataset>*
    

    where <dataset> stands for DoubleEG, SinglePhoton and SingleElectron. Manually check that there are no overlaps between different parts (an empty JSON, '{}', should be produced for each comparison pair).
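    
    The overlap between any two lumi masks can also be checked directly with the standard CMSSW compareJSON.py utility; an empty output means no overlap (file names are illustrative):
    
    compareJSON.py --and <part1 lumi mask>.json <part2 lumi mask>.json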

  12. Commit the new catalog and make a pull request.

    cd ${CMSSW_BASE}/src/diphotons
    git checkout -b production_${CAMPAIGN}_p${PART}
    
    git add MetaData/data/${CAMPAIGN}/datasets.json
    git commit
    
    MY_GITHUB_NAME=$(git config --get user.github)
    git remote add ${MY_GITHUB_NAME} git@github.com:${MY_GITHUB_NAME}/diphotons.git
    git push -u ${MY_GITHUB_NAME} production_${CAMPAIGN}_p${PART}