Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenMP Task Implementation #525

Open
eZWALT opened this issue Mar 30, 2024 · 11 comments
Open

OpenMP Task Implementation #525

eZWALT opened this issue Mar 30, 2024 · 11 comments

Comments

@eZWALT
Copy link
Contributor

eZWALT commented Mar 30, 2024

I'm opening this issue to consult some implementation details with the community before proceding with a possible pull request. I will add another class called CIRClauseProcessor to avoid repetition of clause treatment inside the OpenMP statements code generation and also some additional methods to generate the body to also avoid redundancy. I've gathered some inspiration by how OpenMP is being implemented in flang, given that they also use MLIR.

Furthermore, I have a local version that implements scopes and untied/mergeable clauses, but one important issue that I should address is that ClangIR has no support for integers of bit width 1, and I pretty much need them for some clauses inside of omp task such as final, if ...

mlir::omp::TaskOp operands and attributes:
image

I've tried some workarounds, such as one below, but still produces compilation errors due to this type not being supported on ClangIR.

  //auto uint1Ty = mlir::cir::IntType::get(builder.getContext(),/*Bit Width*/ 1,/*Signed?*/false ); 

The error i've been experiencing is the following (Obviously ClangIR only supports a bit width of 8,16,32...)
image

If i try to use mlir::IntegerType of bit width 1, I get the following error due to "cir.const" operation not supporting this type

image

So the question that is to be determined is: Should I implement support for 1 bit integers (I don't think this is a good approach though) or should I use mlir.const operations instead of cir.const (This may be problematic too, due to the mixture of mlir and cir). I'm not completely sure on how to proceed, so your experience and advice would be of great help, thank you!

@bcardosolopes
Copy link
Member

bcardosolopes commented Apr 3, 2024

Hi @eZWALT, these are great questions, since it exposes the challenges of mixing dialects. Can you write down what your input source file looks like and how you expect the dialect mix output to look like? (Or point out the options in this imaginary example?). It might be easier to discuss based on an actual mockup.

@eZWALT
Copy link
Contributor Author

eZWALT commented Apr 8, 2024

My apologies for not explaining with enough detail, I was facing errors due to using constants in compile-time, but now I'm facing issues related to the build operations returning CIR types instead of MLIR types .This is the basic processing of the clauses to get more context (Now outdated since I've built a CIRClauseProcessor class to avoid redundant code, you will get to see it in the PR):

if (S.hasClausesOfKind<OMPUntiedClause>())
    untiedAttr = builder.getUnitAttr();
if (S.hasClausesOfKind<OMPMergeableClause>())
    mergeableAttr = builder.getUnitAttr();

// Evaluates and set operands: the priority, if, and final clauses
if (S.hasClausesOfKind<OMPFinalClause>()) {
    const OMPFinalClause *finalClause = S.getSingleClause<OMPFinalClause>();
    const clang::Expr *condExpr = finalClause->getCondition();
    finalOperand = evaluateExprAsBool(condExpr);
}

if (S.hasClausesOfKind<OMPIfClause>()) {
    const OMPIfClause *ifClause = S.getSingleClause<OMPIfClause>();
    const clang::Expr *condExpr = ifClause->getCondition();
    ifOperand = evaluateExprAsBool(condExpr);
}

if (S.hasClausesOfKind<OMPPriorityClause>()) {
    const OMPPriorityClause *priorityClause = S.getSingleClause<OMPPriorityClause>();
    const clang::Expr *priorityExpr = priorityClause->getPriority();
    priorityOperand = buildScalarExpr(priorityExpr);
}

image

I'm reusing the already written code that generates the boolean and scalar expressions, such as ** buildScalarExpr ** and ** evaluateExprAsBool** , but I'm facing troubles with the operands' priority, if and final, since this must use mlir types and not cir types. The approach that I thought that can solve this problem would be to add a new conversion operation (I'm well aware after reading the codegen classes in further detail, that I could achieve this using a modified version of buildScalarConversion, given that it only supports conversion of Clang QualType's).

So I propose, for the sake of simplicity, extending or duplicating buildScalarConversion to support MLIR types. I've seen that Flang follows a similar approach, they have a conversion operation too, to cast these FIR types into MLIR, so that me think that this approach may be feasible. However, this has a downside, the instruction overhead, a new redundant instruction is created.

I lack sufficient experience with MLIR and familiarity with ClangIR as a project to confidently choose this direction. I'm open to exploring other approaches that might be equally effective or even better suited to address my issue. I would greatly appreciate hearing the opinions of @fabianmcg and @kiranchandramohan on this matter. Thank you to everyone for your input!"

@bcardosolopes
Copy link
Member

The approach that I thought that can solve this problem would be to add a new conversion operation

That's a possibility, are you aware of https://mlir.llvm.org/docs/Dialects/Builtin/#builtinunrealized_conversion_cast-unrealizedconversioncastop ?

I lack sufficient experience with MLIR and familiarity with ClangIR as a project to confidently choose this direction

Can you paste the original source file you are trying to build (I'm assuming it's a very basic one) and write down what CIR do you imagine being generated? I think I can provide more design help or insights if I can look at that.

@eZWALT
Copy link
Contributor Author

eZWALT commented Apr 21, 2024

I wasn't aware of this unrealized conversion, now after implementing it and testing it I can say that this is far more practical and cleaner than building a new conversion operation from scratch. Now, I will put forward the original source file that I'm using as a basic example for testing the final clause (Remember that the treatment of final, if and priority clauses is symmetrical):

int main(){
        int b = 3;
        int a = -3;
        #pragma omp parallel 
        {
                int condition = (a+b) == 0;
                #pragma omp task final(condition)
                {
                        int a = 33;
                        int * c = &a;
                }
        }
}

The output after compiling this file with the basic flags (-fopenmp -fclangir-enable -emit-cir) is the following (which is almost the same that I imagined on my head when writing the code generation):

  cir.func no_proto @main() -> !s32i extra(#fn_attr) {
    %0 = cir.alloca !s32i, cir.ptr <!s32i>, ["__retval"] {alignment = 4 : i64} loc(#loc2)
    %1 = cir.alloca !s32i, cir.ptr <!s32i>, ["b", init] {alignment = 4 : i64} loc(#loc23)
    %2 = cir.alloca !s32i, cir.ptr <!s32i>, ["a", init] {alignment = 4 : i64} loc(#loc24)
    %3 = cir.const(#cir.int<3> : !s32i) : !s32i loc(#loc4)
    cir.store %3, %1 : !s32i, cir.ptr <!s32i> loc(#loc23)
    %4 = cir.const(#cir.int<3> : !s32i) : !s32i loc(#loc6)
    %5 = cir.unary(minus, %4) : !s32i, !s32i loc(#loc7)
    cir.store %5, %2 : !s32i, cir.ptr <!s32i> loc(#loc24)
    omp.parallel {
      cir.scope {
        %7 = cir.alloca !s32i, cir.ptr <!s32i>, ["condition", init] {alignment = 4 : i64} loc(#loc26)
        %8 = cir.load %2 : cir.ptr <!s32i>, !s32i loc(#loc12)
        %9 = cir.load %1 : cir.ptr <!s32i>, !s32i loc(#loc13)
        %10 = cir.binop(add, %8, %9) : !s32i loc(#loc27)
        %11 = cir.const(#cir.int<0> : !s32i) : !s32i loc(#loc11)
        %12 = cir.cmp(eq, %10, %11) : !s32i, !s32i loc(#loc28)
        cir.store %12, %7 : !s32i, cir.ptr <!s32i> loc(#loc26)
        %13 = cir.load %7 : cir.ptr <!s32i>, !s32i loc(#loc15)
        %14 = cir.cast(int_to_bool, %13 : !s32i), !cir.bool loc(#loc15)
        %15 = builtin.unrealized_conversion_cast %14 : !cir.bool to i1 loc(#loc29)
        omp.task final(%15) {
          cir.scope {
            %16 = cir.alloca !s32i, cir.ptr <!s32i>, ["a", init] {alignment = 4 : i64} loc(#loc30)
            %17 = cir.alloca !cir.ptr<!s32i>, cir.ptr <!cir.ptr<!s32i>>, ["c", init] {alignment = 8 : i64} loc(#loc31)
            %18 = cir.const(#cir.int<33> : !s32i) : !s32i loc(#loc19)
            cir.store %18, %16 : !s32i, cir.ptr <!s32i> loc(#loc30)
            cir.store %16, %17 : !cir.ptr<!s32i>, cir.ptr <!cir.ptr<!s32i>> loc(#loc31)
          } loc(#loc29)
          omp.terminator loc(#loc17)
        } loc(#loc29)
      } loc(#loc25)
      omp.terminator loc(#loc9)
    } loc(#loc25)
    %6 = cir.load %0 : cir.ptr <!s32i>, !s32i loc(#loc2)
    cir.return %6 : !s32i loc(#loc2)
  } loc(#loc22)
} loc(#loc)

Now the only issue is that if I want to compile to a backend, for example to LLVM, I get a crash (Makes sense given that I haven't implemented how this conversion is actually performed) like the following:

clang-19 -fopenmp -fclangir-enable  -S -emit-llvm 9_task_final.c -o -
loc(fused["9_task_final.c":8:3, "9_task_final.c":8:36]): error: failed to legalize operation 'builtin.unrealized_conversion_cast' that was explicitly marked illegal
fatal error: error in backend: The pass manager failed to lower CIR to LLVMIR dialect!

So my right now I'm kind of clueless of how could I implement this lowering, or maybe I misunderstood how unrealized_conversion_cast works and it serves as a placeholder. Thanks for reading and I'm eager to hear your feedback!

@bcardosolopes
Copy link
Member

Thanks for pasting the example, I think it makes sense too.

So my right now I'm kind of clueless of how could I implement this lowering, or maybe I misunderstood how unrealized_conversion_cast works and it serves as a placeholder.

It's possible that this requires adding something like target->addLegalOp<UnrealizedConversionCastOp>();, as I see others doing (e.g. MathToSPIRVPass.cpp, etc). Once that's done, my expectation is that this op will become legal and will show up in the final LLVM IR dialect, with all types in terms of LLVM, and later folded away.

@eZWALT
Copy link
Contributor Author

eZWALT commented Apr 27, 2024

Great, then I will try it out today later on. Now moving up to dependency management inside the task directive, I need some way to get the address of a local/global variable inside a mlir::Value, in order to process dependencies between tasks. Being more specific, I require a way to pragma omp task depend(in: X) depend(out: Y) get the addresses of variables X and Y. I' aware that CIR has this operation "get_global" (https://llvm.github.io/clangir/Dialect/ops.html#cirget_global-cirgetglobalop) which returns the address that points to the specified symbol. To complete dependency management, I would need to perform a similar operation. I'm saying this because I guess that the static information that I can get from the AST DeclRefExpr is not enough. Below, I've attached the lambda function that I'm trying to build to get both the type and address of the depend clause:

    [&](const clang::OMPDependClause* clause){
      //Get the depend type
      mlir::omp::ClauseTaskDependAttr dependType = getDependKindAttr(
        this->builder, clause
      )
      //Get an mlir value of the address of the depend variable
      const mlir::Value variable = builder.create<cir::get_local> ???
      result.dependVars.append(variable);
      result.dependTypeAttrs.append(dependType);
    }

So my question is, how could I achieve this? Any ideas are welcome!

@bcardosolopes
Copy link
Member

bcardosolopes commented Apr 29, 2024

I require a way to pragma omp task depend(in: X) depend(out: Y) get the addresses of variables X and Y.

Are this required to be globals? or could it be address coming from local variables?

I suggest you write code in C/C++ and check how we currently do things in lib/CIR/CodeGen by running a simple test under the debugger.

Btw, this is an example of a global lowered to CIR: https://godbolt.org/z/6rs83saes. You could also grep the codebase for the operation name to see how it's usually constructed (.e.g git grep GetGlobalOp)

@eZWALT
Copy link
Contributor Author

eZWALT commented May 1, 2024

No, these variables can be both local or global, so that's why I was asking for a way to achieve the same as the GetGlobalOp for local variables or any kind of variable. But if there isn't a straightforward way to do this with operations actually, I will look up the actual implementation of GetGlobalOp, thanks !

@bcardosolopes
Copy link
Member

Local variables already have their value produced by cir.alloca and/or a series of casts that might get in the middle. Can you point me to the OG LLVM code you are trying to replicate here? Depending on what gets called you could call the same method and get transparently solved by other pre-existing CIRGen component

@eZWALT
Copy link
Contributor Author

eZWALT commented May 7, 2024

Okay, so take for instance this basic sample of task dependency usage:

int main(){
	int x = 5;
	int y = 413;

	#pragma omp parallel 
	{
		#pragma omp task depend(out: x,y)
		{
			++x;
		}
		#pragma omp task depend(in: x,y)
		{
			--x;
		}
		int * c = &x;
	}
}

Now, looking at the outlined main of its LLVM counterpart after compiling:

; Function Attrs: noinline norecurse nounwind optnone uwtable
define internal void @main.omp_outlined(ptr noalias noundef %.global_tid., ptr noalias noundef %.bound_tid., ptr noundef nonnull align 4 dereferenceable(4) %x, ptr noundef nonnull align 4 dereferenceable(4) %y) #1 {
entry:
  %.global_tid..addr = alloca ptr, align 8
  %.bound_tid..addr = alloca ptr, align 8
  %x.addr = alloca ptr, align 8
  %y.addr = alloca ptr, align 8
  %agg.captured = alloca %struct.anon, align 8
  %.dep.arr.addr = alloca [2 x %struct.kmp_depend_info], align 8
  %dep.counter.addr = alloca i64, align 8
  %agg.captured1 = alloca %struct.anon.0, align 8
  %.dep.arr.addr2 = alloca [2 x %struct.kmp_depend_info], align 8
  %dep.counter.addr3 = alloca i64, align 8
  %c = alloca ptr, align 8
  store ptr %.global_tid., ptr %.global_tid..addr, align 8
  store ptr %.bound_tid., ptr %.bound_tid..addr, align 8
  store ptr %x, ptr %x.addr, align 8
  store ptr %y, ptr %y.addr, align 8
  %0 = load ptr, ptr %x.addr, align 8
  %1 = load ptr, ptr %y.addr, align 8
  %2 = getelementptr inbounds %struct.anon, ptr %agg.captured, i32 0, i32 0
  store ptr %0, ptr %2, align 8
  %3 = load ptr, ptr %.global_tid..addr, align 8
  %4 = load i32, ptr %3, align 4
  %5 = call ptr @__kmpc_omp_task_alloc(ptr @1, i32 %4, i32 1, i64 40, i64 8, ptr @.omp_task_entry.)
  %6 = getelementptr inbounds %struct.kmp_task_t_with_privates, ptr %5, i32 0, i32 0
  %7 = getelementptr inbounds %struct.kmp_task_t, ptr %6, i32 0, i32 0
  %8 = load ptr, ptr %7, align 8
  call void @llvm.memcpy.p0.p0.i64(ptr align 8 %8, ptr align 8 %agg.captured, i64 8, i1 false)
  %9 = getelementptr inbounds [2 x %struct.kmp_depend_info], ptr %.dep.arr.addr, i64 0, i64 0
  %10 = ptrtoint ptr %0 to i64
  %11 = getelementptr %struct.kmp_depend_info, ptr %9, i64 0
  %12 = getelementptr inbounds %struct.kmp_depend_info, ptr %11, i32 0, i32 0
  store i64 %10, ptr %12, align 8
  %13 = getelementptr inbounds %struct.kmp_depend_info, ptr %11, i32 0, i32 1
  store i64 4, ptr %13, align 8
  %14 = getelementptr inbounds %struct.kmp_depend_info, ptr %11, i32 0, i32 2
  store i8 3, ptr %14, align 8
  %15 = ptrtoint ptr %1 to i64
  %16 = getelementptr %struct.kmp_depend_info, ptr %9, i64 1
  %17 = getelementptr inbounds %struct.kmp_depend_info, ptr %16, i32 0, i32 0
  store i64 %15, ptr %17, align 8
  %18 = getelementptr inbounds %struct.kmp_depend_info, ptr %16, i32 0, i32 1
  store i64 4, ptr %18, align 8
  %19 = getelementptr inbounds %struct.kmp_depend_info, ptr %16, i32 0, i32 2
  store i8 3, ptr %19, align 8
  store i64 2, ptr %dep.counter.addr, align 8
  %20 = call i32 @__kmpc_omp_task_with_deps(ptr @1, i32 %4, ptr %5, i32 2, ptr %9, i32 0, ptr null)
  %21 = getelementptr inbounds %struct.anon.0, ptr %agg.captured1, i32 0, i32 0
  store ptr %0, ptr %21, align 8
  %22 = call ptr @__kmpc_omp_task_alloc(ptr @1, i32 %4, i32 1, i64 40, i64 8, ptr @.omp_task_entry..2)
  %23 = getelementptr inbounds %struct.kmp_task_t_with_privates.1, ptr %22, i32 0, i32 0
  %24 = getelementptr inbounds %struct.kmp_task_t, ptr %23, i32 0, i32 0
  %25 = load ptr, ptr %24, align 8
  call void @llvm.memcpy.p0.p0.i64(ptr align 8 %25, ptr align 8 %agg.captured1, i64 8, i1 false)
  %26 = getelementptr inbounds [2 x %struct.kmp_depend_info], ptr %.dep.arr.addr2, i64 0, i64 0
  %27 = ptrtoint ptr %0 to i64
  %28 = getelementptr %struct.kmp_depend_info, ptr %26, i64 0
  %29 = getelementptr inbounds %struct.kmp_depend_info, ptr %28, i32 0, i32 0
  store i64 %27, ptr %29, align 8
  %30 = getelementptr inbounds %struct.kmp_depend_info, ptr %28, i32 0, i32 1
  store i64 4, ptr %30, align 8
  %31 = getelementptr inbounds %struct.kmp_depend_info, ptr %28, i32 0, i32 2
  store i8 1, ptr %31, align 8
  %32 = ptrtoint ptr %1 to i64
  %33 = getelementptr %struct.kmp_depend_info, ptr %26, i64 1
  %34 = getelementptr inbounds %struct.kmp_depend_info, ptr %33, i32 0, i32 0
  store i64 %32, ptr %34, align 8
  %35 = getelementptr inbounds %struct.kmp_depend_info, ptr %33, i32 0, i32 1
  store i64 4, ptr %35, align 8
  %36 = getelementptr inbounds %struct.kmp_depend_info, ptr %33, i32 0, i32 2
  store i8 1, ptr %36, align 8
  store i64 2, ptr %dep.counter.addr3, align 8
  %37 = call i32 @__kmpc_omp_task_with_deps(ptr @1, i32 %4, ptr %22, i32 2, ptr %26, i32 0, ptr null)
  store ptr %0, ptr %c, align 8
  ret void
}

Despite the differences with original Clang code generation (The capture list variables addresses are placed inside an struct which will be passed down to the openmp calls), I want to do something similar with MLIR. Basically, I need to obtain the addresses of both X,Y variables. After looking at the Clang AST produced (Compiler Explorer AST ), I've concluded that I should do a "load address"-like operation (&x), which I think it's LValue expr in C++ parlance (Correct me if I'm wrong, probably I'm and its far more complex).

Looking at the Code Gen functions that are currently supported in CIR (consulting CIRGenFunction.h), I was looking to a load operation and came to the conclusion that a combination of buildLValue and some other operation like buildLoadOfLValue that returns an mlir::Value of the address of the LValue. I'm aware that the last operation i've mentioned actually returns an RValue. Maybe I could statically assign the value of lvalue.getAddress() to an mlir value, but I don't think that would be correct. I'm probably missing something because I'm overcomplicating a simple address access.

Thank you and sorry for the long message!

@eZWALT
Copy link
Contributor Author

eZWALT commented May 13, 2024

Gentle ping 👉@bcardosolopes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants