Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ext/bcmath: Changed the bcmul calculation method #14213

Merged
merged 15 commits into from
May 20, 2024

Conversation

SakiTakamachi
Copy link
Member

@SakiTakamachi SakiTakamachi commented May 13, 2024

There are still ways to make it faster, but since it would complicate the PR, I will split it into a follow-up PR (#14229).

The difference is quite large, so I will only post rough measurements.

Bench code:

<?php

// 1
$num1 = '1.2345678';
$num2 = '2.1234567';

$start = microtime(true);
for ($i = 0; $i < 10000000; $i++) {
    bcmul($num1, $num2, 20);
}
echo microtime(true) - $start . PHP_EOL;


// 2
$num1 = str_repeat('1234567890', 300);
$num2 = str_repeat('9876543210', 300);

$start = microtime(true);
for ($i = 0; $i < 1000; $i++) {
    bcmul($num1, $num2, 20);
}
echo microtime(true) - $start . PHP_EOL;


// 3
$num1 = '1.2345678901234567890';
$num2 = '2.12345678901234567890';

$start = microtime(true);
for ($i = 0; $i < 10000000; $i++) {
    bcmul($num1, $num2, 20);
}
echo microtime(true) - $start . PHP_EOL;


// 4
$num1 = '-12345678901234567890.12345678901234567890';
$num2 = '-212345678901234567890.12345678901234567890';

$start = microtime(true);
for ($i = 0; $i < 2000000; $i++) {
    bcmul($num1, $num2, 20);
}
echo microtime(true) - $start . PHP_EOL;


// 5
$num1 = '12345678901234567890.00000000000000000000000000000000000000000000000000';
$num2 = '212345678901234567890.00000000000000000000000000000000000000000000000000';

$start = microtime(true);
for ($i = 0; $i < 10000000; $i++) {
    bcmul($num1, $num2, 20);
}
echo microtime(true) - $start . PHP_EOL;

before

1.6423609256744
1.0373518466949
3.7979400157928
2.4885659217834
4.256236076355

after

1.2081251144409
0.096983194351196
1.9056849479675
0.60840702056885
2.0678749084473

@SakiTakamachi
Copy link
Member Author

CI is red

Multiplication is performed after converting to unsigned long, resulting in faster calculations.
@SakiTakamachi
Copy link
Member Author

I need to change some of the constant names and comments.

@SakiTakamachi
Copy link
Member Author

done!

@SakiTakamachi
Copy link
Member Author

The red one in CI is openssl.

@SakiTakamachi
Copy link
Member Author

There were some corrections that were omitted, so I corrected the variable names and comments. I think everything else is probably fine.

@SakiTakamachi SakiTakamachi marked this pull request as draft May 14, 2024 13:29
@SakiTakamachi SakiTakamachi marked this pull request as ready for review May 14, 2024 13:30
Copy link
Member

@nielsdos nielsdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. Some comments:

BC_UINT_T num = 0;
BC_UINT_T base = 1;

for (size_t i = 0; i < len; i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I once shared a fast way of doing this that I had implemented in a WIP commit: nielsdos@20783c3
Original source: https://kholdstare.github.io/technical/2020/05/26/faster-integer-parsing.html
The article claims it is faster than a regular for loop that multiplies by 10.
(note: I wouldn't use the SSE SIMD approach from the article, just the 4/8 byte fast non-SIMD approach)
However, let's not do that in this PR because this PR is large enough as-is.

Let u = u0 + u1*(b^n)
Let v = v0 + v1*(b^n)
Then uv = (B^2n+B^n)*u1*v1 + B^n*(u1-u0)*(v0-v1) + (B^n+1)*u0*v0
BC_UINT_T *buf = emalloc((n1_arr_size + n2_arr_size + prod_arr_size) * sizeof(BC_UINT_T));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allocations with multiplications shouldn't be open-coded, because they risk getting overflown.
Use safe_emalloc here please.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it

}
*pvptr-- = sum % BASE;
sum = sum / BASE;
while (pend >= pptr) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also had a faster way than this of writing the BCD representation back. See nielsdos@20783c3#diff-5550a7de511402b266a0fc932bd11298d2e02ef720879507b89651add38f380dR177
But we should keep that as a follow-up imo.

_bc_rec_mul(u1, u1->n_len, v1, v1->n_len, &m1);
/* Convert n2 to uint[] */
i = 0;
while (n2len > 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, we're doing the same for n1, would be great if this was factored out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed it

char *pend = pptr + prodlen - 1;
i = 0;
while (i < prod_arr_size - 1) {
for (size_t j = 0; j < BC_MUL_UINT_DIGITS; j++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also an opportunity for the faster BCD write method for follow-up.


/* Do the multiply */
_bc_rec_mul(n1, len1, n2, len2, &pval);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing we're going to lose here now is the divide-and-conquer approach for very large numbers, and I think we should probably keep that. WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Sorry, I should have explained this in the PR description.

The existing implementation uses divide-and-conquer when the number of digits is 80 or more.

This is compared in test case 2, and clearly this PR is faster, so it's not worth keeping the original implementation as is.

However, combining this PR approach with a divide and conquer approach has shown, in my local testing, to probably be nearly twice as effective.

Therefore, we should use the divide and conquer method (commonly called the Karatsuba algorithm in Japanese), but probably separate PRs.

ext/bcmath/libbcmath/src/recmul.c Outdated Show resolved Hide resolved
SakiTakamachi and others added 2 commits May 15, 2024 08:08
Co-authored-by: Niels Dossche <7771979+nielsdos@users.noreply.github.com>
SakiTakamachi and others added 3 commits May 15, 2024 19:36
Co-authored-by: Gina Peter Banyard <girgias@php.net>
Copy link
Member

@nielsdos nielsdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it's a good thing I allocated time for this, took me a while to think about this.
It looks right except for some small nits.

ext/bcmath/libbcmath/src/recmul.c Outdated Show resolved Hide resolved
*/
static void _bc_rec_mul(bc_num u, size_t ulen, bc_num v, size_t vlen, bc_num *prod)
/*
* If the n_values ​​of n1 and n2 are both 4 (32-bit) or 8 (64-bit) digits or less,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see in my IDE there's zero-width-spaces on this line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that Google Translate mixed it up. Fixed.

* e.g. 12345678901234567890 => {34567890, 56789012, 1234}
*
* Multiply and add these groups of numbers to perform multiplication fast.
* How much to shift the digits when adding values ​​can be calculated from the index of the array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also zero-width-spaces on this line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


bc_num d1 = bc_sub(u1, u0, 0);
bc_num d2 = bc_sub(v0, v1, 0);
BC_UINT_T *buf = safe_emalloc(n1_arr_size + n2_arr_size + prod_arr_size, sizeof(BC_UINT_T), 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this deserves a comment to explain why the addition can't overflow in practice.
I.e. let's say that N is the max of n1len and n2len (and a multiple of BC_MUL_UINT_DIGITS for simplicity), then this sum is <= N/BC_MUL_UINT_DIGITS + N/BC_MUL_UINT_DIGITS + N/BC_MUL_UINT_DIGITS + N/BC_MUL_UINT_DIGITS - 1 which is equal to N - 1 if BC_MUL_UINT_DIGITS is 4, and N/2 - 1 if BC_MUL_UINT_DIGITS is 8.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thx!

SakiTakamachi and others added 3 commits May 16, 2024 07:44
Co-authored-by: Niels Dossche <7771979+nielsdos@users.noreply.github.com>
Copy link
Member

@nielsdos nielsdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems right. Maybe @Girgias wants to have a final look?

Copy link
Member

@Girgias Girgias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me

@SakiTakamachi SakiTakamachi merged commit 1d38656 into php:master May 20, 2024
8 of 10 checks passed
@SakiTakamachi SakiTakamachi deleted the refactor_bcmath_mul branch May 20, 2024 11:29
@iluuu1994
Copy link
Member

This broken 32-bit builds, see https://github.com/php/php-src/actions/runs/9183895160/job/25255358600. @SakiTakamachi Please have a look.

@SakiTakamachi
Copy link
Member Author

Okay, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants