Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade Elasticsearch to 8 #33135

Merged
merged 7 commits into from Aug 8, 2023

Conversation

Owen-CH-Leung
Copy link
Contributor

@Owen-CH-Leung Owen-CH-Leung commented Aug 5, 2023

Fixes #31256

Upgrade the elasticsearch provider to use official elasticsearch package with version >8.

Remote logging with latest ES package should also work with this PR merged. To test locally, rebuild the breeze image with elasticsearch constraint set up elasticsearch>8,<9, set up elasticsearch 8 , kibana & filebeat in docker respectively, with the following changes to the airflow.cfg

[logging]
remote_logging = True

[elasticsearch]
json_format = True
host = http://host.docker.internal:9200
host_field = host.name
offset_field = log.offset

Note that with the latest es package, host field under elasticsearch has to be a full URL starting with http:// or https://. If you just put host.docker.internal:9200 that will not work

Fix failing Test Suit
@Owen-CH-Leung Owen-CH-Leung marked this pull request as ready for review August 5, 2023 09:15
@eladkal eladkal requested review from potiuk and eladkal August 5, 2023 20:44
@@ -2379,7 +2379,7 @@ elasticsearch:
elasticsearch_configs:
description: ~
options:
use_ssl:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In elasticsearch 7, use_ssl is an accepted parameter when constructing ElasticSearch client. See the following source code :

https://github.com/elastic/elasticsearch-py/blob/7.14/elasticsearch/client/__init__.py#L113

However, in elasticsearch 8, it no longer accepts use_ssl parameter. See the following source code:

https://github.com/elastic/elasticsearch-py/blob/8.9/elasticsearch/_sync/client/__init__.py#L129

Therefore to make the testsuite compile with the ES8 , I use http_compress as the argument (which is one of the accepted arguments for constructing ES client

@potiuk
Copy link
Member

potiuk commented Aug 6, 2023

Looks good, but it needs a bit "surrounding changes" and "context".

I think we need to think a bit carefully about the change/migration (configs?) -and soem changelog documentation should be addded explaining what's going on and explaining context.

The Elasticsearch config can. now be moved to provider - this is where it belongs. We should keep the pre-2.7 defaults in wiki.apache.org/confluence/display/AIRFLOW/AIP-36+DAG+Versioning but the whole "elasticsearch" config should be moved to provider.yaml - same as celery, kubernetes and others.

@Owen-CH-Leung
Copy link
Contributor Author

Looks good, but it needs a bit "surrounding changes" and "context".

I think we need to think a bit carefully about the change/migration (configs?) -and soem changelog documentation should be addded explaining what's going on and explaining context.

The Elasticsearch config can. now be moved to provider - this is where it belongs. We should keep the pre-2.7 defaults in wiki.apache.org/confluence/display/AIRFLOW/AIP-36+DAG+Versioning but the whole "elasticsearch" config should be moved to provider.yaml - same as celery, kubernetes and others.

Thanks. I'll some more contexts into changelog soon.

For moving elasticsearch config, do you mean the following lines should be moved to provider.yaml ?

https://github.com/apache/airflow/blob/main/airflow/config_templates/config.yml#L2396-L2410

@pankajkoti pankajkoti self-requested a review August 7, 2023 06:42
@potiuk
Copy link
Member

potiuk commented Aug 7, 2023

For moving elasticsearch config, do you mean the following lines should be moved to provider.yaml ?

https://github.com/apache/airflow/blob/main/airflow/config_templates/config.yml#L2396-L2410

Yes. See for example: #32777 - this is a new thing that has been added in Airflow 2.7 - providers can now contribute their own whole "sections" and they can be removed from generic "airflow" configuration.

Also, you will need to keep old "pre-2.7" defaults in https://github.com/apache/airflow/blob/main/airflow/config_templates/pre_2_7_defaults.cfg for back-compatibility (this one provides the default values for old providers that have not yet moved their config to ). The PR for Hive above does not have it yet as I realised it only afterwards. As you can see elasticsearch_configs is there, but I think - if we decide to also move "elasticsearch" to the provider, we need to add "elasticearch" to the defaults as well before 2.7.0 is released.

I will go ahead and add it now, otherwise some back-compatibility scenarios might not work (old provider assuming default value retrieved might stop working).

potiuk added a commit to potiuk/airflow that referenced this pull request Aug 7, 2023
The elasticsearch group is likely to be moved to elasticsearch
provider. Anticipating that (see apache#33135) we need to move it to
pre-2.7 defaults in order to have back-compatibility for providers
that assume default values to be there.
@Owen-CH-Leung
Copy link
Contributor Author

For moving elasticsearch config, do you mean the following lines should be moved to provider.yaml ?

https://github.com/apache/airflow/blob/main/airflow/config_templates/config.yml#L2396-L2410

Yes. See for example: #32777 - this is a new thing that has been added in Airflow 2.7 - providers can now contribute their own whole "sections" and they can be removed from generic "airflow" configuration.

Also, you will need to keep old "pre-2.7" defaults in https://github.com/apache/airflow/blob/main/airflow/config_templates/pre_2_7_defaults.cfg for back-compatibility (this one provides the default values for old providers that have not yet moved their config to ). The PR for Hive above does not have it yet as I realised it only afterwards. As you can see elasticsearch_configs is there, but I think - if we decide to also move "elasticsearch" to the provider, we need to add "elasticearch" to the defaults as well before 2.7.0 is released.

I will go ahead and add it now, otherwise some back-compatibility scenarios might not work (old provider assuming default value retrieved might stop working).

Thanks. I've added some contexts in CHANGELOG and also moved the config to provider.yaml

@@ -27,6 +27,17 @@
Changelog
---------

5.1.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
5.1.0
6.0.0

Since we are suggesting a breaking change, we need to have a major bump

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - will revise accordingly

self.client = elasticsearch.Elasticsearch(host.split(";"), **es_kwargs) # type: ignore[attr-defined]

self.client = elasticsearch.Elasticsearch(host, **es_kwargs) # type: ignore[attr-defined]
# in airflow.cfg, host of elasticsearch has to be http://dockerhostXxxx:9200
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I know what error do we see if the protocol is not included in the set value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In ES 8, when constructing a new ElasticSearch client, a full URL including http / https is expected, and you will get a ValueError when calling the __init__ like below:

image

@@ -292,27 +292,24 @@ def es_read(self, log_id: str, offset: int | str, metadata: dict) -> list | Elas
}

try:
max_log_line = self.client.count(index=self.index_patterns, body=query)["count"]
max_log_line = self.client.count(index=self.index_patterns, body=query)["count"] # type: ignore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have here a type ignore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So - if we look at the official ES package, the count API from the ElasticSearch class actually didn't have the body as parameter. See this link: https://github.com/elastic/elasticsearch-py/blob/main/elasticsearch/_sync/client/__init__.py#L801

But the body parameter is still accepted because there's a decorator at the beginning, which modifies the function to accept body as the argument. See this few lines : https://github.com/elastic/elasticsearch-py/blob/main/elasticsearch/_sync/client/__init__.py#L798-L800

Therefore, without type ignore, the pre-commit job mypy at provider will actually get failed because it thinks that body is not an accepted parameter (which actually is)


logs: list[Any] | ElasticSearchResponse = []
if max_log_line != 0:
try:
query.update({"sort": [self.offset_field]})
res = self.client.search(
res = self.client.search( # type: ignore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question regarding type ignore

index=self.index_patterns,
body=query,
size=self.MAX_LINE_PER_PAGE,
from_=self.MAX_LINE_PER_PAGE * self.PAGE,
)
logs = ElasticSearchResponse(self, res)
except elasticsearch.exceptions.ElasticsearchException:
self.log.exception("Could not read log with log_id: %s", log_id)
except Exception as err:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cannot we not narrow down the exception we catch? Is the previous exception no longer present? If so, have they added any other similar class and can we use that?

Having such a broad level exception catch and not re-raising it might lead to some silent failures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the exception elasticsearch.exceptions.ElasticsearchException is no longer present. Instead new class of exceptions are defined such as UnsupportedProductError, NotFoundError and so on. See this file:

https://github.com/elastic/elasticsearch-py/blob/main/elasticsearch/exceptions.py

And I feel like all those errors can occur when calling the ES API. So maybe we should raise the exception after logging to the error log ?

~~~~~~~~~~~~~~~~

.. note::
Upgrade to ElasaticSearch 8. The ElasticsearchTaskHandler & ElasticsearchSQLHook will now use ElasticSearch 8 package.
Copy link
Member

@pankajkoti pankajkoti Aug 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Upgrade to ElasaticSearch 8. The ElasticsearchTaskHandler & ElasticsearchSQLHook will now use ElasticSearch 8 package.
Upgrade to Elasticsearch 8. The ElasticsearchTaskHandler & ElasticsearchSQLHook will now use Elasticsearch 8 package.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Will revise accordingly

Comment on lines 38 to 39
This also means that the release drops support for ElasticSearch 7 and below and will no longer work with
ElasticSearch database that is below version 8.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This also means that the release drops support for ElasticSearch 7 and below and will no longer work with
ElasticSearch database that is below version 8.
This also means that the release drops support for Elasticsearch 7 and below and will no longer work with
Elasticsearch database that is below version 8.

Comment on lines 38 to 39
This also means that the release drops support for ElasticSearch 7 and below and will no longer work with
ElasticSearch database that is below version 8.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this PR is a breaking change?
Airflow wise what exactly gets broken here?
Updating upstream package version by itself is not a breaking change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry - what I'm trying to say is that this release will drop support for ElasticSearch database version 7 or below since elasticsearch 8 will not work with DB version below 8 (i.e. airflow users should also upgrade their ES database to version 8 to ensure compatibility). I wrote this because I read the following doc and learn that Elasticsearch language clients are only backwards compatible with default distributions and without guarantees made

https://elasticsearch-py.readthedocs.io/en/stable/

Airflow-wise, nothing should really get broken

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. @eladkal is right - dependency change is not a reason for major bump - even if it is such a "big" dependency. This very clear in https://semver.org/#what-should-i-do-if-i-update-my-own-dependencies-without-changing-the-public-api:

What should I do if I update my own dependencies without changing the public API?

That would be considered compatible since it does not affect the public API. Software that explicitly depends on the same dependencies as your package should have their own dependency specifications and the author will notice any conflicts. Determining whether the change is a patch level or minor level modification depends on whether you updated your dependencies in order to fix a bug or introduce new functionality. We would usually expect additional code for the latter instance, in which case it’s obviously a minor level increment.

potiuk added a commit that referenced this pull request Aug 7, 2023
The elasticsearch group is likely to be moved to elasticsearch
provider. Anticipating that (see #33135) we need to move it to
pre-2.7 defaults in order to have back-compatibility for providers
that assume default values to be there.
@potiuk
Copy link
Member

potiuk commented Aug 7, 2023

Thanks. I've added some contexts in CHANGELOG and also moved the config to provider.yaml

How about moving "elasticsearch" section as well? I seems very much like it should be in, this section makes no sense when elasticsearch provider is not installed I think?

@@ -72,3 +73,20 @@ connection-types:

logging:
- airflow.providers.elasticsearch.log.es_task_handler.ElasticsearchTaskHandler

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On top of also moving [elasticsearch] section (which I thin makes sense) it should be accompanied by adding the link to elastisearch provider configuration in the Airflow documentation: (following #32777).

When we have more of those, we might want to choose to do it automatically but for now we need to add it "manually"

… add configurations-ref.rst, and revise the doc
@Owen-CH-Leung
Copy link
Contributor Author

Thanks all. I've moved elasticsearch section to provider.yaml, revise the descriptions in CHANGELOG and also revised the docs. For now I keep the release version as 5.1.0 but feel free to let me know if we want to make it a major bump

@potiuk
Copy link
Member

potiuk commented Aug 8, 2023

Looks cool from my side. Thanks @Owen-CH-Leung for being so responsive to our comments :)

Comment on lines 31 to 34
.....

Misc
~~~~
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove above including the version.
It is not known yet which version is going to be next.. it could be 5.0.1, 5.1.0, 6.0.0
we know the release version only in run time.

The change we need here is just the note after the changelog title.. during release the release manager set it to the proper classification

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Removed.

@Owen-CH-Leung
Copy link
Contributor Author

I think @pankajkoti is also testing on ES remote logging with ES 8. Let's merge the PR when he comes back to us

@eladkal eladkal requested a review from pankajkoti August 8, 2023 13:35
@pankajkoti
Copy link
Member

I can confirm ES remote logging works fine for me from this branch and Elasticsearch 8.9.0 🎉 . Thanks for your patience @Owen-CH-Leung

@eladkal
Copy link
Contributor

eladkal commented Aug 8, 2023

Cool!
Will merge when CI is green

@potiuk
Copy link
Member

potiuk commented Aug 8, 2023

I can confirm ES remote logging works fine for me from this branch and Elasticsearch 8.9.0 🎉 . Thanks for your patience @Owen-CH-Leung

Fantastic news !

@potiuk potiuk merged commit ad9d8d4 into apache:main Aug 8, 2023
61 checks passed
@potiuk
Copy link
Member

potiuk commented Aug 8, 2023

aaand merged.

ephraimbuddy pushed a commit that referenced this pull request Aug 8, 2023
The elasticsearch group is likely to be moved to elasticsearch
provider. Anticipating that (see #33135) we need to move it to
pre-2.7 defaults in order to have back-compatibility for providers
that assume default values to be there.

(cherry picked from commit 2d74604)
@ephraimbuddy ephraimbuddy added the type:misc/internal Changelog: Misc changes that should appear in change log label Aug 27, 2023
@ephraimbuddy ephraimbuddy added this to the Airflow 2.8.0 milestone Aug 27, 2023
@potiuk potiuk modified the milestones: Airflow 2.8.0, Airflow 2.7.1 Sep 1, 2023
ephraimbuddy pushed a commit that referenced this pull request Sep 2, 2023
(cherry picked from commit ad9d8d4)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:logging area:providers provider:elasticsearch type:misc/internal Changelog: Misc changes that should appear in change log
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Migrate elasticsearch client to 8.* (or latest 7.*)
5 participants