feat(string_util): make ToLower Unicode-aware via utf8proc (2/2)#760

Open
goel-skd wants to merge 2 commits into apache:main from goel-skd:feat-613-unicode-lowercase

@goel-skd goel-skd commented Jun 19, 2026

Contributor

Replaces the ASCII-only StringUtils::ToLower with a Unicode-aware
implementation backed by utf8proc,
so case-insensitive name handling matches Iceberg Java's
toLowerCase(Locale.ROOT).

  • ToLower now lower-cases UTF-8 input using utf8proc simple (1:1) case
    mapping (e.g. CAFÉ → café, GROẞE → große). Invalid UTF-8 is
    returned unchanged rather than erroring.

  • EqualsIgnoreCase now compares the lowercased forms of both inputs, so it
    is case-insensitive for non-ASCII letters too.

  • ToUpper is intentionally left ASCII-only — it is not used for name
    matching.

  • utf8proc is wired into both the CMake (vendored via FetchContent / system
    package) and Meson (subprojects/utf8proc.wrap) builds.
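
For readers who want to see the shape of the ToLower and EqualsIgnoreCase bullets, here is a minimal, dependency-free sketch of the approach. It is not the code in this PR. It decodes each code point, applies a 1:1 lowercase mapping, re-encodes, and returns the input unchanged if decoding fails. `SimpleLower` is a hypothetical stand-in for the utf8proc lookup and only covers ASCII, Latin-1 capitals, and U+1E9E. `DecodeOne`, `AppendUtf8`, `ToLowerSketch`, and `EqualsIgnoreCaseSketch` are likewise illustrative names, not identifiers from the change.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical stand-in for a utf8proc simple-case lookup. Covers only ASCII,
// Latin-1 capitals (except U+00D7, the multiplication sign), and U+1E9E.
inline char32_t SimpleLower(char32_t cp) {
  if (cp >= U'A' && cp <= U'Z') return cp + 0x20;
  if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) return cp + 0x20;
  if (cp == 0x1E9E) return 0xDF;  // capital sharp s -> small sharp s
  return cp;
}

// Decodes one code point starting at s[i] and advances i past it. Returns
// nullopt for a bad lead byte, truncated or overlong sequence, or surrogate.
inline std::optional<char32_t> DecodeOne(std::string_view s, std::size_t& i) {
  const auto b0 = static_cast<unsigned char>(s[i]);
  std::size_t len = 0;
  char32_t cp = 0;
  char32_t min = 0;  // smallest code point allowed for this sequence length
  if (b0 < 0x80) { len = 1; cp = b0; min = 0; }
  else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
  else return std::nullopt;
  if (i + len > s.size()) return std::nullopt;
  for (std::size_t k = 1; k < len; ++k) {
    const auto b = static_cast<unsigned char>(s[i + k]);
    if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation byte
    cp = (cp << 6) | (b & 0x3F);
  }
  if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
    return std::nullopt;
  }
  i += len;
  return cp;
}

// Appends the UTF-8 encoding of cp to out.
inline void AppendUtf8(char32_t cp, std::string& out) {
  assert(cp <= 0x10FFFF);
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Lower-cases valid UTF-8 one code point at a time. Any decoding failure
// returns the original bytes untouched, mirroring the pass-through behavior
// described in the PR text.
inline std::string ToLowerSketch(std::string_view input) {
  std::string out;
  out.reserve(input.size());
  for (std::size_t i = 0; i < input.size();) {
    const auto cp = DecodeOne(input, i);
    if (!cp) return std::string(input);
    AppendUtf8(SimpleLower(*cp), out);
  }
  return out;
}

// Case-insensitive equality defined as equality of the lowercased forms.
inline bool EqualsIgnoreCaseSketch(std::string_view a, std::string_view b) {
  return ToLowerSketch(a) == ToLowerSketch(b);
}
```

Note that pass-through applies to the whole string in this sketch: a single malformed byte means even the ASCII letters in that input are left as they were.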

Testing

  • Added/updated string_util_test.cc: ToLowerUnicode, ToUpperAsciiOnly,
    and Unicode cases in EqualsIgnoreCase (including invalid-UTF-8
    pass-through).

Closes #613.

Follow-up to #748

@goel-skd goel-skd force-pushed the feat-613-unicode-lowercase branch from bec8884 to 69cc006 Compare June 19, 2026 01:40
Replace the ASCII-only ToLower with utf8proc simple case mapping so
case-insensitive name handling matches Iceberg Java's
toLowerCase(Locale.ROOT). ToUpper stays ASCII-only since it is not used
for name matching. EqualsIgnoreCase now compares lowercased forms.

Wire utf8proc into both the CMake (vendored/system) and Meson builds.

See apache#613.
@goel-skd goel-skd force-pushed the feat-613-unicode-lowercase branch from 69cc006 to f42e2da Compare June 19, 2026 02:13
target_include_directories(utf8proc::utf8proc INTERFACE ${utf8proc_SOURCE_DIR})
endif()

set(UTF8PROC_VENDORED TRUE)

Member

utf8proc is licensed under the permissive MIT License (along with some Unicode data under a similarly permissive license). We need to update the LICENSE file by appending a new section at the bottom, set off by a separator, similar to this:

---
This product bundles utf8proc, which is available under the MIT License:
Copyright © 2014-2021 by Steven G. Johnson, Jiahao Chen, Tony Kelman, Jonas Fonseca, and other contributors listed in the git history.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software... (include the rest of the utf8proc MIT license text here)

Contributor Author

Thanks @wgtmac. Good catch, let me update it.

Contributor Author

Done!

Development

Successfully merging this pull request may close these issues.

case insensitive field matching behavior different from iceberg-python and iceberg-java

2 participants