Results 1 to 3 of 3

Thread: SELECT query on string field that treat dash char - as a space

  1. #1
    Join Date
    Apr 2021
    Posts
    1
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Default SELECT query on string field that treat dash char - as a space

    I have a table in which a VARCHAR(255) field could have "-" dahs char used as a blank space because it reports chemical compounds retrieved from external DBs. The same compounds could have been written without dash char(s) and a blank space instead of them or a subset of them. For instance, the compound "2-hydroxy-propionic acid" could have been written a4

    1. "2-hydroxy propionic acid"
    2. "2 hydroxy-propionic acid"
    3. "2 hydroxy propionic acid"



    All of them are proper writings and they refer to the same compound. Is there an elegant way to SELECT the stored field "2-hydroxy-propionic acid" if the search string is one of the other 3? That is just an example, the dash char could be just one or even more than 2.
    Last edited by reyhan; 04-30-2021 at 03:18 PM.

  2. #2
    Join Date
    Jan 2007
    Location
    Davenport, Iowa
    Posts
    2,405
    Thanks
    101
    Thanked 114 Times in 112 Posts

    Default

    There are a few ways to do this, but the best that I can think of is with a simple regular expression:

    Code:
    SELECT ID, chemical from table WHERE summary REGEXP '2( |-)hydroxy( |-)propionic( |-)acid'
    To choose the lesser of two evils is still to choose evil. My personal site

  3. #3
    Join Date
    Mar 2011
    Posts
    2,169
    Thanks
    59
    Thanked 120 Times in 116 Posts
    Blog Entries
    4

    Default

    Quote Originally Posted by reyhan View Post
    I have a table in which a VARCHAR(255) field could have "-" dahs char used as a blank space because it reports chemical compounds retrieved from external DBs. The same compounds could have been written without dash char(s) and a blank space instead of them or a subset of them. For instance, the compound "2-hydroxy-propionic acid" could have been written

    1. "2-hydroxy propionic acid"
    2. "2 hydroxy-propionic acid"
    3. "2 hydroxy propionic acid"



    All of them are proper writings and they refer to the same compound. Is there an elegant way to SELECT the stored field "2-hydroxy-propionic acid" if the search string is one of the other 3? That is just an example, the dash char could be just one or even more than 2.

    Hi reyhan, and welcome to the forums! The answer to your question depends on several factors, so I’ve provided some different potential solutions for you below. If you can specify what environment you’re working in (what programming language, what database type), you may be able to get more specific advice. Are you trying to build a system that will allow people to search for compounds? If not, use a query with regex, such as what James provided.

    Options:
    1. Modify the contents of your database [most efficient]
    2. Modify the contents of your database, but leave the original values intact, and add an index on the [only use if you want the original formatting of the compound to be available]
    3. Perform a more complex lookup query [least efficient]


    Option 1: Modify the database
    If it is possible for you to do so, modifying the database will probably give you the best performance possible.
    Step 1. Modify every entry in your database to standardise the format of the entries.
    Step 2. When a user searches, standardise the format of their query and search the database.

    This approach has the benefit of a query that is much more efficient. If you need to update your database to add more values in the future, you can standardise them as they are inserted.


    Option 2: Modify the database, but don’t change the value of your VARCHAR field
    It may be the case that you can’t (or don’t want) to standardise the values in your VARCHAR field. In this case, I would recommend adding another column with the standardised value and adding an index on that column.
    For example, if your existing table was in the format

    Code:
    ┌────────┬───────┬──────────────────────────────┐
    │ Keys   │  ID   │   Compounds                  │
    ├────────┼───────┼──────────────────────────────┤
    │ Values │  1    │   2-hydroxy propionic acid   │
    │        ├───────┼──────────────────────────────┤
    │        │  2    │   2-Hydroxybutyric-acid      │
    │        ├───────┼──────────────────────────────┤
    │        │  3    │   4-Hydroxybenzaldehyde      │
    │        ├───────┼──────────────────────────────┤
    │        │  4    │   3-ethylhexane              │
    └────────┴───────┴──────────────────────────────┘

    You could add a new column with the standardised version, add an index on that column, and search for that value. Standardisation could consist of anything you deem appropriate, such as converting all characters to lowercase, leading/trailing/double spaces, and replacing spaces with hyphens. This would increase your data size a bit.
    E.g.,
    Code:
    ┌────────┬───────┬─────────────────────────────┬─────────────────────────────┐
    │ Keys   │  ID   │   Compounds                 │   Standardised              │
    ├────────┼───────┼─────────────────────────────┼─────────────────────────────┤
    │ Values │  1    │   2-hydroxy propionic acid  │   2-hydroxy-propionic-acid  │
    │        ├───────┼─────────────────────────────┼─────────────────────────────┤
    │        │  2    │   2-Hydroxybutyric-acid     │   2-hydroxybutyric-acid     │
    │        ├───────┼─────────────────────────────┼─────────────────────────────┤
    │        │  3    │   4-Hydroxybenzaldehyde     │   4-hydroxybenzaldehyde     │
    │        ├───────┼─────────────────────────────┼─────────────────────────────┤
    │        │  4    │   3-ethylhexane             │   3-ethylhexane             │
    └────────┴───────┴─────────────────────────────┴─────────────────────────────┘

    Option 3: Don’t modify your data, and perform a more complex query
    If you can’t standardise your data, you may be able to use just an SQL query. However, this will be much more computationally expensive. The exact structure of that command will depend on what type of database you’re using.

    Step 1. The user enters their search term
    Step 2. You standardise their search term (2-hydroxy propionic acid ⟶ 2-hydroxy-propionic-acid)
    Step 3. You search the database using the standardised value


    SELECT * FROM compounds WHERE REPLACE(compound, ' ', '-') = ‘2-hydroxy-propionic-acid’

    However, this is very inefficient as REPLACE will have to be executed against every value in the database, every time a search is performed. If your database is small, it will probably be fine. If not, you could also consider adding a substring index using the first portion of the compound name.

  4. The Following User Says Thank You to keyboard For This Useful Post:

    james438 (04-27-2021)

Similar Threads

  1. Replies: 2
    Last Post: 06-24-2013, 05:03 PM
  2. replace the second char in a string.
    By james438 in forum PHP
    Replies: 6
    Last Post: 02-27-2008, 06:22 PM
  3. Treat string as a function
    By blm126 in forum JavaScript
    Replies: 4
    Last Post: 10-19-2006, 03:59 AM
  4. Using a query string to change css 'ID' name
    By jim28100 in forum JavaScript
    Replies: 1
    Last Post: 10-13-2006, 08:02 AM
  5. using url like a php query string
    By Agrajag in forum JavaScript
    Replies: 8
    Last Post: 05-27-2006, 06:15 PM

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •