
Originally Posted by
reyhan
I have a table in which a VARCHAR(255) field could have "-" dahs char used as a blank space because it reports chemical compounds retrieved from external DBs. The same compounds could have been written without dash char(s) and a blank space instead of them or a subset of them. For instance, the compound "2-hydroxy-propionic acid" could have been written
- "2-hydroxy propionic acid"
- "2 hydroxy-propionic acid"
- "2 hydroxy propionic acid"
All of them are proper writings and they refer to the same compound. Is there an elegant way to SELECT the stored field "2-hydroxy-propionic acid" if the search string is one of the other 3? That is just an example, the dash char could be just one or even more than 2.
Hi reyhan, and welcome to the forums! The answer to your question depends on several factors, so I’ve provided some different potential solutions for you below. If you can specify what environment you’re working in (what programming language, what database type), you may be able to get more specific advice. Are you trying to build a system that will allow people to search for compounds? If not, use a query with regex, such as what James provided.
Options:
- Modify the contents of your database [most efficient]
- Modify the contents of your database, but leave the original values intact, and add an index on the [only use if you want the original formatting of the compound to be available]
- Perform a more complex lookup query [least efficient]
Option 1: Modify the database
If it is possible for you to do so, modifying the database will probably give you the best performance possible.
Step 1. Modify every entry in your database to standardise the format of the entries.
Step 2. When a user searches, standardise the format of their query and search the database.
This approach has the benefit of a query that is much more efficient. If you need to update your database to add more values in the future, you can standardise them as they are inserted.
Option 2: Modify the database, but don’t change the value of your VARCHAR field
It may be the case that you can’t (or don’t want) to standardise the values in your VARCHAR field. In this case, I would recommend adding another column with the standardised value and adding an index on that column.
For example, if your existing table was in the format
Code:
┌────────┬───────┬──────────────────────────────┐
│ Keys │ ID │ Compounds │
├────────┼───────┼──────────────────────────────┤
│ Values │ 1 │ 2-hydroxy propionic acid │
│ ├───────┼──────────────────────────────┤
│ │ 2 │ 2-Hydroxybutyric-acid │
│ ├───────┼──────────────────────────────┤
│ │ 3 │ 4-Hydroxybenzaldehyde │
│ ├───────┼──────────────────────────────┤
│ │ 4 │ 3-ethylhexane │
└────────┴───────┴──────────────────────────────┘
You could add a new column with the standardised version, add an index on that column, and search for that value. Standardisation could consist of anything you deem appropriate, such as converting all characters to lowercase, leading/trailing/double spaces, and replacing spaces with hyphens. This would increase your data size a bit.
E.g.,
Code:
┌────────┬───────┬─────────────────────────────┬─────────────────────────────┐
│ Keys │ ID │ Compounds │ Standardised │
├────────┼───────┼─────────────────────────────┼─────────────────────────────┤
│ Values │ 1 │ 2-hydroxy propionic acid │ 2-hydroxy-propionic-acid │
│ ├───────┼─────────────────────────────┼─────────────────────────────┤
│ │ 2 │ 2-Hydroxybutyric-acid │ 2-hydroxybutyric-acid │
│ ├───────┼─────────────────────────────┼─────────────────────────────┤
│ │ 3 │ 4-Hydroxybenzaldehyde │ 4-hydroxybenzaldehyde │
│ ├───────┼─────────────────────────────┼─────────────────────────────┤
│ │ 4 │ 3-ethylhexane │ 3-ethylhexane │
└────────┴───────┴─────────────────────────────┴─────────────────────────────┘
Option 3: Don’t modify your data, and perform a more complex query
If you can’t standardise your data, you may be able to use just an SQL query. However, this will be much more computationally expensive. The exact structure of that command will depend on what type of database you’re using.
Step 1. The user enters their search term
Step 2. You standardise their search term (2-hydroxy propionic acid ⟶ 2-hydroxy-propionic-acid)
Step 3. You search the database using the standardised value
SELECT * FROM compounds WHERE REPLACE(compound, ' ', '-') = ‘2-hydroxy-propionic-acid’
However, this is very inefficient as REPLACE will have to be executed against every value in the database, every time a search is performed. If your database is small, it will probably be fine. If not, you could also consider adding a substring index using the first portion of the compound name.
Bookmarks