Thread: php string language

  1. #1
    Join Date
    Nov 2007
    Posts
    151
    Thanks
    67
    Thanked 0 Times in 0 Posts

    php string language

    Hi,

    How can I check whether a PHP string includes English characters?
    My site is in Hebrew, so I want to check whether a string is in Hebrew or English.

    Thanks

  2. #2
    Join Date
    Sep 2008
    Location
    Bristol - UK
    Posts
    842
    Thanks
    32
    Thanked 132 Times in 131 Posts

    There may be a function already made for this, but here's something that might work:

    PHP Code:
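    <?php
    // Lowercase Latin letters to look for; stripos() makes the search case-insensitive.
    $letters = array('a','b','c','d','e','f','g','h',
                     'i','j','k','l','m','n','o','p',
                     'q','r','s','t','u','v','w','x','y','z');

    $string = 'Atlas';

    $has_english = false;

    foreach ($letters as $letter) {

        // stripos() returns 0 for a match at the very start of the string,
        // so compare against false explicitly instead of relying on truthiness.
        if (stripos($string, $letter) !== false) {

            $has_english = true;
            break;

        }

    }

    echo 'The string ';

    echo ($has_english) ? 'contains' : 'does not contain';

    echo ' English characters.'; // Outputs "The string contains English characters."
    ?>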
    Not sure if this is what you want: it only tells you whether the string contains at least one English character, so it's not foolproof. I'm also not sure if Hebrew uses any of the characters in the Greek alphabet at any point in the language.

  3. #3
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,162
    Thanks
    263
    Thanked 690 Times in 678 Posts

    Schmoopy, this isn't a major point, but note that the Greek alphabet is a separate system (alpha α, beta β, gamma γ, etc.). The Latin (or Roman) alphabet is the one used for writing most European languages, including English. (A third system, the Cyrillic alphabet, is used for Slavic languages like Russian. As a bit of trivia, the Greek alphabet is the script from which the other two are derived, and it can be traced all the way back to the Phoenician script, which in turn is related to Egyptian hieroglyphs. In fact, a large share of the writing systems in the world, including Hebrew and Arabic, are derived from Phoenician. The major exceptions are Chinese (and related systems) and the ancient Mayan system. But that's off topic.)

    -----------

    d-machine, Schmoopy's answer gives you a basic check: if the string has any English letters at all, call it English. That's not a guarantee, because I'm guessing a string might sometimes contain both. Checking for Hebrew characters instead might be slightly more accurate, but the same problem could occur. For that reason, the best answer might be to count the English and the Hebrew characters in the string and see which there are more of. Even that might not be completely accurate in a case such as:
    The man said in Hebrew: ".......long Hebrew quote......."
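A rough sketch of the counting idea, assuming the string is valid UTF-8. The function name `dominant_script` is made up for illustration, and only the letters alef through tav (U+05D0 to U+05EA) and plain A-Z are counted:

```php
<?php
// Hypothetical helper: counts Hebrew and English letters and reports
// which script has more. Assumes $text is valid UTF-8.
function dominant_script(string $text): string
{
    // preg_match_all() returns the number of matches (or false on error,
    // e.g. invalid UTF-8 with the /u modifier, hence the int cast).
    $hebrew = (int) preg_match_all('/[\x{05D0}-\x{05EA}]/u', $text);
    $latin  = (int) preg_match_all('/[A-Za-z]/', $text);

    if ($hebrew === 0 && $latin === 0) {
        return 'unknown'; // no letters of either script
    }

    // Assumption: a tie falls back to English.
    return $hebrew > $latin ? 'hebrew' : 'english';
}
```

As the quote example shows, a short English lead-in followed by a long Hebrew quote would come out as Hebrew here, which may or may not be what you want.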

    What is the purpose of this code? I'm guessing you are taking user input and trying to format it correctly on the page, using left-to-right or right-to-left direction for the HTML elements?
    One option would be to split it up by paragraphs (line breaks) and check the first symbol* of each paragraph to see if it's English or Hebrew.
    (*skipping past punctuation to find a real letter, perhaps)
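Here is one way the per-paragraph check could look, as a sketch only. It assumes UTF-8 input, the function name `paragraph_directions` is hypothetical, and defaulting to left-to-right when a paragraph has no letters is my own assumption:

```php
<?php
// Hypothetical helper: returns 'rtl' or 'ltr' for each paragraph, based on
// the first Hebrew or English letter found in it. Assumes valid UTF-8.
function paragraph_directions(string $text): array
{
    $directions = [];

    // Split on line breaks and drop empty paragraphs.
    $paragraphs = preg_split('/\R+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    foreach ($paragraphs as $paragraph) {
        // The character class skips punctuation, digits and spaces,
        // so the first match is the first real letter.
        if (preg_match('/[A-Za-z\x{05D0}-\x{05EA}]/u', $paragraph, $m) === 1) {
            $isHebrew = preg_match('/[\x{05D0}-\x{05EA}]/u', $m[0]) === 1;
            $directions[] = $isHebrew ? 'rtl' : 'ltr';
        } else {
            $directions[] = 'ltr'; // assumption: default when no letter is found
        }
    }

    return $directions;
}
```

Each result could then be used for the `dir` attribute of the element that wraps that paragraph.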

    There is a complication, at least when dealing with Hebrew: what character encoding are you using, and are you sure it's being used consistently? The best solution is to use Unicode (UTF-8, for example), because it is standardized and lets you know exactly what each character is. But if you are mixing encodings or using a different one, any code designed to check for Hebrew Unicode characters won't work at all.
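Before running any Unicode-based check, it may be worth confirming that the input really is UTF-8. A minimal sketch, with a made-up function name; it uses `mb_check_encoding()` when the mbstring extension is available and otherwise falls back to a `preg_match()` call with the `/u` modifier:

```php
<?php
// Hypothetical helper: true if $text is well-formed UTF-8.
function is_valid_utf8(string $text): bool
{
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($text, 'UTF-8');
    }

    // Fallback without mbstring: with the /u modifier, preg_match()
    // returns false (not 0) when the subject is not valid UTF-8.
    return preg_match('//u', $text) === 1;
}
```

This only tells you the bytes are valid UTF-8, not which legacy encoding a failing string was in, so what to do with a failure (reject, convert, log) is a separate decision.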

    The most accurate way to check whether there is Hebrew, assuming Unicode, is to see if the string contains any characters in the Unicode range for Hebrew (U+0590 to U+05FF).
    http://en.wikipedia.org/wiki/Unicode...ebrew_alphabet
    You can use the same method for English if you want, using the Unicode range for the letters rather than typing them out individually. Note that you might want to include capital letters in addition to lowercase.
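A sketch of the range check for both scripts, assuming the string is valid UTF-8. The function names are made up for illustration; the Hebrew range is the Unicode Hebrew block (U+0590 to U+05FF), and the English check covers both capital and lowercase letters:

```php
<?php
// Hypothetical helpers; both assume $text is valid UTF-8.

// True if the string contains at least one character from the
// Unicode Hebrew block (U+0590..U+05FF).
function contains_hebrew(string $text): bool
{
    return preg_match('/[\x{0590}-\x{05FF}]/u', $text) === 1;
}

// True if the string contains at least one of the 26 English letters,
// upper or lower case.
function contains_english(string $text): bool
{
    return preg_match('/[A-Za-z]/', $text) === 1;
}
```

For a mixed string both functions return true, so you still need a rule such as counting, or checking the first letter, to pick one.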


    However, maybe the best solution for all of this is to use a third-party language guesser. A good one is part of Google's language API (and Google Translate):
    http://code.google.com/intl/en-US/apis/language/
    There's some info in this discussion here:
    http://stackoverflow.com/questions/1...-string-in-php
    You can use JavaScript, as that discussion suggests, or you can use PHP with the newer version of the API:
    http://code.google.com/intl/en-US/ap...n_snippets_php

    If you do use Google, remember that it might guess any language. So I think the best approach is to check whether the guess is Hebrew and, if not, assume English.
    Last edited by djr33; 04-02-2011 at 07:14 PM.
    Daniel - Freelance Web Design | <?php?> | <html>| español | Deutsch | italiano | português | català | un peu de français | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  4. #4
    Join Date
    Sep 2008
    Location
    Bristol - UK
    Posts
    842
    Thanks
    32
    Thanked 132 Times in 131 Posts

    Ah, sorry, I meant to say the (Latin?) alphabet... whichever one is A-Z. I thought it was Greek, but it seems I'm mistaken.

  5. #5
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,162
    Thanks
    263
    Thanked 690 Times in 678 Posts

    I think we all understood what you meant. I was just adding that for clarity. Yes, "Latin" is the most common term for it. (That's still a little misleading, though: Latin only had capital letters, and not all 26 that we use in English; u and w, for example, came later. Other languages that use the "Latin" alphabet may not have all of those, or may have some extras, like ç or ø or ß. Regardless, checking for those 26 will usually tell you that it's some European language, but not much more than that. I think it should officially be called the "English alphabet", but no one actually uses that term. My reason is that computer systems were designed around English. So in a broad sense "Latin" is the better term, but specifically for computer encodings "English" seems more accurate. That's why a symbol like ß sits somewhere completely different in Unicode, even though in German it's just as "normal" as any other letter, like 's' or 't'.)
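To make the difference concrete, here is a small sketch (hypothetical function names, UTF-8 input assumed) that contrasts a check for the 26 English letters with a check for any letter of the Latin script, which also covers letters such as the German sharp s (ß):

```php
<?php
// Hypothetical helpers; both assume $text is valid UTF-8.

// Matches only the 26 unaccented letters A-Z / a-z.
function has_basic_latin_letter(string $text): bool
{
    return preg_match('/[A-Za-z]/', $text) === 1;
}

// Matches any character that Unicode assigns to the Latin script,
// which includes accented and extra letters used by other languages.
function has_latin_script_letter(string $text): bool
{
    return preg_match('/\p{Latin}/u', $text) === 1;
}
```

For telling Hebrew from English the first check is probably enough; the second only matters if text in other European languages may show up.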

    [This is what happens when a linguist answers a programming question, haha. Sorry for being off topic. Some of this actually might be important if you're trying to separate European languages, but if you're just comparing to Hebrew as in this question it's probably mostly irrelevant.]
    Last edited by djr33; 04-02-2011 at 10:26 PM.
