What is the best way to find 3-word groups in text? – PHP


Q(Question):

I am writing a little script that will help authors improve their writing skills by
finding repeated phrases in the text.

The text of a chapter will average about 10,000 words; however, I could
reduce the size of the files if it is better to do so.

So the idea is to search through a string and find repeats of any 3- or 4-word group.

So if the author has repeated the phrase "then I went" 6 times in the text, then this would be found and highlighted.

I am not sure where to start with this 😮

Maybe it is best to start by converting the string into an array of all the words?


$word_list = explode(" ", $text);

But I still don’t know the best way to find these repeated 3- or 4-word phrases.

The other thing I want to provide is a list of all the words used (maybe I will exclude words like "and", "the", "a", etc.) and the number of times each is used.

Any good ideas on how I should proceed?

Thanks

A(Answer):

Maybe using regular expressions?
Something like this (to show the general idea):

// matches 3 or 4 word groups up to 5 letters per word
"#((?:\b\w{1,5}\b\s+){3,4})#"
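To see what that pattern actually returns, here is a small sketch using preg_match_all(). The sample sentence is made up for illustration. Note that the matches do not overlap, so on its own this pattern chops the text into consecutive groups and does not report which groups repeat.

```php
<?php
// The pattern from the post above, in single quotes so the backslashes reach PCRE unchanged.
$pattern = '#((?:\b\w{1,5}\b\s+){3,4})#';

// Made-up sample text (an assumption, not from the thread).
$text = "then I went home and then I went out";

// Collect every non-overlapping match; $m[1] holds the captured word groups.
preg_match_all($pattern, $text, $m);

print_r($m[1]);
// The two groups found are "then I went home " and "and then I went ".
// The final word "out" is skipped because no whitespace follows it.
```

Because each match consumes its words, the second "then I went" is found only as part of a different group, so counting repeats still needs a further step.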

A(Answer):

Yep,
I guessed it might require regex, but I left the question
open in case there is a method that is less cpu intensive.
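One non-regex possibility, offered only as a sketch: slide a window of N words over the word list and count each window as a phrase. The function name count_phrases() is hypothetical, and the splitting rule assumes plain ASCII English text.

```php
<?php
// Hypothetical helper: count every run of $n consecutive words in $text
// and return only the phrases that occur more than once.
function count_phrases(string $text, int $n = 3): array
{
    // Lowercase, then split on anything that is not a letter, digit or apostrophe.
    // Assumption: ASCII text; accented letters would need a Unicode-aware pattern.
    $words = preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $counts = [];
    $last = count($words) - $n;
    for ($i = 0; $i <= $last; $i++) {
        // Join the $n words starting at position $i into one phrase key.
        $phrase = implode(' ', array_slice($words, $i, $n));
        $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
    }

    // Keep phrases seen at least twice, most frequent first.
    $repeated = array_filter($counts, function ($c) {
        return $c > 1;
    });
    arsort($repeated);
    return $repeated;
}

// Made-up sample text for illustration.
$sample = "Then I went home. Later then I went out, and then I went to bed.";
print_r(count_phrases($sample, 3));
// Only "then i went" repeats here, with a count of 3.
```

This makes a single pass over the words, so a 10,000-word chapter means roughly 10,000 small array operations per phrase length. Call it once with 3 and once with 4 to cover both group sizes.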

Thanks for your example, it will be useful as I am still not all that
good with regex.

What would be the best approach to count up all the different words?

A(Answer):

@jeddiki

even if there is, what if the follow-up processes eat up that saved memory/workload/whatever?

@jeddiki

get all single words into an array
(lowercase)
array_unique()
count()
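Those four steps could look roughly like this. The sample sentence is made up, and splitting on whitespace is a simplifying assumption (punctuation is not stripped here).

```php
<?php
// Made-up sample text for illustration.
$text = "the cat sat on the mat and the dog sat too";

// Steps 1 and 2: lowercase the text and split it into single words.
$words = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

// Step 3: drop duplicates so each word appears once.
$distinct = array_unique($words);

// Step 4: count the words that remain.
echo count($words), " words, ", count($distinct), " distinct\n";
// prints "11 words, 8 distinct"
```

This gives the number of different words, not how often each one occurs.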

A(Answer):

Thanks for the pointers 🙂

I will follow them up and get some code down.

A(Answer):

Hi,

I have been playing about with the resulting word list for a while, but I cannot work out how to get the number of times each word occurs in an array.

For example

$words = "Mary Had A Little Lamb and She LOVED It So much she had a fit and killed the lamb. She also loved lamb chops you see";

First I would do this:

$words = strtolower($words);
...
$list = explode(" ", $words);

From here what would you recommend I do to get this:

mary 1
little 1
it 1
so 1
much 1
fit 1
killed 1
also 1
chops 1
you 1
see 1

a 2
had 2
and 2
loved 2

lamb 3
she 3

Any ideas?

A(Answer):

array_count_values() (did I mention that searching the manual is the first step?)
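A minimal sketch of that suggestion applied to the sample sentence. Using str_word_count() in place of explode() is a swap on my part, so that "lamb." and "lamb" count as the same word. The stop-word list is a hypothetical example.

```php
<?php
$words = "Mary Had A Little Lamb and She LOVED It So much she had a fit and killed the lamb. She also loved lamb chops you see";

// Lowercase, then extract the words; format 1 returns them as an array
// and drops punctuation such as the full stop after "lamb".
$list = str_word_count(strtolower($words), 1);

// Map each word to the number of times it occurs.
$counts = array_count_values($list);

// Sort by count, lowest first, keeping the word keys.
asort($counts);

foreach ($counts as $word => $n) {
    echo "$word $n\n";
}

// Optional: remove common words. This list is only an example.
$stop = ['and', 'the', 'a'];
$filtered = array_diff_key($counts, array_flip($stop));
```

With this input, "lamb" and "she" come out at 3, and "a", "had", "and" and "loved" at 2. Every other word, including "the", comes out at 1.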
